You are correct. I just found the same thing; you are better off with SQL,
then. Here is what I ran:

    userSchemaDF = ssc.createDataFrame(userRDD)   # ssc is the SQLContext; userRDD is built earlier in the script
    userSchemaDF.registerTempTable("users")
    #print userSchemaDF.take(10)

    # SQL API works as expected
    sortedDF = ssc.sql("SELECT userId, age, gender, work FROM users ORDER BY work, age")
    #print sortedDF.take(10)

    df1 = userSchemaDF.select("userId", "age")
    df1.registerTempTable("df1")
    df2 = userSchemaDF.select("userId", "work")
    df2.registerTempTable("df2")

    dfjs = ssc.sql("SELECT age, work FROM df1 d1 INNER JOIN df2 d2 ON d1.userId = d2.userId")
    print dfjs.count()

    # DF API does not
    dfj = df1.join(df2, df1.userId == df2.userId, "inner")
    print dfj.count()

Output: the SQL join returns 943 rows, the DataFrame join returns 889249.
Note that 889249 = 943^2, i.e. the DataFrame join is producing a full
Cartesian product.
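
For reference, one workaround often suggested for this kind of self-join
ambiguity is to rename the join key on one side before joining. A minimal,
untested sketch (userId_r is just an example name):

    # Hedged workaround sketch: rename the join key on one side so the
    # self-join condition is unambiguous (userId_r is an arbitrary name).
    df2r = df2.withColumnRenamed("userId", "userId_r")
    dfj2 = df1.join(df2r, df1.userId == df2r.userId_r, "inner")
    print dfj2.count()   # should match the SQL join count (943 here)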

On Wed, Apr 22, 2015 at 1:10 AM, Karlson <ksonsp...@siberie.de> wrote:

> Sorry, my code actually was
>
>     df_one = df.select('col1', 'col2')
>     df_two = df.select('col1', 'col3')
>
> But in Spark 1.4.0 this does not seem to make any difference anyway and
> the problem is the same with both versions.
>
>
>
> On 2015-04-21 17:04, ayan guha wrote:
>
>> Your code should be:
>>
>>  df_one = df.select('col1', 'col2')
>>  df_two = df.select('col1', 'col3')
>>
>> Your current code is generating a tuple, and of course df_one and df_two
>> are different, so the join is yielding a Cartesian product.
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonsp...@siberie.de> wrote:
>>
>>  Hi,
>>>
>>> can anyone confirm (and if so elaborate on) the following problem?
>>>
>>> When I join two DataFrames that originate from the same source DataFrame,
>>> the resulting DF will explode to a huge number of rows. A quick example:
>>>
>>> I load a DataFrame with n rows from disk:
>>>
>>>     df = sql_context.parquetFile('data.parquet')
>>>
>>> Then I create two DataFrames from that source.
>>>
>>>     df_one = df.select(['col1', 'col2'])
>>>     df_two = df.select(['col1', 'col3'])
>>>
>>> Finally I want to (inner) join them back together:
>>>
>>>     df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')
>>>
>>> The key in col1 is unique. The resulting DataFrame should have n rows,
>>> however it does have n*n rows.
>>>
>>> That does not happen when I load df_one and df_two from disk directly. I
>>> am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.
>>>


-- 
Best Regards,
Ayan Guha
