The DataFrame issue has been fixed in Spark 1.5. See SPARK-7990
<https://issues.apache.org/jira/browse/SPARK-7990> and the Stack Overflow
answer "Spark: specify multiple column conditions for DataFrame join"
<http://stackoverflow.com/a/31889190/1344789>.
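
As a quick sketch of the 1.5+ API (the list-of-names form of join() is what
SPARK-7990 added; YM1DF and YM2DF refer to the example DataFrames further
down this thread):

    # Equi-join on both yr and mn by passing the join column names as a list.
    YMJN15 = YM1DF.join(YM2DF, ['yr', 'mn'], 'inner')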

On Tue, Apr 28, 2015 at 12:55 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:

> Thanks again Ayan! To close the loop on this issue, I have filed the below
> JIRA to track the issue:
> https://issues.apache.org/jira/browse/SPARK-7197
>
>
>
> On Fri, Apr 24, 2015 at 8:21 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I just tested, and your observation about the DataFrame API is correct:
>> it behaves incorrectly when joining on multiple columns. (Maybe we should
>> file a JIRA?)
>>
>> Solution: you can fall back to the good old composite-key concatenation
>> method. Not ideal, but it works as a workaround. (Of course you can use
>> real SQL as well, as shown below.)
>>
>> Set up the data (assuming sc is an existing SparkContext; the imports
>> were not in the original message):
>>
>>     from pyspark.sql import SQLContext, Row
>>
>>     ssc = SQLContext(sc)
>>     a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,10000]]
>>     b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
>>     YM1 = sc.parallelize(a).map(lambda tup: Row(
>>         yr=int(tup[0]), mn=int(tup[1]), price=int(tup[2]),
>>         joiningKey=str(tup[0])+"~"+str(tup[1])))
>>     YM2 = sc.parallelize(b).map(lambda tup: Row(
>>         yr=int(tup[0]), mn=int(tup[1]), name=str(tup[2]),
>>         joiningKey=str(tup[0])+"~"+str(tup[1])))
>>     print YM1.collect()
>>     print YM2.collect()
>>
>>     YM1DF = ssc.createDataFrame(YM1)
>>     YM2DF = ssc.createDataFrame(YM2)
>>
>>     YM1DF.printSchema()  # printSchema() prints directly and returns None
>>     YM2DF.printSchema()
>>    ------------------------------------------------
>> This DOES NOT WORK:
>>
>>     YMJN = YM1DF.join(YM2DF,
>>                       YM1DF.yr == YM2DF.yr and YM1DF.mn == YM2DF.mn,
>>                       "inner")
>>     YMJN.printSchema()
>>     for l in YMJN.collect():
>>         print l
>>
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=10000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=10000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>>
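>> Why this fails: Python's `and` cannot be overloaded, so `cond1 and cond2`
>> checks the truthiness of the first Column object and simply evaluates to
>> the second condition. The join above is therefore effectively on mn
>> alone, which is consistent with the nine rows printed above. A minimal
>> sketch of the fix (same DataFrames as above; YMJNF is just a new name for
>> the result) is to combine the conditions with the bitwise & operator,
>> which Column does overload. The parentheses are required because & binds
>> more tightly than ==:
>>
>>     # Both equality conditions reach the join, not just the last one.
>>     YMJNF = YM1DF.join(YM2DF,
>>                        (YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn),
>>                        "inner")
>>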
>> -----------------------------------------
>>
>> SQL solution: works as expected. (The AND here is parsed by Spark's SQL
>> parser, so it is not subject to Python's `and` semantics.)
>>
>>     YM1DF.registerTempTable("ymdf1")
>>     YM2DF.registerTempTable("ymdf2")
>>     YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 "
>>                     "on ymdf1.yr = ymdf2.yr and ymdf1.mn = ymdf2.mn")
>>     YMJNS.printSchema()
>>     for l in YMJNS.collect():
>>         print l
>>
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>>
>> -----------------------------------------------------------------
>>
>> Composite-key (field concatenation) method works as well:
>>
>>     YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey == YM2DF.joiningKey, "inner")
>>     YMJNA.printSchema()
>>     for l in YMJNA.collect():
>>         print l
>>
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
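>>
>> If the duplicated joiningKey/yr/mn columns are a problem downstream, one
>> option (a sketch, not from the original thread) is to select the
>> disambiguated columns through the parent DataFrames:
>>
>>     # Keep a single copy of each logical column by naming the source frame.
>>     YMJNA.select(YM1DF.yr, YM1DF.mn, YM1DF.price, YM2DF.name).collect()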
>>
>> On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>
>>> Any ideas on this? Any sample code to join 2 data frames on two columns?
>>>
>>> Thanks
>>> Ali
>>>
>>> On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>>
>>> > Hi experts,
>>> >
>>> > Sorry if this is a n00b question or has already been answered...
>>> >
>>> > I'm trying to use the DataFrame API in Python to join two DataFrames
>>> > on more than one column. The example I've seen in the documentation
>>> > only shows a single column, so I tried this:
>>> >
>>> > ****Example code****
>>> >
>>> > import pandas as pd
>>> > from pyspark.sql import SQLContext
>>> > hc = SQLContext(sc)
>>> > A = pd.DataFrame({'year': ['1993', '2005', '1994'],
>>> >                   'month': ['5', '12', '12'],
>>> >                   'value': [100, 200, 300]})
>>> > a = hc.createDataFrame(A)
>>> > B = pd.DataFrame({'year': ['1993', '1993'],
>>> >                   'month': ['12', '12'],
>>> >                   'value': [101, 102]})
>>> > b = hc.createDataFrame(B)
>>> >
>>> > print "Pandas"  # try with Pandas
>>> > print A
>>> > print B
>>> > print pd.merge(A, B, on=['year', 'month'], how='inner')
>>> >
>>> > print "Spark"
>>> > print a.toPandas()
>>> > print b.toPandas()
>>> > print a.join(b, a.year==b.year and a.month==b.month,
>>> >              'inner').toPandas()
>>> >
>>> >
>>> > ****Output****
>>> >
>>> > Pandas
>>> >  month  value  year
>>> > 0     5    100  1993
>>> > 1    12    200  2005
>>> > 2    12    300  1994
>>> >
>>> >  month  value  year
>>> > 0    12    101  1993
>>> > 1    12    102  1993
>>> >
>>> > Empty DataFrame
>>> >
>>> > Columns: [month, value_x, year, value_y]
>>> >
>>> > Index: []
>>> >
>>> > Spark
>>> >  month  value  year
>>> > 0     5    100  1993
>>> > 1    12    200  2005
>>> > 2    12    300  1994
>>> >
>>> >  month  value  year
>>> > 0    12    101  1993
>>> > 1    12    102  1993
>>> >
>>> > month  value  year month  value  year
>>> > 0    12    200  2005    12    102  1993
>>> > 1    12    200  2005    12    101  1993
>>> > 2    12    300  1994    12    102  1993
>>> > 3    12    300  1994    12    101  1993
>>> >
>>> > It looks like Spark returns some results where an inner join should
>>> > return nothing.
>>> >
>>> > Am I doing the join with two columns in the wrong way? If yes, what is
>>> > the right syntax for this?
>>> >
>>> > Thanks!
>>> > Ali
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
