Thanks again, Ayan! To close the loop, I have filed the following JIRA to track this issue: https://issues.apache.org/jira/browse/SPARK-7197
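For anyone who finds this thread later: the root cause is Python's `and`, which cannot be overloaded, applied to Column objects. A minimal pure-Python sketch of why the first predicate silently disappears — the `Expr` class below is hypothetical, standing in for the result of a `pyspark.sql.Column` comparison:

```python
class Expr:
    """Hypothetical stand-in for a pyspark.sql.Column comparison result."""
    def __init__(self, text):
        self.text = text
    def __repr__(self):
        return self.text

left = Expr("a.year == b.year")
right = Expr("a.month == b.month")

# `and` cannot be overloaded: Python evaluates bool(left), and a plain
# object is truthy, so the whole expression evaluates to `right` alone.
# The join condition silently becomes just the month comparison.
cond = left and right
assert cond is right
print(cond)  # a.month == b.month
```

After SPARK-7197, converting a Column to bool raises an explicit error instead of silently misbehaving; the working syntax uses the overloadable bitwise operator with parentheses, e.g. `a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')`.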
On Fri, Apr 24, 2015 at 8:21 PM, ayan guha <guha.a...@gmail.com> wrote:

> I just tested; your observation about the DataFrame API is correct. It
> behaves weirdly in the case of a multiple-column join. (Maybe we should
> report a JIRA?)
>
> Solution: you can go back to our good old composite-key field
> concatenation method. Not ideal, but a workaround. (Of course you can use
> real SQL as well, as shown below.)
>
> Set up data:
>
> a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,10000]]
> b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
> YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>     price=int(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
> YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>     name=str(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
> print YM1.collect()
> print YM2.collect()
>
> YM1DF = ssc.createDataFrame(YM1)
> YM2DF = ssc.createDataFrame(YM2)
>
> YM1DF.printSchema()
> YM2DF.printSchema()
> ------------------------------------------------
> This DOES NOT WORK ---
>
> YMJN = YM1DF.join(YM2DF,
>                   YM1DF.yr == YM2DF.yr and YM1DF.mn == YM2DF.mn,
>                   "inner")
> YMJN.printSchema()
> for l in YMJN.collect():
>     print l
>
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=10000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=10000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>
> -----------------------------------------
>
> SQL solution - works as expected
>
> YM1DF.registerTempTable("ymdf1")
> YM2DF.registerTempTable("ymdf2")
> YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 "
>                 "on ymdf1.yr = ymdf2.yr and ymdf1.mn = ymdf2.mn")
> YMJNS.printSchema()
> for l in YMJNS.collect():
>     print l
>
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>
> -----------------------------------------------------------------
>
> Field-concat method works as well...
>
> YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey == YM2DF.joiningKey, "inner")
> YMJNA.printSchema()
> for l in YMJNA.collect():
>     print l
>
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>
> On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>
>> Any ideas on this? Any sample code to join 2 data frames on two columns?
>>
>> Thanks
>> Ali
>>
>> On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>
>> > Hi experts,
>> >
>> > Sorry if this is a n00b question or has already been answered...
>> >
>> > I am trying to use the DataFrames API in Python to join 2 dataframes
>> > with more than 1 column.
>> > The example I've seen in the documentation only shows a single
>> > column - so I tried this:
>> >
>> > ****Example code****
>> >
>> > import pandas as pd
>> > from pyspark.sql import SQLContext
>> > hc = SQLContext(sc)
>> > A = pd.DataFrame({'year': ['1993', '2005', '1994'],
>> >                   'month': ['5', '12', '12'],
>> >                   'value': [100, 200, 300]})
>> > a = hc.createDataFrame(A)
>> > B = pd.DataFrame({'year': ['1993', '1993'],
>> >                   'month': ['12', '12'],
>> >                   'value': [101, 102]})
>> > b = hc.createDataFrame(B)
>> >
>> > print "Pandas"  # try with Pandas
>> > print A
>> > print B
>> > print pd.merge(A, B, on=['year', 'month'], how='inner')
>> >
>> > print "Spark"
>> > print a.toPandas()
>> > print b.toPandas()
>> > print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
>> >
>> > *****Output****
>> >
>> > Pandas
>> >   month  value  year
>> > 0     5    100  1993
>> > 1    12    200  2005
>> > 2    12    300  1994
>> >
>> >   month  value  year
>> > 0    12    101  1993
>> > 1    12    102  1993
>> >
>> > Empty DataFrame
>> > Columns: [month, value_x, year, value_y]
>> > Index: []
>> >
>> > Spark
>> >   month  value  year
>> > 0     5    100  1993
>> > 1    12    200  2005
>> > 2    12    300  1994
>> >
>> >   month  value  year
>> > 0    12    101  1993
>> > 1    12    102  1993
>> >
>> >   month  value  year  month  value  year
>> > 0    12    200  2005     12    102  1993
>> > 1    12    200  2005     12    101  1993
>> > 2    12    300  1994     12    102  1993
>> > 3    12    300  1994     12    101  1993
>> >
>> > It looks like Spark returns some results where an inner join should
>> > return nothing.
>> >
>> > Am I doing the join with two columns in the wrong way? If yes, what is
>> > the right syntax for this?
>> >
>> > Thanks!
>> > Ali
>
> --
> Best Regards,
> Ayan Guha
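To make the semantics in this thread concrete without a Spark cluster, here is a plain-Python sketch over Ayan's `a`/`b` data: the `and` expression degenerates to matching on the second predicate (`mn`) alone, giving the 9 rows seen in the broken output, while a genuine two-column match returns the 3 rows the SQL join produced. The PySpark spelling of the correct condition uses `&` with parentheses, e.g. `(YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn)`.

```python
# Data from Ayan's example: [yr, mn, price] and [yr, mn, name].
a = [[1993, 1, 100], [1993, 2, 200], [1994, 1, 1000],
     [1994, 3, 3000], [2000, 1, 10000]]
b = [[1993, 1, "A"], [1994, 1, "AA"], [2000, 1, "AAA"]]

# What the `and` expression leaves behind: only the mn predicate survives,
# so every mn=1 row of `a` pairs with every row of `b`.
broken = [(ra, rb) for ra in a for rb in b if ra[1] == rb[1]]

# A genuine two-column inner join on (yr, mn).
correct = [(ra, rb) for ra in a for rb in b
           if ra[0] == rb[0] and ra[1] == rb[1]]

print(len(broken))   # 9 -- matches the nine Rows in the broken join output
print(len(correct))  # 3 -- matches the SQL and field-concat results
```

(Here `ra[0] == rb[0]` yields a plain bool, so `and` is safe; the trap only exists when the operands are expression objects like Columns.)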