The DataFrame issue has been fixed in Spark 1.5. See SPARK-7990 <https://issues.apache.org/jira/browse/SPARK-7990> and the Stack Overflow answer "Spark: specify multiple column conditions for DataFrame join" <http://stackoverflow.com/a/31889190/1344789>.
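For reference, here is a minimal sketch of the kind of multi-column equi-join that SPARK-7990 enables (Spark 1.5+). It is written against current PySpark, so the SparkSession entry point stands in for the SQLContext ("ssc"/"hc") used in the thread below; the data mirrors the thread's example, and the variable names here are illustrative. Why the original `and`-based join condition misbehaves is explained after the quoted thread.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ym1 = spark.createDataFrame(
    [(1993, 1, 100), (1994, 1, 1000), (2000, 1, 10000)],
    ["yr", "mn", "price"])
ym2 = spark.createDataFrame(
    [(1993, 1, "A"), (1994, 1, "AA"), (2000, 1, "AAA")],
    ["yr", "mn", "name"])

# Passing a list of column names joins on all of them at once and keeps
# a single copy of each join key in the result.
ym1.join(ym2, ["yr", "mn"], "inner").show()

A side benefit of the list-of-names form is that the join keys appear only once in the output, whereas the Column-expression form keeps both copies (as the duplicated yr/mn/joiningKey fields in the thread's output illustrate).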
On Tue, Apr 28, 2015 at 12:55 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:

> Thanks again Ayan! To close the loop on this issue, I have filed the
> JIRA below to track it:
> https://issues.apache.org/jira/browse/SPARK-7197
>
> On Fri, Apr 24, 2015 at 8:21 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I just tested, and your observation about the DataFrame API is correct:
>> it behaves weirdly in the case of a multi-column join. (Maybe we should
>> report a JIRA?)
>>
>> Solution: you can go back to the good old composite-key field
>> concatenation method. Not ideal, but it is a workaround. (Of course you
>> can use real SQL as well, as shown below.)
>>
>> Set up the data:
>>
>> from pyspark.sql import Row
>>
>> a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,10000]]
>> b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
>> YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>>     price=int(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
>> YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>>     name=str(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
>> print YM1.collect()
>> print YM2.collect()
>>
>> YM1DF = ssc.createDataFrame(YM1)
>> YM2DF = ssc.createDataFrame(YM2)
>>
>> print YM1DF.printSchema()
>> print YM2DF.printSchema()
>>
>> ------------------------------------------------
>> This DOES NOT WORK:
>>
>> YMJN = YM1DF.join(YM2DF, YM1DF.yr==YM2DF.yr and YM1DF.mn==YM2DF.mn, "inner")
>> print YMJN.printSchema()
>> for l in YMJN.collect():
>>     print l
>>
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=10000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'1994~1', mn=1, price=10000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>>
>> -----------------------------------------
>> SQL solution - works as expected:
>>
>> YM1DF.registerTempTable("ymdf1")
>> YM2DF.registerTempTable("ymdf2")
>> YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 on ymdf1.yr=ymdf2.yr and ymdf1.mn=ymdf2.mn")
>> print YMJNS.printSchema()
>> for l in YMJNS.collect():
>>     print l
>>
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>>
>> -----------------------------------------------------------------
>> Field concat method works as well:
>>
>> YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey==YM2DF.joiningKey, "inner")
>> print YMJNA.printSchema()
>> for l in YMJNA.collect():
>>     print l
>>
>> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
>> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>>
>> On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>
>>> Any ideas on this? Any sample code to join two DataFrames on two
>>> columns?
>>>
>>> Thanks,
>>> Ali
>>>
>>> On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>>
>>> > Hi experts,
>>> >
>>> > Sorry if this is a n00b question or has already been answered...
>>> >
>>> > I am trying to use the DataFrames API in Python to join two DataFrames
>>> > on more than one column. The example I've seen in the documentation
>>> > only shows a single column, so I tried this:
>>> >
>>> > ****Example code****
>>> >
>>> > import pandas as pd
>>> > from pyspark.sql import SQLContext
>>> > hc = SQLContext(sc)
>>> > A = pd.DataFrame({'year': ['1993', '2005', '1994'],
>>> >                   'month': ['5', '12', '12'], 'value': [100, 200, 300]})
>>> > a = hc.createDataFrame(A)
>>> > B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
>>> >                   'value': [101, 102]})
>>> > b = hc.createDataFrame(B)
>>> >
>>> > print "Pandas"  # try with pandas first
>>> > print A
>>> > print B
>>> > print pd.merge(A, B, on=['year', 'month'], how='inner')
>>> >
>>> > print "Spark"
>>> > print a.toPandas()
>>> > print b.toPandas()
>>> > print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
>>> >
>>> > ****Output****
>>> >
>>> > Pandas
>>> >   month  value  year
>>> > 0     5    100  1993
>>> > 1    12    200  2005
>>> > 2    12    300  1994
>>> >
>>> >   month  value  year
>>> > 0    12    101  1993
>>> > 1    12    102  1993
>>> >
>>> > Empty DataFrame
>>> > Columns: [month, value_x, year, value_y]
>>> > Index: []
>>> >
>>> > Spark
>>> >   month  value  year
>>> > 0     5    100  1993
>>> > 1    12    200  2005
>>> > 2    12    300  1994
>>> >
>>> >   month  value  year
>>> > 0    12    101  1993
>>> > 1    12    102  1993
>>> >
>>> >   month  value  year month  value  year
>>> > 0    12    200  2005    12    102  1993
>>> > 1    12    200  2005    12    101  1993
>>> > 2    12    300  1994    12    102  1993
>>> > 3    12    300  1994    12    101  1993
>>> >
>>> > It looks like Spark returns results where an inner join should
>>> > return nothing.
>>> >
>>> > Am I doing the join on two columns the wrong way? If so, what is the
>>> > right syntax for this?
>>> >
>>> > Thanks!
>>> > Ali
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> --
>> Best Regards,
>> Ayan Guha
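Why the `and` form misbehaves: Python's boolean `and` cannot be overloaded, so `a.year==b.year and a.month==b.month` never builds a compound Column expression. `and` simply returns one of its operands based on truthiness, and on the Spark versions in this thread it silently reduced to the second comparison alone, which is exactly why both "DOES NOT WORK" outputs above are joins on the month column only (current PySpark raises `ValueError: Cannot convert column into bool` here instead of joining silently). The bitwise operators, by contrast, are overloaded to build compound expressions. A minimal sketch applying that fix to the example from the original question, assuming the same `a` and `b` DataFrames built in the quoted message:

# Corrected join condition: bitwise & in place of Python's `and`.
# The parentheses are required because & binds tighter than ==.
cond = (a.year == b.year) & (a.month == b.month)
print(a.join(b, cond, 'inner').toPandas())
# No (year, month) pair occurs in both inputs, so this inner join is
# empty, matching the pd.merge() result above. Use | for "or" and ~
# for "not" in the same way.

The list-of-column-names form shown at the top of this thread is equivalent for plain equi-joins and avoids the operator pitfall entirely.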