Thanks again, Ayan! To close the loop, I have filed the following JIRA to track this issue: https://issues.apache.org/jira/browse/SPARK-7197
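For anyone who finds this thread later: the root cause is Python's `and`, which cannot be overloaded, applied to Column objects. A minimal pure-Python sketch of why the first predicate silently disappears — the `Expr` class below is hypothetical, standing in for the result of a `pyspark.sql.Column` comparison:

```python
class Expr:
    """Hypothetical stand-in for a pyspark.sql.Column comparison result."""
    def __init__(self, text):
        self.text = text
    def __repr__(self):
        return self.text

left = Expr("a.year == b.year")
right = Expr("a.month == b.month")

# `and` cannot be overloaded: Python evaluates bool(left), and a plain
# object is truthy, so the whole expression evaluates to `right` alone.
# The join condition silently becomes just the month comparison.
cond = left and right
assert cond is right
print(cond)  # a.month == b.month
```

After SPARK-7197, converting a Column to bool raises an explicit error instead of silently misbehaving; the working syntax uses the overloadable bitwise operator with parentheses, e.g. `a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')`.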
On Fri, Apr 24, 2015 at 8:21 PM, ayan guha <guha.a...@gmail.com> wrote:

> I just tested; your observation about the DataFrame API is correct. It
> behaves weirdly in the case of a multiple-column join. (Maybe we should
> report a JIRA?)
>
> Solution: you can go back to our good old composite-key field
> concatenation method. Not ideal, but a workaround. (Of course you can use
> real SQL as well, as shown below.)
>
> Set up data:
>
> a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,10000]]
> b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
> YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>     price=int(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
> YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
>     name=str(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
> print YM1.collect()
> print YM2.collect()
>
> YM1DF = ssc.createDataFrame(YM1)
> YM2DF = ssc.createDataFrame(YM2)
>
> YM1DF.printSchema()
> YM2DF.printSchema()
> ------------------------------------------------
> This DOES NOT WORK ---
>
> YMJN = YM1DF.join(YM2DF,
>                   YM1DF.yr == YM2DF.yr and YM1DF.mn == YM2DF.mn,
>                   "inner")
> YMJN.printSchema()
> for l in YMJN.collect():
>     print l
>
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=10000, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'1994~1', mn=1, price=10000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>
> -----------------------------------------
>
> SQL solution - works as expected
>
> YM1DF.registerTempTable("ymdf1")
> YM2DF.registerTempTable("ymdf2")
> YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 "
>                 "on ymdf1.yr = ymdf2.yr and ymdf1.mn = ymdf2.mn")
> YMJNS.printSchema()
> for l in YMJNS.collect():
>     print l
>
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
>
> -----------------------------------------------------------------
>
> Field-concat method works as well...
>
> YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey == YM2DF.joiningKey, "inner")
> YMJNA.printSchema()
> for l in YMJNA.collect():
>     print l
>
> Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
> Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
> Row(joiningKey=u'2000~1', mn=1, price=10000, yr=2000, joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
>
> On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>
>> Any ideas on this? Any sample code to join 2 data frames on two columns?
>>
>> Thanks
>> Ali
>>
>> On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>>
>> > Hi experts,
>> >
>> > Sorry if this is a n00b question or has already been answered...
>> >
>> > I am trying to use the DataFrames API in Python to join 2 dataframes
>> > with more than 1 column.
>> > The example I've seen in the documentation only shows a single
>> > column - so I tried this:
>> >
>> > ****Example code****
>> >
>> > import pandas as pd
>> > from pyspark.sql import SQLContext
>> > hc = SQLContext(sc)
>> > A = pd.DataFrame({'year': ['1993', '2005', '1994'],
>> >                   'month': ['5', '12', '12'],
>> >                   'value': [100, 200, 300]})
>> > a = hc.createDataFrame(A)
>> > B = pd.DataFrame({'year': ['1993', '1993'],
>> >                   'month': ['12', '12'],
>> >                   'value': [101, 102]})
>> > b = hc.createDataFrame(B)
>> >
>> > print "Pandas"  # try with Pandas
>> > print A
>> > print B
>> > print pd.merge(A, B, on=['year', 'month'], how='inner')
>> >
>> > print "Spark"
>> > print a.toPandas()
>> > print b.toPandas()
>> > print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
>> >
>> > *****Output****
>> >
>> > Pandas
>> >   month  value  year
>> > 0     5    100  1993
>> > 1    12    200  2005
>> > 2    12    300  1994
>> >
>> >   month  value  year
>> > 0    12    101  1993
>> > 1    12    102  1993
>> >
>> > Empty DataFrame
>> > Columns: [month, value_x, year, value_y]
>> > Index: []
>> >
>> > Spark
>> >   month  value  year
>> > 0     5    100  1993
>> > 1    12    200  2005
>> > 2    12    300  1994
>> >
>> >   month  value  year
>> > 0    12    101  1993
>> > 1    12    102  1993
>> >
>> >   month  value  year  month  value  year
>> > 0    12    200  2005     12    102  1993
>> > 1    12    200  2005     12    101  1993
>> > 2    12    300  1994     12    102  1993
>> > 3    12    300  1994     12    101  1993
>> >
>> > It looks like Spark returns some results where an inner join should
>> > return nothing.
>> >
>> > Am I doing the join with two columns in the wrong way? If yes, what is
>> > the right syntax for this?
>> >
>> > Thanks!
>> > Ali
>
> --
> Best Regards,
> Ayan Guha
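To make the semantics in this thread concrete without a Spark cluster, here is a plain-Python sketch over Ayan's `a`/`b` data: the `and` expression degenerates to matching on the second predicate (`mn`) alone, giving the 9 rows seen in the broken output, while a genuine two-column match returns the 3 rows the SQL join produced. The PySpark spelling of the correct condition uses `&` with parentheses, e.g. `(YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn)`.

```python
# Data from Ayan's example: [yr, mn, price] and [yr, mn, name].
a = [[1993, 1, 100], [1993, 2, 200], [1994, 1, 1000],
     [1994, 3, 3000], [2000, 1, 10000]]
b = [[1993, 1, "A"], [1994, 1, "AA"], [2000, 1, "AAA"]]

# What the `and` expression leaves behind: only the mn predicate survives,
# so every mn=1 row of `a` pairs with every row of `b`.
broken = [(ra, rb) for ra in a for rb in b if ra[1] == rb[1]]

# A genuine two-column inner join on (yr, mn).
correct = [(ra, rb) for ra in a for rb in b
           if ra[0] == rb[0] and ra[1] == rb[1]]

print(len(broken))   # 9 -- matches the nine Rows in the broken join output
print(len(correct))  # 3 -- matches the SQL and field-concat results
```

(Here `ra[0] == rb[0]` yields a plain bool, so `and` is safe; the trap only exists when the operands are expression objects like Columns.)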