Any ideas on this? Any sample code to join 2 data frames on two columns?

Thanks
Ali
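
[Editorial note: a minimal, hedged sketch of what is probably going wrong above. Python's `and` cannot be overloaded, so `a.year==b.year and a.month==b.month` does not build a two-part join expression: `and` only checks the truthiness of its left operand and then evaluates to one of the two operands, silently dropping the other condition. The `FakeColumn` class below is a hypothetical stand-in for Spark's `Column` (not pyspark itself), used only to show the mechanism:]

```python
# FakeColumn is a hypothetical stand-in for a Spark Column, which also
# overloads == to return an expression object rather than True/False.
class FakeColumn:
    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Like Spark's Column, == builds a comparison expression.
        return FakeColumn("(%s = %s)" % (self.expr, other.expr))

    def __and__(self, other):
        # & builds a combined expression -- this is the operator to use.
        return FakeColumn("(%s AND %s)" % (self.expr, other.expr))

year_match = FakeColumn("a.year") == FakeColumn("b.year")
month_match = FakeColumn("a.month") == FakeColumn("b.month")

# `and` just evaluates truthiness: year_match is a truthy object, so
# `year_match and month_match` evaluates to month_match -- the year
# condition is lost entirely.
cond_with_and = year_match and month_match
print(cond_with_and.expr)   # only the month comparison survives

# Combining with & (note the parentheses, which matter because & binds
# more tightly than ==) keeps both conditions.
cond_with_amp = year_match & month_match
print(cond_with_amp.expr)
```

If that is the cause, the corrected Spark join would likely be `a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')`; some Spark versions also accept a list of column names, e.g. `a.join(b, ['year', 'month'], 'inner')`.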

On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:

> Hi experts,
>
> Sorry if this is a n00b question or has already been answered...
>
> I'm trying to use the DataFrames API in Python to join two DataFrames
> on more than one column. The example I've seen in the documentation
> only shows a single join column - so I tried this:
>
> ****Example code****
>
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> '12', '12'], 'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> 'value': [101, 102]})
> b = hc.createDataFrame(B)
>
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
>
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
>
>
> *****Output****
>
> Pandas
>  month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>
>  month  value  year
> 0    12    101  1993
> 1    12    102  1993
>
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
>
> Spark
>  month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>
>  month  value  year
> 0    12    101  1993
> 1    12    102  1993
>
> month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
>
> It looks like Spark returns some results where an inner join should
> return nothing.
>
> Am I writing the two-column join condition the wrong way? If so, what
> is the right syntax for it?
>
> Thanks!
> Ali

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
