I just tested your pr
On 25 Apr 2015 10:18, "Ali Bajwa" <ali.ba...@gmail.com> wrote:

> Any ideas on this? Any sample code to join 2 data frames on two columns?
>
> Thanks
> Ali
>
> On Apr 23, 2015, at 1:05 PM, Ali Bajwa <ali.ba...@gmail.com> wrote:
>
> > Hi experts,
> >
> > Sorry if this is a n00b question or has already been answered...
> >
> > I am trying to use the DataFrame API in Python to join 2 DataFrames
> > on more than 1 column. The example I've seen in the documentation
> > only shows a single column, so I tried this:
> >
> > ****Example code****
> >
> > import pandas as pd
> > from pyspark.sql import SQLContext
> > hc = SQLContext(sc)
> > A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
> > '12', '12'], 'value': [100, 200, 300]})
> > a = hc.createDataFrame(A)
> > B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
> > 'value': [101, 102]})
> > b = hc.createDataFrame(B)
> >
> > print "Pandas"  # try with Pandas
> > print A
> > print B
> > print pd.merge(A, B, on=['year', 'month'], how='inner')
> >
> > print "Spark"
> > print a.toPandas()
> > print b.toPandas()
> > print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> >
> >
> > *****Output****
> >
> > Pandas
> >  month  value  year
> > 0     5    100  1993
> > 1    12    200  2005
> > 2    12    300  1994
> >
> >  month  value  year
> > 0    12    101  1993
> > 1    12    102  1993
> >
> > Empty DataFrame
> > Columns: [month, value_x, year, value_y]
> > Index: []
> >
> > Spark
> >  month  value  year
> > 0     5    100  1993
> > 1    12    200  2005
> > 2    12    300  1994
> >
> >  month  value  year
> > 0    12    101  1993
> > 1    12    102  1993
> >
> >  month  value  year month  value  year
> > 0    12    200  2005    12    102  1993
> > 1    12    200  2005    12    101  1993
> > 2    12    300  1994    12    102  1993
> > 3    12    300  1994    12    101  1993
> >
> > It looks like Spark returns some results where an inner join should
> > return nothing.
> >
> > Am I doing the join with two columns in the wrong way? If yes, what is
> > the right syntax for this?
> >
> > Thanks!
> > Ali
>

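For what it's worth, the Spark output quoted above is consistent with the join being evaluated on month alone: every row of a with month 12 is paired with every row of b, regardless of year. Python's `and` cannot be overloaded, so `a.year==b.year and a.month==b.month` does not build a combined condition. Assuming the left comparison is treated as truthy, the whole expression evaluates to just `a.month==b.month`. The `&` operator, with each comparison in parentheses, is the usual way to combine column conditions. Below is a minimal sketch in plain Python; the `Expr` class is a hypothetical stand-in for a column expression, not the pyspark API, and only illustrates the operator behaviour.

```python
# Toy stand-in for a column expression. This is NOT pyspark's Column class;
# it only mimics the idea that == and & build expressions instead of booleans.
class Expr(object):
    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Comparison returns a new expression rather than True/False.
        return Expr("(%s = %s)" % (self.text, other.text))

    def __and__(self, other):
        # The & operator keeps both operands in the result.
        return Expr("(%s AND %s)" % (self.text, other.text))

    def __repr__(self):
        return self.text


a_year, b_year = Expr("a.year"), Expr("b.year")
a_month, b_month = Expr("a.month"), Expr("b.month")

# `and` is evaluated by Python itself: the left operand is an ordinary
# (truthy) object, so the result is simply the right operand.
with_and = (a_year == b_year) and (a_month == b_month)

# `&` dispatches to __and__, so both conditions survive.
# The parentheses matter because & binds more tightly than ==.
with_amp = (a_year == b_year) & (a_month == b_month)

print(with_and)
print(with_amp)

# With the real DataFrames from the question, the analogous call would be:
#   a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')
```

As far as I know, newer Spark releases raise an error when a column is used in a boolean context instead of silently dropping a condition, but that depends on the version you are running.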