[ https://issues.apache.org/jira/browse/SPARK-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-7197:
-------------------------------
    Priority: Critical  (was: Major)

> Join with DataFrame Python API not working properly with more than 1 column
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-7197
>                 URL: https://issues.apache.org/jira/browse/SPARK-7197
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.1
>            Reporter: Ali Bajwa
>            Priority: Critical
>
> It looks like join in the DataFrame Python API does not return correct
> results when joining on 2 or more columns. The example in the documentation
> only shows a single column.
> Here is an example to show the problem:
> ****Example code****
> {code}
> import pandas as pd
> from pyspark.sql import SQLContext
> hc = SQLContext(sc)
> A = pd.DataFrame({'year': ['1993', '2005', '1994'],
>                   'month': ['5', '12', '12'],
>                   'value': [100, 200, 300]})
> a = hc.createDataFrame(A)
> B = pd.DataFrame({'year': ['1993', '1993'],
>                   'month': ['12', '12'],
>                   'value': [101, 102]})
> b = hc.createDataFrame(B)
> print "Pandas"  # try with Pandas
> print A
> print B
> print pd.merge(A, B, on=['year', 'month'], how='inner')
> print "Spark"
> print a.toPandas()
> print b.toPandas()
> print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
> {code}
> ****Output****
> {code}
> Pandas
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
> Empty DataFrame
> Columns: [month, value_x, year, value_y]
> Index: []
> Spark
>   month  value  year
> 0     5    100  1993
> 1    12    200  2005
> 2    12    300  1994
>   month  value  year
> 0    12    101  1993
> 1    12    102  1993
>   month  value  year month  value  year
> 0    12    200  2005    12    102  1993
> 1    12    200  2005    12    101  1993
> 2    12    300  1994    12    102  1993
> 3    12    300  1994    12    101  1993
> {code}
> It looks like Spark returns some results where an inner join should
> return nothing.
> Confirmed as an issue with Ayan Guha on the user mailing list.
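
One plausible reading of the Spark output quoted above: Python's `and` cannot be overloaded, so `x and y` evaluates to `y` whenever `x` is truthy. If the column expressions in 1.3.1 are ordinary truthy objects (an assumption, not something this report confirms), the join condition would collapse to `a.month==b.month` alone, which is consistent with the four rows returned. The sketch below illustrates the mechanism with a hypothetical `Expr` class standing in for a column expression. It is not PySpark's implementation.

```python
class Expr(object):
    """Hypothetical stand-in for a column expression; not PySpark's Column."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Build a new expression instead of returning a bool.
        return Expr("(%s = %s)" % (self.text, other.text))

    def __and__(self, other):
        # The bitwise operator can be overloaded, so both sides are kept.
        return Expr("(%s AND %s)" % (self.text, other.text))

    def __repr__(self):
        return self.text


a_year, b_year = Expr("a.year"), Expr("b.year")
a_month, b_month = Expr("a.month"), Expr("b.month")

# `and` returns its second operand when the first is truthy,
# so the year comparison is silently dropped.
with_and = (a_year == b_year) and (a_month == b_month)

# `&` calls __and__, so both comparisons survive.
with_amp = (a_year == b_year) & (a_month == b_month)

print(with_and)
print(with_amp)
```

Under the same assumption, writing the condition in the example as `(a.year == b.year) & (a.month == b.month)` should combine both comparisons. The parentheses matter because `&` binds more tightly than `==` in Python.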
--
This message was sent by Atlassian JIRA (v6.3.4#6332)