Ali Bajwa created SPARK-7197:
--------------------------------

             Summary: Join with DataFrame Python API not working properly with more than 1 column
                 Key: SPARK-7197
                 URL: https://issues.apache.org/jira/browse/SPARK-7197
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.3.1
            Reporter: Ali Bajwa

It looks like join in the DataFrame Python API does not return correct results when joining on two or more columns. The example in the documentation only shows a single column.

Here is an example to show the problem:

****Example code****

import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5', '12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'], 'value': [101, 102]})
b = hc.createDataFrame(B)

print "Pandas"
# try with Pandas
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()

*****Output****

Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []

Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993

Pandas returns an empty result, as expected, because no row of A matches a row of B on both year and month. Spark instead returns four rows where the inner join should return nothing.

This was confirmed as an issue with Ayan Guha on the user mailing list.
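
One possible reading of the Spark output, offered as an assumption rather than something this report states: Python's `and` cannot be overloaded, so `a.year==b.year and a.month==b.month` may evaluate to just the second comparison before `join` ever sees it. That would be consistent with the four returned rows, which match on month only. The sketch below shows the difference between `and` and `&` using a hypothetical `Expr` class; it is a stand-in for illustration, not a PySpark type.

```python
class Expr(object):
    """Hypothetical stand-in for a column expression; not part of PySpark."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `&` is overloadable, so it can build a combined expression.
        return Expr("(%s AND %s)" % (self.text, other.text))


year_eq = Expr("a.year = b.year")
month_eq = Expr("a.month = b.month")

# `and` is not overloadable: Python tests the truthiness of the left
# operand (plain objects are truthy by default) and returns the right one.
with_and = year_eq and month_eq

# `&` calls __and__, so both conditions are kept.
with_amp = year_eq & month_eq

print(with_and.text)  # a.month = b.month
print(with_amp.text)  # (a.year = b.year AND a.month = b.month)
```

If this is the cause, the form to try in the example code would be `a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')`. The parentheses matter because `&` binds more tightly than `==` in Python.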