Jarno Seppanen created SPARK-17211:
--------------------------------------

             Summary: Broadcast join produces incorrect results
                 Key: SPARK-17211
                 URL: https://issues.apache.org/jira/browse/SPARK-17211
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Jarno Seppanen


A broadcast join produces incorrect columns in the join result. In the example
below, the value column of the broadcast join contains the key_id values
instead of 0, 1, 2. The same join without the broadcast hint gives the correct
columns.

Running PySpark on YARN on Amazon EMR 5.0.0.

{noformat}

import pyspark.sql.functions as func

keys = [
    (54000000, 0),
    (54000001, 1),
    (54000002, 2),
]

keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
keys_df.show()
# +--------+-----+
# |  key_id|value|
# +--------+-----+
# |54000000|    0|
# |54000001|    1|
# |54000002|    2|
# +--------+-----+

data = [
    (54000002,    1),
    (54000000,    2),
    (54000001,    3),
]

data_df = spark.createDataFrame(data, ['key_id', 'foo'])
data_df.show()
# +--------+---+
# |  key_id|foo|
# +--------+---+
# |54000002|  1|
# |54000000|  2|
# |54000001|  3|
# +--------+---+

### INCORRECT ###

data_df.join(func.broadcast(keys_df), 'key_id').show()
# +--------+---+--------+
# |  key_id|foo|   value|
# +--------+---+--------+
# |54000002|  1|54000002|
# |54000000|  2|54000000|
# |54000001|  3|54000001|
# +--------+---+--------+

### CORRECT ###

data_df.join(keys_df, 'key_id').show()
# +--------+---+-----+
# |  key_id|foo|value|
# +--------+---+-----+
# |54000000|  2|    0|
# |54000001|  3|    1|
# |54000002|  1|    2|
# +--------+---+-----+
{noformat}
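
To make the expected result unambiguous, here is a minimal plain-Python sketch of what an inner join on key_id should return for the data above. It does not use Spark. The helper names expected_join and matches_expected are hypothetical, introduced only for illustration, and the lookup assumes key_id is unique in keys, as it is in this example.

```python
# Plain-Python reference for the inner join on key_id (no Spark involved).
# Same sample rows as in the report above.
keys = [(54000000, 0), (54000001, 1), (54000002, 2)]
data = [(54000002, 1), (54000000, 2), (54000001, 3)]


def expected_join(data, keys):
    """Return sorted (key_id, foo, value) rows for an inner join on key_id."""
    # Assumption: key_id is unique in keys, so a dict lookup is sufficient.
    value_by_key = dict(keys)
    return sorted(
        (key_id, foo, value_by_key[key_id])
        for key_id, foo in data
        if key_id in value_by_key
    )


def matches_expected(rows):
    """Compare join rows (in any order) against the reference result."""
    return sorted(tuple(r) for r in rows) == expected_join(data, keys)
```

Since PySpark Row objects are tuples, the output of collect() on either join above could be passed to matches_expected directly. With the outputs shown in this report, the non-broadcast rows match the reference and the broadcast rows do not.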




