[jira] [Created] (SPARK-17211) Broadcast join produces incorrect results

Jarno Seppanen (JIRA) Wed, 24 Aug 2016 00:54:07 -0700

Jarno Seppanen created SPARK-17211:
--------------------------------------

             Summary: Broadcast join produces incorrect results
                 Key: SPARK-17211
                 URL: https://issues.apache.org/jira/browse/SPARK-17211
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Jarno Seppanen



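A broadcast join produces incorrect values in the join result. In the example
below, the value column of the broadcast join contains the key_id values
instead of the values from keys_df. The same join without the broadcast hint
returns the correct columns.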

Running PySpark on YARN on Amazon EMR 5.0.0.

{noformat}

import pyspark.sql.functions as func

keys = [
    (54000000, 0),
    (54000001, 1),
    (54000002, 2),
]

keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
keys_df.show()
# +--------+-----+
# |  key_id|value|
# +--------+-----+
# |54000000|    0|
# |54000001|    1|
# |54000002|    2|
# +--------+-----+

data = [
    (54000002,    1),
    (54000000,    2),
    (54000001,    3),
]

data_df = spark.createDataFrame(data, ['key_id', 'foo'])
data_df.show()
# +--------+---+
# |  key_id|foo|
# +--------+---+
# |54000002|  1|
# |54000000|  2|
# |54000001|  3|
# +--------+---+

### INCORRECT ###

data_df.join(func.broadcast(keys_df), 'key_id').show()
# +--------+---+--------+
# |  key_id|foo|   value|
# +--------+---+--------+
# |54000002|  1|54000002|
# |54000000|  2|54000000|
# |54000001|  3|54000001|
# +--------+---+--------+

### CORRECT ###

data_df.join(keys_df, 'key_id').show()
# +--------+---+-----+
# |  key_id|foo|value|
# +--------+---+-----+
# |54000000|  2|    0|
# |54000001|  3|    1|
# |54000002|  1|    2|
# +--------+---+-----+
{noformat}
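
For comparison, the expected inner-join result can be computed without Spark. The following is only a minimal sketch: reference_join is a hypothetical helper name, not part of PySpark, and it assumes key_id is unique on the keys side, as it is in the example above. The commented-out lines at the end assume the data_df and keys_df defined in the report and a running Spark session.

```python
# Plain-Python reference for the expected inner join on key_id (no Spark needed).
# The input rows are copied from the report above.
keys = [(54000000, 0), (54000001, 1), (54000002, 2)]
data = [(54000002, 1), (54000000, 2), (54000001, 3)]


def reference_join(data_rows, key_rows):
    """Inner-join (key_id, foo) rows with (key_id, value) rows on key_id.

    Hypothetical helper for illustration; assumes key_id is unique in key_rows.
    Returns (key_id, foo, value) tuples sorted by key_id.
    """
    value_by_key = dict(key_rows)
    return sorted(
        (key_id, foo, value_by_key[key_id])
        for key_id, foo in data_rows
        if key_id in value_by_key
    )


expected = reference_join(data, keys)
print(expected)

# To compare against Spark (assumes data_df, keys_df and func from the report):
# actual = sorted(
#     tuple(row)
#     for row in data_df.join(func.broadcast(keys_df), 'key_id').collect()
# )
# assert actual == expected
```

The sorted output makes the comparison independent of row order, which differs between the two joins shown above.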




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
