[ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davies Liu resolved SPARK-17211. -------------------------------- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14927 [https://github.com/apache/spark/pull/14927] > Broadcast join produces incorrect results when compressed Oops differs > between driver, executor > ----------------------------------------------------------------------------------------------- > > Key: SPARK-17211 > URL: https://issues.apache.org/jira/browse/SPARK-17211 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.0.0 > Reporter: Jarno Seppanen > Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > Broadcast join produces incorrect columns in join result, see below for an > example. The same join but without using broadcast gives the correct columns. > Running PySpark on YARN on Amazon EMR 5.0.0. > {noformat} > import pyspark.sql.functions as func > keys = [ > (54000000, 0), > (54000001, 1), > (54000002, 2), > ] > keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1) > keys_df.show() > # +--------+-----+ > # | key_id|value| > # +--------+-----+ > # |54000000| 0| > # |54000001| 1| > # |54000002| 2| > # +--------+-----+ > data = [ > (54000002, 1), > (54000000, 2), > (54000001, 3), > ] > data_df = spark.createDataFrame(data, ['key_id', 'foo']) > data_df.show() > # +--------+---+ > > # | key_id|foo| > # +--------+---+ > # |54000002| 1| > # |54000000| 2| > # |54000001| 3| > # +--------+---+ > ### INCORRECT ### > data_df.join(func.broadcast(keys_df), 'key_id').show() > # +--------+---+--------+ > > # | key_id|foo| value| > # +--------+---+--------+ > # |54000002| 1|54000002| > # |54000000| 2|54000000| > # |54000001| 3|54000001| > # +--------+---+--------+ > ### CORRECT ### > data_df.join(keys_df, 'key_id').show() > # +--------+---+-----+ > # | key_id|foo|value| > # +--------+---+-----+ > # |54000000| 2| 0| > # |54000001| 3| 1| > # |54000002| 1| 2| > # +--------+---+-----+ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org