Jarno Seppanen created SPARK-17211: -------------------------------------- Summary: Broadcast join produces incorrect results Key: SPARK-17211 URL: https://issues.apache.org/jira/browse/SPARK-17211 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Jarno Seppanen
Broadcast join produces incorrect columns in join result, see below for an example. The same join but without using broadcast gives the correct columns. Running PySpark on YARN on Amazon EMR 5.0.0. {noformat} import pyspark.sql.functions as func keys = [ (54000000, 0), (54000001, 1), (54000002, 2), ] keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1) keys_df.show() # +--------+-----+ # | key_id|value| # +--------+-----+ # |54000000| 0| # |54000001| 1| # |54000002| 2| # +--------+-----+ data = [ (54000002, 1), (54000000, 2), (54000001, 3), ] data_df = spark.createDataFrame(data, ['key_id', 'foo']) data_df.show() # +--------+---+ # | key_id|foo| # +--------+---+ # |54000002| 1| # |54000000| 2| # |54000001| 3| # +--------+---+ ### INCORRECT ### data_df.join(func.broadcast(keys_df), 'key_id').show() # +--------+---+--------+ # | key_id|foo| value| # +--------+---+--------+ # |54000002| 1|54000002| # |54000000| 2|54000000| # |54000001| 3|54000001| # +--------+---+--------+ ### CORRECT ### data_df.join(keys_df, 'key_id').show() # +--------+---+-----+ # | key_id|foo|value| # +--------+---+-----+ # |54000000| 2| 0| # |54000001| 3| 1| # |54000002| 1| 2| # +--------+---+-----+ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org