[ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453936#comment-15453936 ]
Himanish Kushary commented on SPARK-17211:
------------------------------------------

[~gurmukhd] You are seeing the errors outside of the EMR environment as well, right? On Databricks I ran it through a Scala notebook; I am not sure what the underlying configuration was.

I would like to add one more thing: while running the real job (with lots of data), I noticed that even with low driver memory settings and broadcasting enabled, some joins work fine but others mess up the data (either nulls are assigned or field values get mixed up).

What are the other things that would use up 12 GB of memory on the node?

> Broadcast join produces incorrect results on EMR with large driver memory
> --------------------------------------------------------------------------
>
>                 Key: SPARK-17211
>                 URL: https://issues.apache.org/jira/browse/SPARK-17211
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in the join result, see below for an
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
>
> keys = [
>     (54000000, 0),
>     (54000001, 1),
>     (54000002, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # +--------+-----+
> # |  key_id|value|
> # +--------+-----+
> # |54000000|    0|
> # |54000001|    1|
> # |54000002|    2|
> # +--------+-----+
>
> data = [
>     (54000002, 1),
>     (54000000, 2),
>     (54000001, 3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # +--------+---+
> # |  key_id|foo|
> # +--------+---+
> # |54000002|  1|
> # |54000000|  2|
> # |54000001|  3|
> # +--------+---+
>
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # +--------+---+--------+
> # |  key_id|foo|   value|
> # +--------+---+--------+
> # |54000002|  1|54000002|
> # |54000000|  2|54000000|
> # |54000001|  3|54000001|
> # +--------+---+--------+
>
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # +--------+---+-----+
> # |  key_id|foo|value|
> # +--------+---+-----+
> # |54000000|  2|    0|
> # |54000001|  3|    1|
> # |54000002|  1|    2|
> # +--------+---+-----+
> {noformat}
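
For anyone trying to narrow this down, here is a minimal sketch (assuming the same `spark` session, `keys_df`, and `data_df` from the reproduction above): it compares the physical plans to confirm whether BroadcastHashJoin is actually selected, and, as a possible workaround, disables auto-broadcast and drops the explicit broadcast() hint so the join falls back to a sort-merge join.

{noformat}
# Sketch only; assumes `spark`, `keys_df`, and `data_df` from the
# reproduction above are already defined in the session.
import pyspark.sql.functions as func

# Inspect which join strategy each variant actually uses; the hinted join
# should show BroadcastHashJoin in the physical plan.
data_df.join(func.broadcast(keys_df), 'key_id').explain()
data_df.join(keys_df, 'key_id').explain()

# Possible workaround: disable automatic broadcast joins (-1 turns off the
# size threshold) and avoid the broadcast() hint, forcing a sort-merge join.
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1)
data_df.join(keys_df, 'key_id').show()
{noformat}

If the join still returns wrong values with broadcasting disabled, that would point away from the broadcast exchange and toward something else in the job.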