[ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459925#comment-15459925 ]

Miguel Tormo edited comment on SPARK-17211 at 9/3/16 9:16 PM:
--------------------------------------------------------------

[~gurmukhd], you say it works only for heap < 32 GB; which one do you mean, 
driver memory or executor memory? Compressed oops only needs to be disabled for 
the JVM whose heap is under 32 GB when the other one's isn't. To be on the safe 
side, you can test disabling it for both:

{noformat}
spark-shell ... --driver-java-options "-XX:-UseCompressedOops" --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" ...
{noformat}

[~davies] I just tried the patch and it seems to work! I couldn't break it, 
although it probably needs more thorough testing to cover other cases. It does 
seem to fix the issue, thanks!
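For instance, a quick invariant check along these lines (a sketch reusing 
{{func}}, {{keys_df}} and {{data_df}} from the reproduction below) should pass 
once the fix or the UseCompressedOops workaround is in place:

{noformat}
# Sketch: the broadcast and shuffle joins must return the same rows.
# Reuses func, keys_df and data_df from the reproduction in the description.
broadcast_rows = sorted(data_df.join(func.broadcast(keys_df), 'key_id').collect())
shuffle_rows = sorted(data_df.join(keys_df, 'key_id').collect())
assert broadcast_rows == shuffle_rows, "broadcast join diverges from shuffle join"
{noformat}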




> Broadcast join produces incorrect results when compressed Oops differs between driver, executor
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17211
>                 URL: https://issues.apache.org/jira/browse/SPARK-17211
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Jarno Seppanen
>            Assignee: Davies Liu
>
> Broadcast join produces incorrect columns in join result, see below for an 
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
>     (54000000, 0),
>     (54000001, 1),
>     (54000002, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # +--------+-----+
> # |  key_id|value|
> # +--------+-----+
> # |54000000|    0|
> # |54000001|    1|
> # |54000002|    2|
> # +--------+-----+
> data = [
>     (54000002,    1),
>     (54000000,    2),
>     (54000001,    3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # +--------+---+
> # |  key_id|foo|
> # +--------+---+
> # |54000002|  1|
> # |54000000|  2|
> # |54000001|  3|
> # +--------+---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # +--------+---+--------+
> # |  key_id|foo|   value|
> # +--------+---+--------+
> # |54000002|  1|54000002|
> # |54000000|  2|54000000|
> # |54000001|  3|54000001|
> # +--------+---+--------+
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # +--------+---+-----+
> # |  key_id|foo|value|
> # +--------+---+-----+
> # |54000000|  2|    0|
> # |54000001|  3|    1|
> # |54000002|  1|    2|
> # +--------+---+-----+
> {noformat}


