[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048636#comment-15048636 ]
Adam Roberts edited comment on SPARK-9858 at 12/9/15 1:07 PM:
--------------------------------------------------------------

Thanks for the prompt reply. rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer, within the asKeyValueIterator method. I experimented with the Exchange class; the same problems are observed when using SparkSqlSerializer, which suggests the UnsafeRowSerializer is probably fine. I agree with your second comment: I think the code within org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere.

It would be useful to determine how the values in the assertions can be derived programmatically. I think the partitioning algorithm itself is working as expected, but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged and I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage (see the listener sketch below)? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes (see the second sketch below). I'm wondering if there's an extra factor we should take into account when determining the indices, regardless of platform.

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle
> partitions.
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-9858
>                 URL: https://issues.apache.org/jira/browse/SPARK-9858
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 1.6.0
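On printing the bytes per stage: here is a minimal sketch of a listener that sums shuffle write bytes per stage, assuming the current (1.6-era) listener API; ShuffleBytesListener is just a hypothetical name, not something in the tree:

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

import scala.collection.mutable

// Sums shuffle write bytes per stage and prints the running total.
class ShuffleBytesListener extends SparkListener {
  private val bytesByStage = mutable.Map[Int, Long]().withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, and shuffleWriteMetrics
    // is an Option (not every task writes shuffle data).
    for {
      metrics <- Option(taskEnd.taskMetrics)
      shuffle <- metrics.shuffleWriteMetrics
    } {
      bytesByStage(taskEnd.stageId) += shuffle.shuffleBytesWritten
      println(s"stage ${taskEnd.stageId}: " +
        s"${bytesByStage(taskEnd.stageId)} shuffle bytes written so far")
    }
  }
}

// Usage, given a SparkContext sc:
// sc.addSparkListener(new ShuffleBytesListener)
{code}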
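And the potential clue above as a standalone repro sketch, assuming an existing SparkContext sc; the aggregate here is a hypothetical stand-in for the one in the failing test, not the actual test query:

{code:scala}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// SQLConf.SHUFFLE_PARTITIONS.key resolves to "spark.sql.shuffle.partitions";
// with 4 post-shuffle partitions the aggregate test passes for me.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")

// Hypothetical aggregate along the lines of the failing test.
val df = sqlContext.range(0, 1000).selectExpr("id % 10 as key", "id as value")
df.groupBy("key").count().collect().foreach(println)
{code}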