[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048636#comment-15048636 ]
Adam Roberts edited comment on SPARK-9858 at 12/9/15 1:07 PM:
--------------------------------------------------------------

Thanks for the prompt reply. rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer, within the asKeyValueIterator method. I experimented with the Exchange class; the same problems are observed when using SparkSqlSerializer, which suggests the UnsafeRowSerializer is probably fine. I agree with your second comment: I think the code within org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere.

It would be useful to determine how the values in the assertions can be derived programmatically. I think the partitioning algorithm itself is working as expected, but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged and I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage (see the listener sketch below)? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns).

Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes (see the second sketch below). I'm wondering if there's an extra factor we should take into account when determining the indices, regardless of platform.

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle
> partitions.
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-9858
>                 URL: https://issues.apache.org/jira/browse/SPARK-9858
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 1.6.0
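On printing the bytes per stage: here is a minimal sketch of a listener that sums shuffle write bytes per stage, assuming the current (1.6-era) listener API; ShuffleBytesListener is just a hypothetical name, not something in the tree:

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

import scala.collection.mutable

// Sums shuffle write bytes per stage and prints the running total.
class ShuffleBytesListener extends SparkListener {
  private val bytesByStage = mutable.Map[Int, Long]().withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, and shuffleWriteMetrics
    // is an Option (not every task writes shuffle data).
    for {
      metrics <- Option(taskEnd.taskMetrics)
      shuffle <- metrics.shuffleWriteMetrics
    } {
      bytesByStage(taskEnd.stageId) += shuffle.shuffleBytesWritten
      println(s"stage ${taskEnd.stageId}: " +
        s"${bytesByStage(taskEnd.stageId)} shuffle bytes written so far")
    }
  }
}

// Usage, given a SparkContext sc:
// sc.addSparkListener(new ShuffleBytesListener)
{code}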
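And the potential clue above as a standalone repro sketch, assuming an existing SparkContext sc; the aggregate here is a hypothetical stand-in for the one in the failing test, not the actual test query:

{code:scala}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// SQLConf.SHUFFLE_PARTITIONS.key resolves to "spark.sql.shuffle.partitions";
// with 4 post-shuffle partitions the aggregate test passes for me.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")

// Hypothetical aggregate along the lines of the failing test.
val df = sqlContext.range(0, 1000).selectExpr("id % 10 as key", "id as value")
df.groupBy("key").count().collect().foreach(println)
{code}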