[ https://issues.apache.org/jira/browse/SPARK-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kay Ousterhout updated SPARK-6026:
----------------------------------

Description:

The bypassMergeThreshold parameter (and the associated use of a hash-ish shuffle when the number of partitions is below this threshold) is essentially a workaround for SparkSQL: the sort-based shuffle stores non-serialized objects, which is a deal-breaker for SparkSQL because it re-uses objects. Once the sort-based shuffle is changed to store serialized objects, we should never be secretly doing a hash-ish shuffle when the user has specified the sort-based shuffle (given the hash-ish shuffle's otherwise worse performance).

[~rxin] [~adav], masters of shuffle, it would be helpful to get agreement from you on this proposal (and also a sanity check that I've correctly characterized the issue).
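The object-reuse hazard described above can be sketched outside Spark. This is a minimal Python analogy, not Spark code; the dict stands in for a row object that a SparkSQL-style producer mutates in place. Buffering bare references to such an object corrupts the buffer, while serializing at insert time snapshots each value, which is why the sort shuffle needs to store serialized objects before the bypass path can go away:

```python
import pickle

# A mutable record, standing in for a re-used row object (hypothetical).
record = {"value": 0}

buffered_refs = []    # non-serialized buffering: stores references only
buffered_bytes = []   # serialized buffering: snapshots each record as bytes

for v in [1, 2, 3]:
    record["value"] = v              # producer re-uses the same object
    buffered_refs.append(record)     # all three entries alias one object
    buffered_bytes.append(pickle.dumps(record))  # value frozen at insert time

# Non-serialized buffer: every entry reflects the last mutation.
print([r["value"] for r in buffered_refs])                  # [3, 3, 3]
# Serialized buffer: each entry kept the value it had when inserted.
print([pickle.loads(b)["value"] for b in buffered_bytes])   # [1, 2, 3]
```

The same aliasing problem is what forces the current sort shuffle to either copy or avoid buffering re-used objects, and what the hash-ish bypass sidesteps by writing each record out immediately.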
> Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6026
>                 URL: https://issues.apache.org/jira/browse/SPARK-6026
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Kay Ousterhout
>
> The bypassMergeThreshold parameter (and associated use of a hash-ish shuffle when the number of partitions is less than this) is basically a workaround for SparkSQL, because the fact that the sort-based shuffle stores non-serialized objects is a deal-breaker for SparkSQL, which re-uses objects.
> Once the sort-based shuffle is changed to store serialized objects, we should never be secretly doing hash-ish shuffle even when the user has specified to use sort-based shuffle (because of its otherwise worse performance).
> [~rxin] [~adav], masters of shuffle, it would be helpful to get agreement from you on this proposal (and also a sanity check that I've correctly characterized the issue).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org