[ https://issues.apache.org/jira/browse/SPARK-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-7375: ----------------------------------- Assignee: Apache Spark > Avoid defensive copying in SQL exchange operator when sort-based shuffle > buffers data in serialized form > -------------------------------------------------------------------------------------------------------- > > Key: SPARK-7375 > URL: https://issues.apache.org/jira/browse/SPARK-7375 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Josh Rosen > Assignee: Apache Spark > > The original sort-based shuffle buffers shuffle input records in memory while > sorting them. This causes problems when mutable records are presented to the > shuffle, which happens in Spark SQL's Exchange operator. To work around this > issue, SPARK-2967 and SPARK-4479 added defensive copying of shuffle inputs in > the Exchange operator when sort-based shuffle is enabled. > I think that [~sandyr]'s recent patch for enabling serialization of records > in sort-based shuffle (SPARK-4550) and my proposed {{unsafe}}-based shuffle > path (SPARK-7081) may allow us to avoid this defensive copying in certain > cases (since our patches cause records to be serialized one-at-a-time and > remove the buffering of deserialized records). > As mentioned in SPARK-4479, a long-term fix for this issue might be to add > hooks for informing the shuffle about object (im)mutability in order to allow > the shuffle layer to decide whether to copy. In the meantime, though, I think > that we should just extend the checks added in SPARK-4479 to avoid copies > when these new serialized sort paths are used. > /cc [~rxin] [~marmbrus] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org