[ https://issues.apache.org/jira/browse/SPARK-7041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-7041:
------------------------------
    Summary: Avoid writing empty files in BypassMergeSortShuffleWriter  (was: Avoid writing empty files in ExternalSorter)

> Avoid writing empty files in BypassMergeSortShuffleWriter
> ---------------------------------------------------------
>
>                 Key: SPARK-7041
>                 URL: https://issues.apache.org/jira/browse/SPARK-7041
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> In ExternalSorter, we may end up opening disk writer files for empty
> partitions. This occurs because we manually call {{open()}} after creating
> the writer, which causes the serialization and compression output streams
> to be created. These streams may write headers to the underlying file,
> resulting in non-zero-length files being created for partitions that
> contain no records.
> This is unnecessary, though, since the disk object writer will automatically
> open itself when the first write is performed. Removing this eager
> {{open()}} call and rewriting the consumers to cope with the non-existence of
> empty files results in a large performance benefit for certain sparse
> workloads when using sort-based shuffle.
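
The eager versus lazy {{open()}} behaviour described in the issue can be illustrated with a small standalone Java sketch. It does not use any Spark classes: LazyWriter and lengthAfterWriting are hypothetical names, and java.util.zip.GZIPOutputStream is only a stand-in for Spark's serialization and compression streams, chosen because it writes a header as soon as it is constructed. The consumer side is assumed to treat a missing file as an empty partition.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.GZIPOutputStream;

public class LazyOpenSketch {

    /** Hypothetical writer for illustration; this is NOT Spark's disk object writer. */
    static final class LazyWriter implements AutoCloseable {
        private final File file;
        private OutputStream out; // stays null until the writer is opened

        LazyWriter(File file) {
            this.file = file;
        }

        /** Creates the file and the compression stream; GZIP emits its header here. */
        void open() throws IOException {
            if (out == null) {
                out = new GZIPOutputStream(new FileOutputStream(file));
            }
        }

        /** Opens the writer on first use, so an explicit open() call is never required. */
        void write(String record) throws IOException {
            open();
            out.write(record.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            if (out != null) {
                out.close();
                out = null;
            }
        }
    }

    /**
     * Writes the given records for one partition and returns the resulting file
     * length. A file that was never created is reported as length 0, which is how
     * this sketch assumes a consumer would treat an empty partition.
     */
    public static long lengthAfterWriting(File file, boolean eagerOpen, String... records)
            throws IOException {
        try (LazyWriter writer = new LazyWriter(file)) {
            if (eagerOpen) {
                writer.open(); // the eager call the issue proposes to remove
            }
            for (String record : records) {
                writer.write(record);
            }
        }
        return file.exists() ? file.length() : 0L;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("lazy-open-sketch").toFile();
        File eager = new File(dir, "eager-empty-partition");
        File lazy = new File(dir, "lazy-empty-partition");

        System.out.println("eager open, no records: " + lengthAfterWriting(eager, true) + " bytes");
        System.out.println("lazy open, no records:  " + lengthAfterWriting(lazy, false) + " bytes");
        System.out.println("lazy file exists: " + lazy.exists());
    }
}
```

With the eager call, an empty partition still leaves a non-zero-length file on disk, because the compression header and trailer are written even though no records are. Without it, the file is never created, so a job with many empty partitions avoids the corresponding file creations and writes.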