[ https://issues.apache.org/jira/browse/SPARK-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-2469:
-------------------------------
    Summary: Lower shuffle compression buffer memory usage (replace LZF with Snappy for default compression codec)  (was: Lower shuffle compression buffer memory usage)

> Lower shuffle compression buffer memory usage (replace LZF with Snappy for default compression codec)
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2469
>                 URL: https://issues.apache.org/jira/browse/SPARK-2469
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Reynold Xin
>
> I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190 MB). If a shuffle task uses 10k reducers with 32 threads running concurrently, the memory used by the LZF streams alone would be ~60 GB.
> In comparison, Snappy only allocates ~65 MB for every 1000 SnappyOutputStreams. However, Snappy's compression ratio is slightly lower than LZF's; in my experience it leads to a 10-20% increase in size. Compression ratio does matter here because we are sending data across the network.
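For reference, a minimal sketch of how a buffer-overhead experiment like the one described above can be reproduced. This is an approximation, not the original benchmark: it assumes the compress-lzf (com.ning) and snappy-java (org.xerial) libraries that Spark already depends on are on the classpath, and it uses a coarse GC-then-read heap measurement.

{code:scala}
import java.io.ByteArrayOutputStream

import com.ning.compress.lzf.LZFOutputStream
import org.xerial.snappy.SnappyOutputStream

object CompressionBufferExperiment {

  // Coarse heap measurement: request a GC, then read used heap.
  private def usedHeap(): Long = {
    System.gc()
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  def main(args: Array[String]): Unit = {
    val before = usedHeap()
    // Hold strong references so each stream's internal buffers stay live
    // (neither stream is closed, so no buffers are returned for recycling).
    val lzf = Array.fill(1000)(new LZFOutputStream(new ByteArrayOutputStream()))
    val mid = usedHeap()
    val snappy = Array.fill(1000)(new SnappyOutputStream(new ByteArrayOutputStream()))
    val after = usedHeap()

    println(s"1000 LZFOutputStreams:    ~${(mid - before) / (1024 * 1024)} MB")
    println(s"1000 SnappyOutputStreams: ~${(after - mid) / (1024 * 1024)} MB")
    println(s"(held ${lzf.length + snappy.length} streams during measurement)")
  }
}
{code}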
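Until the default changes, users can opt into Snappy explicitly via the spark.io.compression.codec setting. A sketch, assuming the 1.0-era form of the setting where the value is the codec's fully qualified class name (the app name here is hypothetical):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Override the default (LZF) codec used for shuffle and broadcast compression.
val conf = new SparkConf()
  .setAppName("SnappyShuffleExample")  // hypothetical app name
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
val sc = new SparkContext(conf)
{code}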