Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190MB), i.e. close to 200KB per stream. For a shuffle task with 10k reducers and 32 threads running concurrently, the memory used by the LZF streams alone would be ~60GB.
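For anyone who wants to reproduce this, a rough sketch along these lines works (this is a heap-delta approximation; the numbers in the gist linked below came from allocation data, so a profiler will give you more precise figures). It assumes the compress-lzf and snappy-java artifacts that Spark already depends on:

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream
    import org.xerial.snappy.SnappyOutputStream

    object CompressionStreamMemory {
      // Rough retained-heap measurement; an allocation profiler is more precise.
      private def usedHeap(): Long = {
        System.gc()
        Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
      }

      def main(args: Array[String]): Unit = {
        val before = usedHeap()
        // Hold references so the streams' internal buffers stay live while we measure.
        val streams = (1 to 1000).map { _ =>
          new LZFOutputStream(new ByteArrayOutputStream())
          // Swap in the line below to measure Snappy instead:
          // new SnappyOutputStream(new ByteArrayOutputStream())
        }
        val after = usedHeap()
        println(s"${streams.size} streams retain ~${(after - before) / (1024 * 1024)} MB")
      }
    }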
In comparison, Snappy only allocates ~65MB for every 1k SnappyOutputStreams. However, Snappy's compression ratio is slightly worse than LZF's; in my experience it leads to a 10 - 20% increase in size. Compression ratio does matter here because we are sending data across the network.

In future releases we will likely change the shuffle implementation to open fewer streams. Until that happens, I'm looking for compression codec implementations that are fast, allocate small buffers, and have a decent compression ratio. Does anybody on this list have any suggestions? If not, I will submit a patch for 1.1 that replaces LZF with Snappy as the default compression codec, to lower memory usage; a config sketch for trying that switch today is below.

Allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
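In the meantime, the codec is already configurable per job, so if shuffle memory is hurting you now, something like this (using the existing spark.io.compression.codec property) should let you opt into Snappy without waiting for the default to change:

    import org.apache.spark.SparkConf

    // Use Snappy for shuffle/broadcast/spill compression instead of the LZF default.
    val conf = new SparkConf()
      .setAppName("snappy-codec-test")
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")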