Maybe we could try LZ4 [1], which has better performance and a smaller footprint than both LZF and Snappy. In fast scan mode it is 1.5 - 2x faster than LZF [2], and its per-stream memory use is roughly 10x smaller (~16KB vs ~190KB for LZF).
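If it helps, here is a rough, untested sketch (HotSpot-only, and mirroring the 1000-stream experiment below rather than the gist's exact methodology) of how one could measure the per-stream allocation of lz4-java's LZ4BlockOutputStream; the Lz4AllocProbe class is just a hypothetical harness, and the block size is the main knob driving the buffer footprint:

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;

import net.jpountz.lz4.LZ4BlockOutputStream;

public class Lz4AllocProbe {
  public static void main(String[] args) {
    // HotSpot-specific bean that exposes per-thread allocation counters.
    com.sun.management.ThreadMXBean bean =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
    long tid = Thread.currentThread().getId();

    OutputStream[] streams = new OutputStream[1000]; // keep streams reachable
    long before = bean.getThreadAllocatedBytes(tid);
    for (int i = 0; i < streams.length; i++) {
      // The single-arg constructor picks a default block size; pass an
      // explicit size (e.g. 16 * 1024) as a second argument to see how
      // the per-stream buffers scale with it.
      streams[i] = new LZ4BlockOutputStream(new ByteArrayOutputStream());
    }
    long after = bean.getThreadAllocatedBytes(tid);
    long total = after - before;
    System.out.printf("%d bytes for %d streams (~%d bytes/stream)%n",
        total, streams.length, total / streams.length);
  }
}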
[1] https://github.com/jpountz/lz4-java
[2] http://ning.github.io/jvm-compressor-benchmark/results/calgary/roundtrip-2013-06-06/index.html

On Mon, Jul 14, 2014 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:
>
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle, and one annoying thing
> about the default compression codec (LZF) is that the implementation we
> use allocates buffers pretty generously. I did a simple experiment and
> found that creating 1000 LZFOutputStreams allocated 198976424 bytes
> (~190MB). If we have a shuffle task that uses 10k reducers and 32 threads
> running concurrently, the memory used by the LZF streams alone would be
> ~60GB.
>
> In comparison, Snappy only allocates ~65MB for every 1k
> SnappyOutputStreams. However, Snappy's compression ratio is slightly
> lower than LZF's; in my experience, it leads to a 10 - 20% increase in
> size. Compression ratio does matter here because we are sending data
> across the network.
>
> In future releases we will likely change the shuffle implementation to
> open fewer streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have a decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy as the default compression
> codec to lower memory usage.
>
> Allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
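P.S. For anyone who wants to experiment before any default changes land: the codec is already pluggable via spark.io.compression.codec (a fully qualified class name in the 1.0 line), e.g. in spark-defaults.conf:

spark.io.compression.codec  org.apache.spark.io.SnappyCompressionCodec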