Hello Jean-Daniel Cryans, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/5679 to look at the new patch set (#3). Change subject: KUDU-1600 (part 2): store uncompressed blocks when codec can't compress ...................................................................... KUDU-1600 (part 2): store uncompressed blocks when codec can't compress This changes the compressed block header for CFile v2 so that it only includes the uncompressed block length, and not a redundant copy of the compressed block length (which is already known by the length of the block itself). In addition, this enables a special behavior if the stored uncompressed block length is exactly equal to the length of the compressed data. This signifies that the data has not been compressed and that the reader should not execute the configured codec. On the write side, we always execute the codec, and in the case that the resulting compression ratio is not good enough, we just store the uncompressed data as above. By default, the compression ratio is 0.9x (i.e at least 10% space reduction). It can be configured by an experimental flag. Initially, the JIRA had described doing this based on a policy rather than based on the actual output of the codec. However, during development I realized that, even with encodings like BIT_SHUFFLE, there are actually some patterns that can benefit from a second pass of LZ4. And even if the second pass is not effective, it isn't too expensive to try a second pass on the write side anyway. See [1] for a discussion of a dataset that benefits from multiple passes of compression. Along the way, made a few changes to how the compressed block decoder is implemented: * Use strings::Substitute instead of StringPrintf (easier to read) * Cleaned up unnecessary header includes * Made the CompressedBlockDecoder a short-lifetime stack object rather than an allocated member of CFileReader. This avoids an allocation and indirection, and additionally allows us to put more state inside the object itself without worrying about thread safety, etc. * The CompressedBlockDecoder is now stateful, making for an easier-to-use API. [1] https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ Change-Id: I8306f1f2139ebf7500a83ee46b4ccd6c7b5e137f --- M src/kudu/cfile/block_compression.cc M src/kudu/cfile/block_compression.h M src/kudu/cfile/cfile_reader.cc M src/kudu/cfile/cfile_reader.h M src/kudu/cfile/cfile_writer.cc M src/kudu/cfile/compression-test.cc 6 files changed, 214 insertions(+), 115 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/5679/3 -- To view, visit http://gerrit.cloudera.org:8080/5679 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8306f1f2139ebf7500a83ee46b4ccd6c7b5e137f Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org>