[kudu-CR] KUDU-1600 (part 2): store uncompressed blocks when codec can't compress

Todd Lipcon (Code Review) Wed, 11 Jan 2017 18:10:15 -0800

Hello Jean-Daniel Cryans, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit


    http://gerrit.cloudera.org:8080/5679

to look at the new patch set (#3).

Change subject: KUDU-1600 (part 2): store uncompressed blocks when codec can't 
compress
......................................................................

KUDU-1600 (part 2): store uncompressed blocks when codec can't compress

This changes the compressed block header for CFile v2 so that it only
includes the uncompressed block length, and not a redundant copy of the
compressed block length (which is already known by the length of the
block itself).

In addition, this enables a special behavior if the stored uncompressed
block length is exactly equal to the length of the compressed data. This
signifies that the data has not been compressed and that the reader
should not execute the configured codec.

On the write side, we always execute the codec, and in the case that the
resulting compression ratio is not good enough, we just store the
uncompressed data as above. By default, the compression ratio is 0.9x
(i.e at least 10% space reduction). It can be configured by an
experimental flag.

Initially, the JIRA had described doing this based on a policy rather
than based on the actual output of the codec. However, during
development I realized that, even with encodings like BIT_SHUFFLE, there
are actually some patterns that can benefit from a second pass of LZ4.
And even if the second pass is not effective, it isn't too expensive to
try a second pass on the write side anyway. See [1] for a discussion of
a dataset that benefits from multiple passes of compression.

Along the way, made a few changes to how the compressed block decoder is
implemented:

* Use strings::Substitute instead of StringPrintf (easier to read)
* Cleaned up unnecessary header includes
* Made the CompressedBlockDecoder a short-lifetime stack object rather
  than an allocated member of CFileReader. This avoids an allocation and
  indirection, and additionally allows us to put more state inside the
  object itself without worrying about thread safety, etc.
* The CompressedBlockDecoder is now stateful, making for an
  easier-to-use API.

[1] https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ

Change-Id: I8306f1f2139ebf7500a83ee46b4ccd6c7b5e137f
---
M src/kudu/cfile/block_compression.cc
M src/kudu/cfile/block_compression.h
M src/kudu/cfile/cfile_reader.cc
M src/kudu/cfile/cfile_reader.h
M src/kudu/cfile/cfile_writer.cc
M src/kudu/cfile/compression-test.cc
6 files changed, 214 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/5679/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5679
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8306f1f2139ebf7500a83ee46b4ccd6c7b5e137f
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

[kudu-CR] KUDU-1600 (part 2): store uncompressed blocks when codec can't compress

Reply via email to