Abhishek Rawat has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/13582 )
Change subject: IMPALA-8617: Add support for lz4 in parquet
......................................................................

IMPALA-8617: Add support for lz4 in parquet

A new enum value LZ4_BLOCKED was added to the THdfsCompression enum to
distinguish it from the existing LZ4 codec. The LZ4_BLOCKED codec
represents the block compression scheme used by Hadoop. It is similar
to SNAPPY_BLOCKED as far as the block format is concerned; the only
difference is the codec used for compression and decompression.

Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's lz4 block
compression scheme. The Lz4BlockCompressor treats the input as a
single block and generates a compressed block with the following
layout:

<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>

The hdfs parquet table writer calls the Lz4BlockCompressor with the
ideal input size (the unit of compression in parquet is a page), so
the Lz4BlockCompressor does not further break the input down into
smaller blocks.

The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and by other engines in the Hadoop ecosystem.
It can decompress compressed data in the following format:

<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
...
<4 byte big endian compressed size>
<lz4 compressed block>
...
<repeated until uncompressed size from outer block is consumed>

Externally, users can now select the lz4 codec for parquet using:

set COMPRESSION_CODEC=lz4

This is translated into the LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, the LZ4_BLOCKED codec is used when
reading lz4 compressed parquet data.
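The framing described above can be sketched as follows. This is a
minimal illustration of the byte layout only, not the patch itself:
`pack_block` and `parse_blocks` are hypothetical helper names, and a
raw byte string stands in for the actual lz4-compressed payload (the
real Lz4BlockCompressor/Lz4BlockDecompressor are C++ classes in
be/src/util/compress.cc and be/src/util/decompress.cc).

```python
import struct

def pack_block(uncompressed_len, compressed_payload):
    """Frame one compressed page as described for Lz4BlockCompressor:
    <4 byte big endian uncompressed size>
    <4 byte big endian compressed size>
    <compressed block>
    """
    return (struct.pack(">I", uncompressed_len)
            + struct.pack(">I", len(compressed_payload))
            + compressed_payload)

def parse_blocks(data):
    """Walk the Hadoop-compatible layout Lz4BlockDecompressor accepts:
    one outer uncompressed size, followed by one or more
    (<4 byte big endian compressed size>, <compressed block>) pairs,
    repeated until the outer uncompressed size is consumed."""
    (outer_uncompressed,) = struct.unpack_from(">I", data, 0)
    offset = 4
    payloads = []
    while offset < len(data):
        # Each inner block is prefixed only by its compressed size.
        (clen,) = struct.unpack_from(">I", data, offset)
        offset += 4
        payloads.append(data[offset:offset + clen])
        offset += clen
    return outer_uncompressed, payloads
```

A block written by `pack_block` is the degenerate single-inner-block
case of what `parse_blocks` reads, which is why an Impala-written page
is also readable by other Hadoop-ecosystem engines.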
Testing:
- Added unit tests for LZ4_BLOCKED in decompress-test.cc
- Added unit tests for Hadoop compatibility in decompress-test.cc,
  basically being able to decompress an outer block with multiple
  inner blocks (the Lz4BlockDecompressor description above)
- Added interoperability tests for Hive and Impala for all parquet
  codecs. New test added in
  tests/custom_cluster/test_hive_parquet_codec_interop.py

Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/service/query-options-test.cc
M be/src/util/codec.cc
M be/src/util/compress.cc
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M common/thrift/CatalogObjects.thrift
M common/thrift/generate_error_codes.py
M testdata/workloads/functional-query/queries/QueryTest/set.test
M tests/common/test_dimensions.py
A tests/custom_cluster/test_hive_parquet_codec_interop.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_parquet.py
18 files changed, 439 insertions(+), 59 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/13582/7
--
To view, visit http://gerrit.cloudera.org:8080/13582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Gerrit-Change-Number: 13582
Gerrit-PatchSet: 7
Gerrit-Owner: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>