Abhishek Rawat has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/13582 )
Change subject: IMPALA-8617: Add support for lz4 in parquet
......................................................................

IMPALA-8617: Add support for lz4 in parquet

A new enum value LZ4_BLOCKED was added to the THdfsCompression enum to
distinguish it from the existing LZ4 codec. The LZ4_BLOCKED codec
represents the block compression scheme used by Hadoop. It is similar
to SNAPPY_BLOCKED as far as the block format is concerned; the only
difference is the codec used for compression and decompression.

Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's lz4 block
compression scheme. The Lz4BlockCompressor treats the input as a
single block and generates a compressed block with the following
layout:

<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>

The hdfs parquet table writer calls the Lz4BlockCompressor with the
ideal input size (the unit of compression in parquet is a page), so
the Lz4BlockCompressor does not further break the input down into
smaller blocks.

The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and by other engines in the Hadoop ecosystem.
It can decompress compressed data in the following format:

<4 byte big endian uncompressed size>
<4 byte big endian compressed size>
<lz4 compressed block>
...
<4 byte big endian compressed size>
<lz4 compressed block>
...
<repeated until uncompressed size from outer block is consumed>

Externally, users can now select the lz4 codec for parquet using:

set COMPRESSION_CODEC=lz4

This is translated into the LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, the LZ4_BLOCKED codec is used when
reading lz4 compressed parquet data.
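The framing described above can be sketched as follows. This is a
minimal illustration of the byte layout only, not the patch itself:
`pack_block` and `parse_blocks` are hypothetical helper names, and a
raw byte string stands in for the actual lz4-compressed payload (the
real Lz4BlockCompressor/Lz4BlockDecompressor are C++ classes in
be/src/util/compress.cc and be/src/util/decompress.cc).

```python
import struct

def pack_block(uncompressed_len, compressed_payload):
    """Frame one compressed page as described for Lz4BlockCompressor:
    <4 byte big endian uncompressed size>
    <4 byte big endian compressed size>
    <compressed block>
    """
    return (struct.pack(">I", uncompressed_len)
            + struct.pack(">I", len(compressed_payload))
            + compressed_payload)

def parse_blocks(data):
    """Walk the Hadoop-compatible layout Lz4BlockDecompressor accepts:
    one outer uncompressed size, followed by one or more
    (<4 byte big endian compressed size>, <compressed block>) pairs,
    repeated until the outer uncompressed size is consumed."""
    (outer_uncompressed,) = struct.unpack_from(">I", data, 0)
    offset = 4
    payloads = []
    while offset < len(data):
        # Each inner block is prefixed only by its compressed size.
        (clen,) = struct.unpack_from(">I", data, offset)
        offset += 4
        payloads.append(data[offset:offset + clen])
        offset += clen
    return outer_uncompressed, payloads
```

A block written by `pack_block` is the degenerate single-inner-block
case of what `parse_blocks` reads, which is why an Impala-written page
is also readable by other Hadoop-ecosystem engines.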
Testing:
- Added unit tests for LZ4_BLOCKED in decompress-test.cc
- Added unit tests for Hadoop compatibility in decompress-test.cc,
  basically being able to decompress an outer block with multiple
  inner blocks (the Lz4BlockDecompressor description above)
- Added interoperability tests for Hive and Impala for all parquet
  codecs. New test added in
  tests/custom_cluster/test_hive_parquet_codec_interop.py

Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/service/query-options-test.cc
M be/src/util/codec.cc
M be/src/util/compress.cc
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M common/thrift/CatalogObjects.thrift
M common/thrift/generate_error_codes.py
M testdata/workloads/functional-query/queries/QueryTest/set.test
M tests/common/test_dimensions.py
A tests/custom_cluster/test_hive_parquet_codec_interop.py
M tests/query_test/test_insert.py
M tests/query_test/test_insert_parquet.py
18 files changed, 439 insertions(+), 59 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/13582/7
--
To view, visit http://gerrit.cloudera.org:8080/13582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Gerrit-Change-Number: 13582
Gerrit-PatchSet: 7
Gerrit-Owner: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>