Joe McDonnell created IMPALA-10629:
--------------------------------------

             Summary: bin/load-data.py does not respect compression codec for parquet
                 Key: IMPALA-10629
                 URL: https://issues.apache.org/jira/browse/IMPALA-10629
             Project: IMPALA
          Issue Type: Bug
          Components: Infrastructure
    Affects Versions: Impala 4.0
            Reporter: Joe McDonnell


If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it 
silently ignores the codec and uses Snappy under the covers:
{noformat}
$ bin/load-data.py -w tpch --table_formats=parquet/zstd
$ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
Found 4 items
-rw-r--r--   3 joe supergroup   72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000000_1779607968_data.0.parq
-rw-r--r--   3 joe supergroup   58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000001_53336944_data.0.parq
-rw-r--r--   3 joe supergroup   72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
drwxr-xr-x   - joe supergroup          0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
$ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
$ parquet-reader 02444051906c734d-3b49d6c900000002_53336944_data.0.parq
...
        [10] = ColumnChunk {
          02: file_offset (i64) = 37053592,
          03: meta_data (struct) = ColumnMetaData {
            01: type (i32) = 6,
            02: encodings (list) = list<i32>[2] {
              [0] = 2,
              [1] = 3,
            },
            03: path_in_schema (list) = list<string>[1] {
              [0] = "l_shipdate",
            },
            04: codec (i32) = 1, <------ SNAPPY!!!!

...{noformat}
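For reference when reading dumps like the one above, the numeric codec field can be decoded with a small lookup. This is only an illustrative sketch: the values follow the CompressionCodec enum in the Parquet Thrift definition, and codec_name is a hypothetical helper, not part of any Impala or Parquet tool.

```python
# Values of the CompressionCodec enum from the Parquet Thrift definition.
PARQUET_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}


def codec_name(codec_id):
    """Hypothetical helper: map a numeric codec id to its enum name."""
    return PARQUET_CODECS.get(codec_id, "UNKNOWN(%d)" % codec_id)


if __name__ == "__main__":
    print(codec_name(1))  # the id reported in the dump above
    print(codec_name(6))  # the id a ZSTD-compressed column chunk would report
```

A file that had really been written with ZSTD would show codec 6 rather than 1 in its column metadata.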
Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec 
query option when loading parquet. Silently falling back to the default codec 
is a bug; actually supporting other codecs is more of a feature request.
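One possible shape for a fix is sketched below. It assumes the table format string has the form file_format/codec, as in the command above, and that the loader can emit a SET statement before its INSERT. The function name and the set of accepted codecs are assumptions made for illustration; this is not how bin/load-data.py is currently written.

```python
# Codec names assumed to be accepted for Parquet by the COMPRESSION_CODEC
# query option. This set is an assumption for the sketch, not read from Impala.
ASSUMED_PARQUET_CODECS = {"none", "snappy", "gzip", "zstd", "lz4"}


def compression_codec_statement(table_format):
    """Hypothetical helper: build a SET statement for 'file_format/codec'.

    Returns None for non-Parquet formats and raises ValueError for a codec
    this sketch does not recognize, so a bad codec fails loudly instead of
    being silently ignored.
    """
    parts = table_format.strip().lower().split("/")
    if len(parts) < 2:
        raise ValueError("expected 'file_format/codec', got %r" % table_format)
    file_format, codec = parts[0], parts[1]
    if file_format != "parquet":
        return None
    if codec not in ASSUMED_PARQUET_CODECS:
        raise ValueError("unsupported Parquet codec: %r" % codec)
    return "SET COMPRESSION_CODEC=%s;" % codec.upper()


if __name__ == "__main__":
    print(compression_codec_statement("parquet/zstd"))
```

Raising on an unrecognized codec addresses the "silently does the wrong thing" part even before every codec is supported.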

Being able to load ZSTD (or other compression) parquet makes it easier to do 
performance comparisons for those compression codecs on the perf-AB-test 
upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).


