Joe McDonnell created IMPALA-10629:
--------------------------------------

             Summary: bin/load-data.py does not respect compression codec for parquet
                 Key: IMPALA-10629
                 URL: https://issues.apache.org/jira/browse/IMPALA-10629
             Project: IMPALA
          Issue Type: Bug
          Components: Infrastructure
    Affects Versions: Impala 4.0
            Reporter: Joe McDonnell

If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it silently ignores the codec and uses Snappy under the covers:
{noformat}
$ bin/load-data.py -w tpch --table_formats=parquet/zstd
$ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
Found 4 items
-rw-r--r-- 3 joe supergroup 72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000000_1779607968_data.0.parq
-rw-r--r-- 3 joe supergroup 58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000001_53336944_data.0.parq
-rw-r--r-- 3 joe supergroup 72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
drwxr-xr-x - joe supergroup 0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
$ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
$ parquet-reader 02444051906c734d-3b49d6c900000002_53336944_data.0.parq
...
    [10] = ColumnChunk {
      02: file_offset (i64) = 37053592,
      03: meta_data (struct) = ColumnMetaData {
        01: type (i32) = 6,
        02: encodings (list) = list<i32>[2] {
          [0] = 2,
          [1] = 3,
        },
        03: path_in_schema (list) = list<string>[1] {
          [0] = "l_shipdate",
        },
        04: codec (i32) = 1, <------ SNAPPY!!!!
...
{noformat}
Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec query option when loading parquet.

It is a bug that this silently does the wrong thing, but the actual support is more of a feature request. Being able to load ZSTD (or other compression) parquet makes it easier to do performance comparisons for those compression codecs on the perf-AB-test upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
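
For illustration only, here is a minimal sketch of the missing step described in the report: deriving the codec from a table format string such as parquet/zstd and issuing the compression_codec query option before the INSERT. This is not the actual bin/load-data.py code. The helper names (split_table_format, build_insert_statements), the example table names, and the assumption that the codec is the second "/"-separated field are all hypothetical.

```python
def split_table_format(table_format):
    """Split a 'file_format/codec' string such as 'parquet/zstd'.

    Assumption: the codec, when present, is the second '/'-separated field.
    Returns (file_format, codec), with codec set to None when it is absent.
    """
    parts = table_format.split("/")
    file_format = parts[0]
    codec = parts[1] if len(parts) > 1 and parts[1] else None
    return file_format, codec


def build_insert_statements(table_format, insert_sql):
    """Return the statements to run for one load, in order.

    For Parquet with an explicit codec, a SET COMPRESSION_CODEC statement is
    emitted first so the INSERT does not fall back to the default codec.
    """
    file_format, codec = split_table_format(table_format)
    statements = []
    if file_format == "parquet" and codec is not None:
        statements.append("SET COMPRESSION_CODEC={0};".format(codec))
    statements.append(insert_sql)
    return statements


if __name__ == "__main__":
    # Hypothetical source and target table names, used only for the demo.
    sql = "INSERT OVERWRITE TABLE target_db.lineitem SELECT * FROM source_db.lineitem;"
    for stmt in build_insert_statements("parquet/zstd", sql):
        print(stmt)
```

Emitting the SET only when a codec is named leaves the default behaviour untouched for plain parquet loads. A real fix would also need to decide how to treat codec names the loader does not recognise, rather than silently falling back.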