Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19793 )
Change subject: IMPALA-6433: Add read support for PageHeaderV2 ...................................................................... Patch Set 8: (26 comments) Thanks Csaba, great work. http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@41 PS8, Line 41: tt Nit: should be a single 't'. http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@60 PS8, Line 60: tt Nit: single 't'. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h File be/src/exec/parquet/parquet-column-chunk-reader.h: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123 PS8, Line 123: non-empty data page Nit: could be "non-empty data page's page header"? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123 PS8, Line 123: client Nit: "the client" would be better. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@125 PS8, Line 125: other types of pages Could mention that we also skip empty data pages also here. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@170 PS8, Line 170: has_rep_level_ Nit: maybe 'has_{rep/def}_levels_' (plural) would sound better. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc File be/src/exec/parquet/parquet-column-chunk-reader.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@211 PS8, Line 211: skipping page types we don't care about And also empty pages. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@260 PS8, Line 260: decompressor_.get() != nullptr Can this be NULL if the header is not v2 or if 'is_compressed' is set? Isn't it an error then which we should signal? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@275 PS8, Line 275: if (is_v2) { Can we extract this block to a function? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@284 PS8, Line 284: def_level_size It may be a bit misleading to blame 'def_level_size' if for example the 'rep_level_size' is too big but not more than 'orig_uncompressed_size' and 'def_level_size' is correct but together they exceed 'orig_compressed_size'. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@314 PS8, Line 314: uncompressed_size In my opinion it would be cleaner if we didn't reuse 'uncompressed_size' but used a new variable for it, for example 'actual_uncompressed_size'. Maybe we could call 'uncompressed_size' 'expected_uncompressed_size' from the beginning. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@338 PS8, Line 338: "Expected $1 bytes but got $2", filename(), This error message is a bit misleading, I would think we tried to read 'compressed_size' but actually could read 'uncompressed_size'. But this is a metadata error, right? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@367 PS8, Line 367: // If v2 data page, fill rep/def level info by parsing the beginning of the data. For Isn't this the other way round? If v2 page, rep/def levels have already been read, if v1 we read them now after decompression. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@369 PS8, Line 369: if (!is_v2) { Can we extract this to a function? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@399 PS8, Line 399: page_info->rep_level_encoding = parquet::Encoding::RLE; : page_info->def_level_encoding = parquet::Encoding::RLE; We have already filled these fields on L292-293 and L370-371. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.h@415 PS8, Line 415: /// In this member PLAIN_DICTIONARY is used both for pages with PLAIN_DICTIONARY and Why not use RLE_DICTIONARY? That's the one that's not deprecated. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@359 PS8, Line 359: PLAIN_DICTIONARY Why not RLE_DICTIONARY? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@1145 PS8, Line 1145: Status BaseScalarColumnReader::ReadDataPage() { Can't we reuse ReadNextDataPageHeader() and possibly also ReadCurrentDataPage() in this function? http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h File be/src/exec/parquet/parquet-level-decoder.h: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@56 PS8, Line 56: int Should be int32_t, see comment for L59. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@58 PS8, Line 58: static Status ParseRleByteSize(const string& filename, Could add a short description. http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@59 PS8, Line 59: int We should use a fixed length type, for example int32_t as the thrift type is i32 (https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L575). If 'int' is 8 bytes for some compiler/platform this may cause a buffer overflow. The code before this change also used int32_t in Init(). http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc File be/src/exec/parquet/parquet-level-decoder.cc: http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc@68 PS8, Line 68: Init Before the change Init() used to do validation but now it doesn't, the caller needs to call 'ValidateEncoding()'. It should be either documented or 'Init()' could call 'ValidateEncoding()' (the error would surface in parquet-column-readers.cc instead of parquet-column-chunk-reader.cc). http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql@4099 PS8, Line 4099: alltypesagg_parquet_v2 Why do we have this table twice? http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test File testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test: http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test@44 PS8, Line 44: # Check that repetition levels are decoded correctly. Nit: Could be moved to the test of def levels on L12. http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@168 PS8, Line 168: alltypesagg_parquet_v2 Could use the variable 'table_name'. http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@174 PS8, Line 174: "select int_array from complextypestbl_parquet_v2;" Could use the variable 'table_name'. -- To view, visit http://gerrit.cloudera.org:8080/19793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca Gerrit-Change-Number: 19793 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Tue, 25 Apr 2023 12:34:52 +0000 Gerrit-HasComments: Yes
