Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19793 )

Change subject: IMPALA-6433: Add read support for PageHeaderV2
......................................................................


Patch Set 8:

(26 comments)

Thanks Csaba, great work.

http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@41
PS8, Line 41: tt
Nit: should be a single 't'.


http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@60
PS8, Line 60: tt
Nit: single 't'.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123
PS8, Line 123: non-empty data page
Nit: could be "non-empty data page's page header"?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123
PS8, Line 123: client
Nit: "the client" would be better.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@125
PS8, Line 125: other types of pages
Could mention that we also skip empty data pages also here.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@170
PS8, Line 170: has_rep_level_
Nit: maybe 'has_{rep/def}_levels_' (plural) would sound better.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc
File be/src/exec/parquet/parquet-column-chunk-reader.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@211
PS8, Line 211: skipping page types we don't care about
And also empty pages.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@260
PS8, Line 260: decompressor_.get() != nullptr
Can this be NULL if the header is not v2 or if 'is_compressed' is set? Isn't it 
an error then which we should signal?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@275
PS8, Line 275:   if (is_v2) {
Can we extract this block to a function?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@284
PS8, Line 284: def_level_size
It may be a bit misleading to blame 'def_level_size' if for example the 
'rep_level_size' is too big but not more than 'orig_uncompressed_size' and 
'def_level_size' is correct but together they exceed 'orig_compressed_size'.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@314
PS8, Line 314: uncompressed_size
In my opinion it would be cleaner if we didn't reuse 'uncompressed_size' but 
used a new variable for it, for example 'actual_uncompressed_size'. Maybe we 
could call 'uncompressed_size' 'expected_uncompressed_size' from the beginning.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@338
PS8, Line 338:           "Expected $1 bytes but got $2", filename(),
This error message is a bit misleading, I would think we tried  to read 
'compressed_size' but actually could read 'uncompressed_size'. But this is a 
metadata error, right?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@367
PS8, Line 367:   // If v2 data page, fill rep/def level info by parsing the 
beginning of the data. For
Isn't this the other way round? If v2 page, rep/def levels have already been 
read, if v1 we read them now after decompression.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@369
PS8, Line 369:   if (!is_v2) {
Can we extract this to a function?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@399
PS8, Line 399: page_info->rep_level_encoding = parquet::Encoding::RLE;
             :   page_info->def_level_encoding = parquet::Encoding::RLE;
We have already filled these fields on L292-293 and L370-371.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.h@415
PS8, Line 415:   /// In this member PLAIN_DICTIONARY is used both for pages 
with PLAIN_DICTIONARY and
Why not use RLE_DICTIONARY? That's the one that's not deprecated.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@359
PS8, Line 359: PLAIN_DICTIONARY
Why not RLE_DICTIONARY?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@1145
PS8, Line 1145: Status BaseScalarColumnReader::ReadDataPage() {
Can't we reuse ReadNextDataPageHeader() and possibly also ReadCurrentDataPage() 
in this function?


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h
File be/src/exec/parquet/parquet-level-decoder.h:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@56
PS8, Line 56: int
Should be int32_t, see comment for L59.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@58
PS8, Line 58:   static Status ParseRleByteSize(const string& filename,
Could add a short description.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@59
PS8, Line 59: int
We should use a fixed length type, for example int32_t as the thrift type is 
i32 
(https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L575).
 If 'int' is 8 bytes for some compiler/platform this may cause a buffer 
overflow.

The code before this change also used int32_t in Init().


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc
File be/src/exec/parquet/parquet-level-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc@68
PS8, Line 68: Init
Before the change Init() used to do validation but now it doesn't, the caller 
needs to call 'ValidateEncoding()'. It should be either documented or 'Init()' 
could call 'ValidateEncoding()' (the error would surface in 
parquet-column-readers.cc instead of parquet-column-chunk-reader.cc).


http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql@4099
PS8, Line 4099: alltypesagg_parquet_v2
Why do we have this table twice?


http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test:

http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test@44
PS8, Line 44: # Check that repetition levels are decoded correctly.
Nit: Could be moved to the test of def levels on L12.


http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@168
PS8, Line 168: alltypesagg_parquet_v2
Could use the variable 'table_name'.


http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@174
PS8, Line 174:       "select int_array from complextypestbl_parquet_v2;"
Could use the variable 'table_name'.



--
To view, visit http://gerrit.cloudera.org:8080/19793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca
Gerrit-Change-Number: 19793
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Tue, 25 Apr 2023 12:34:52 +0000
Gerrit-HasComments: Yes

Reply via email to