Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19793 )

Change subject: IMPALA-6433: Add read support for PageHeaderV2
......................................................................


Patch Set 10:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@41
PS8, Line 41: ti
> Nit: should be a single 't'.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8//COMMIT_MSG@60
PS8, Line 60: ti
> Nit: single 't'.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123
PS8, Line 123: non-empty data page
> Nit: could be "non-empty data page's page header"?
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@123
PS8, Line 123: h the
> Nit: "the client" would be better.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@125
PS8, Line 125: other types of pages
> Could mention that we also skip empty data pages also here.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.h@170
PS8, Line 170:
> Nit: maybe 'has_{rep/def}_levels_' (plural) would sound better.
I prefer the singular form in this case, not sure why :)


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc
File be/src/exec/parquet/parquet-column-chunk-reader.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@211
PS8, Line 211: skipping page types we don't care about
> And also empty pages.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@260
PS8, Line 260: 
> Can this be NULL if the header is not v2 or if 'is_compressed' is set? Isn'
It is possible to encounter files like this and it is not clearly an error 
(is_compressed is optional and true by default)


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@275
PS8, Line 275:     RETURN_IF_ERROR(ParquetLevelDecoder::ParseRleByteSize(
> Can we extract this block to a function?
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@284
PS8, Line 284:
> It may be a bit misleading to blame 'def_level_size' if for example the 're
Changed the error message.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@314
PS8, Line 314: Type::DATA_PAGE_V
> In my opinion it would be cleaner if we didn't reuse 'uncompressed_size' bu
I changed it but I am not sure that it is better like this - it is used as both 
input and output argument in ProcessBlock32 so it must be set to the expected 
size (used as maximum size) and will be overwritten by the actual size.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@367
PS8, Line 367:     data = decompressed_buffer;
> Isn't this the other way round? If v2 page, rep/def levels have already bee
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@369
PS8, Line 369:     if (has_slot_desc) {
> Can we extract this to a function?
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-chunk-reader.cc@399
PS8, Line 399:   if (has_slot_desc) {
             :       // Use original sizes (includes levels in v2) in th
> We have already filled these fields on L292-293 and L370-371.
Done


http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc
File be/src/exec/parquet/parquet-column-chunk-reader.cc:

http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc@286
PS9, Line 286: Status ParquetColumnChunkReader::ProcessRepDefLevelsInDataPageV2(
> line too long (108 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/19793/9/be/src/exec/parquet/parquet-column-chunk-reader.cc@333
PS9, Line 333:   if (is_v2) {
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@359
PS8, Line 359: PLAIN_DICTIONARY
> Why not RLE_DICTIONARY?
Kept it as PLAIN_DICTIONARY to avoid touching unrelated parts of the code. I 
think that it would be better to clean this up when Impala will actually switch 
to writing V2 pages by default, as currently we mainly see PLAIN_DICTIONARY 
encodings.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-column-readers.cc@1145
PS8, Line 1145: Status BaseScalarColumnReader::ReadDataPage() {
> Can't we reuse ReadNextDataPageHeader() and possibly also ReadCurrentDataPa
I thought about this too but wouldn't to it in this patch. The difference of 
these functions comes for skipping vs normal reading context and can have some 
subtle differences.


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h
File be/src/exec/parquet/parquet-level-decoder.h:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@56
PS8, Line 56:
> Should be int32_t, see comment for L59.
done - int32_t vs int is used a bit chaotically in Impala and I think that lot 
of things would break if int was not 32 bit


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@58
PS8, Line 58:   static Status ParseRleByteSize(const string& filename,
> Could add a short description.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.h@59
PS8, Line 59: int
> We should use a fixed length type, for example int32_t as the thrift type i
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc
File be/src/exec/parquet/parquet-level-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/19793/8/be/src/exec/parquet/parquet-level-decoder.cc@68
PS8, Line 68: Init
> Before the change Init() used to do validation but now it doesn't, the call
Removed 'encoding' and added DCHECKs for it in the caller code. It seems very 
unlikely that we will ever support another level encoding.


http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/datasets/functional/functional_schema_template.sql@4099
PS8, Line 4099: alltypesagg_parquet_v2
> Why do we have this table twice?
Probably was a mistake.


http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test:

http://gerrit.cloudera.org:8080/#/c/19793/8/testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test@44
PS8, Line 44: ====
> Nit: Could be moved to the test of def levels on L12.
Moved it before the error messages. It checks both value and rep level decoding 
so I wouldn't put it before line 33


http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@168
PS8, Line 168: ueries = [
> Could use the variable 'table_name'.
Done


http://gerrit.cloudera.org:8080/#/c/19793/8/tests/query_test/test_scanners_fuzz.py@174
PS8, Line 174:
> Could use the variable 'table_name'.
Done


http://gerrit.cloudera.org:8080/#/c/19793/9/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19793/9/tests/query_test/test_scanners_fuzz.py@170
PS9, Line 170:
> flake8: E221 multiple spaces before operator
Done



--
To view, visit http://gerrit.cloudera.org:8080/19793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca
Gerrit-Change-Number: 19793
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Wed, 03 May 2023 18:42:55 +0000
Gerrit-HasComments: Yes

Reply via email to