Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: WIP: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 4:

(7 comments)

Most of my comments are related to error handling - running to DCHECKs due to 
corrupt Parquet files would lead to fuzz test failures in the future, so it is 
better to return a non-OK status.

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@399
PS4, Line 399:     int64_t encoded_len = 
ParquetPlainEncoder::EncodedLen<PARQUET_TYPE>(
Please add  DCHECK_EQ(page_encoding_, parquet::Encoding::PLAIN);

There are plans to add delta encoding support, and these DCHECKs will make it 
easier to find every place that needs to handle encodings differently.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088
PS4, Line 1088:       DCHECK_LT(dict_start, data_start);
This can be hit in a corrupt Parquet file.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@229
PS4, Line 229: EncodedLen
Can you add a comment for the function? It may make sense to mention "Skip" in 
the function name.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@230
PS4, Line 230: int64_t
This topic was touched in https://gerrit.cloudera.org/#/c/12172/ , but I would 
still prefer to use int32 instead of int64. The number of rows in a parquet 
file / row group is stored as int64, but the number of rows in a datapage is 
int32, so decoding stuff dees not have to deal with such large number of values.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@241
PS4, Line 241:         DCHECK(false);
It would be better to return -1 then 0 in this case.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@244
PS4, Line 244:     return byte_size * num_values;
The specialization for BYTE_ARRAY checks if the result will not pass buffer_end 
- I think that it should be checked here too.

It would be also good to handle the case when the result of the multiplication 
does not fit into int32_t.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc@107
PS4, Line 107:   DCHECK_GE(file_offset, base_offset_);
             :   DCHECK_LE(file_offset + length, max_offset_);
file_offset and length come from thrift structures and are not validated at 
other places, so I think that this should not lead to a DCHECK. I would simply 
return something like a "Corrupt page index in file" warning and ignore the 
page index.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Feb 2019 16:24:43 +0000
Gerrit-HasComments: Yes

Reply via email to