Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: WIP: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 4: (7 comments) Most of my comments are related to error handling - running to DCHECKs due to corrupt Parquet files would lead to fuzz test failures in the future, so it is better to return a non-OK status. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@399 PS4, Line 399: int64_t encoded_len = ParquetPlainEncoder::EncodedLen<PARQUET_TYPE>( Please add DCHECK_EQ(page_encoding_, parquet::Encoding::PLAIN); There are plans to add delta encoding support, and these DCHECKs will make it easier to find every place that needs to handle encodings differently. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088 PS4, Line 1088: DCHECK_LT(dict_start, data_start); This can be hit in a corrupt Parquet file. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@229 PS4, Line 229: EncodedLen Can you add a comment for the function? It may make sense to mention "Skip" in the function name. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@230 PS4, Line 230: int64_t This topic was touched in https://gerrit.cloudera.org/#/c/12172/ , but I would still prefer to use int32 instead of int64. The number of rows in a parquet file / row group is stored as int64, but the number of rows in a datapage is int32, so decoding stuff dees not have to deal with such large number of values. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@241 PS4, Line 241: DCHECK(false); It would be better to return -1 then 0 in this case. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-common.h@244 PS4, Line 244: return byte_size * num_values; The specialization for BYTE_ARRAY checks if the result will not pass buffer_end - I think that it should be checked here too. It would be also good to handle the case when the result of the multiplication does not fit into int32_t. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-page-index.cc@107 PS4, Line 107: DCHECK_GE(file_offset, base_offset_); : DCHECK_LE(file_offset + length, max_offset_); file_offset and length come from thrift structures and are not validated at other places, so I think that this should not lead to a DCHECK. I would simply return something like a "Corrupt page index in file" warning and ignore the page index. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 04 Feb 2019 16:24:43 +0000 Gerrit-HasComments: Yes