Alex Behm has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation ......................................................................
Patch Set 4: (14 comments) http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 233: // TODO: consider adding validation for other types, such as DECIMAL. Let's file a JIRA instead. You don't need to identify whether there is a bug or not, the JIRA can say that we should investigate the DECIMAL case (and link to this JIRA). Line 388: ReadSlot<IS_DICT_ENCODED>(tuple, pool); single line if it fits Line 475: if (NeedsValidation() && !ValidateSlot(reinterpret_cast<T*>(slot))) { The API seems a little clunky because we call a generic ValidateSlot() but then set a type-specific error message. How would you generalize this for DECIMAL? Imo it'll make the code easier to follow without baking in the TIMESTAMP assumption here. Maybe ValidateSlot() could take an ErrorMsg* as output param which would be set when returning false. Or perhaps you have another idea. Line 503: DCHECK(!needs_conversion_); DCHECK(false)? Line 513: /// Sets the slot to NULL if the slot value is invalid, e.g., due to being Update comment, doesn't set anything Line 515: bool ValidateSlot(T* src) { const function Line 520: /// Pull out slow-path Status construction code from ReadRepetitionLevel()/ remove the last part "from ReadRepetitionLevel..." I don't think that's accurate anymore Line 1084: void* slot = tuple->GetSlot(tuple_offset_); just inline into L1087 http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 511: /// Recursively reads from children_ to assemble a single CollectionValue into Update comment to reflect API change http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 22: remove blank lines Line 150: inline bool IsValidDate() const { nice http://gerrit.cloudera.org:8080/#/c/4968/4/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 312: "The valid date range is 1400..9999."), mention the full range with year-month-day to make it clearer http://gerrit.cloudera.org:8080/#/c/4968/4/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: Line 382: hadoop fs -put -f ${IMPALA_HOME}/testdata/bad_parquet_data/out-of-range-timestamp.parq \ Let's avoid doing this and instead create a test table on-the-fly in the test. There's no need to store this in the snapshot and having the loading separate from the actual test can cause confusion. http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 248: def test_zero_rows(self, vector, unique_database): it would be nice to have a new test test_timestamp_out_of_range that looks like this one the test_corrput_files doesn't really follow best practices imo, and let's not fix that in this patch, so adding a new test seems easiest/cleanest also it's arguable whether these bad timestamp files are really corrupt, the values are certainly legal INT96, but the interpretation as a timestamp is the problem -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-HasComments: Yes