Alex Behm has posted comments on this change.

Change subject: IMPALA-4363: Add Parquet timestamp validation
......................................................................


Patch Set 4:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 233:     // TODO: consider adding validation for other types, such as 
DECIMAL.
Let's file a JIRA instead. You don't need to identify whether there is a bug or 
not, the JIRA can say that we should investigate the DECIMAL case (and link to 
this JIRA).


Line 388:               ReadSlot<IS_DICT_ENCODED>(tuple, pool);
single line if it fits


Line 475:     if (NeedsValidation() && 
!ValidateSlot(reinterpret_cast<T*>(slot))) {
The API seems a little clunky because we call a generic ValidateSlot() but then 
set a type-specific error message. How would you generalize this for DECIMAL?

Imo it'll make the code easier to follow without baking in the TIMESTAMP 
assumption here.

Maybe ValidateSlot() could take an ErrorMsg* as output param which would be set 
when returning false. Or perhaps you have another idea.


Line 503:     DCHECK(!needs_conversion_);
DCHECK(false)?


Line 513:   /// Sets the slot to NULL if the slot value is invalid, e.g., due 
to being
Update comment, doesn't set anything


Line 515:   bool ValidateSlot(T* src) {
const function


Line 520:   /// Pull out slow-path Status construction code from 
ReadRepetitionLevel()/
remove the last part "from ReadRepetitionLevel..." I don't think that's 
accurate anymore


Line 1084:   void* slot = tuple->GetSlot(tuple_offset_);
just inline into L1087


http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 511:   /// Recursively reads from children_ to assemble a single 
CollectionValue into
Update comment to reflect API change


http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 22: 
remove blank lines


Line 150:   inline bool IsValidDate() const {
nice


http://gerrit.cloudera.org:8080/#/c/4968/4/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 312:    "The valid date range is 1400..9999."),
mention the full range with year-month-day to make it clearer


http://gerrit.cloudera.org:8080/#/c/4968/4/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 382:   hadoop fs -put -f 
${IMPALA_HOME}/testdata/bad_parquet_data/out-of-range-timestamp.parq \
Let's avoid doing this and instead create a test table on-the-fly in the test. 
There's no need to store this in the snapshot and having the loading separate 
from the actual test can cause confusion.


http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 248:   def test_zero_rows(self, vector, unique_database):
it would be nice to have a new test test_timestamp_out_of_range that looks like 
this one

the test_corrput_files doesn't really follow best practices imo, and let's not 
fix that in this patch, so adding a new test seems easiest/cleanest

also it's arguable whether these bad timestamp files are really corrupt, the 
values are certainly legal INT96, but the interpretation as a timestamp is the 
problem


-- 
To view, visit http://gerrit.cloudera.org:8080/4968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to