[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Internal Jenkins has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 2: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range as we are scanning Parquet. Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Reviewed-on: http://gerrit.cloudera.org:8080/5343 Reviewed-by: Taras Bobrovytsky Tested-by: Internal Jenkins --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 141 insertions(+), 36 deletions(-) Approvals: Taras Bobrovytsky: Looks good to me, approved Internal Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 2: Code-Review+2 Carrying the +2. -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/5343/1/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 143: /// the time falls into 00:00:00..23:59:59.9 range. > Stale comment - we don't validate time. Done -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Internal Jenkins, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5343 to look at the new patch set (#2). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range as we are scanning Parquet. Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 141 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5343/2 -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Internal Jenkins has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 1: Verified-1 Build failed: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/586/ -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 1: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/5343/1/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 143: /// the time falls into 00:00:00..23:59:59.9 range. Stale comment - we don't validate time. -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 1: The previous review https://gerrit.cloudera.org/#/c/4968/ has been abandoned because I deleted a draft, which put the reivew in a weird state. (Gerrit bug) -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has uploaded a new change for review. http://gerrit.cloudera.org:8080/5343 Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range as we are scanning Parquet. Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 142 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5343/1 -- To view, visit http://gerrit.cloudera.org:8080/5343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/4968/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 657: > Unnecessary blank line. Done http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 151: const static int64_t NUMBER_OF_NANOSECONDS_IN_24H = > boost provides a direct way to construct the min and max date with Done PS7, Line 160: g buffer. The size of the buffer should be a > It looks like boost date_time doesn't support leap seconds: https://groups. After discussing offline, we agreed that we will not implement validation of number of seconds since midnight in this patch. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#10). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 142 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/10 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/4968/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 657: Unnecessary blank line. http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 151: const static int64_t NUMBER_OF_NANOSECONDS_IN_24H = boost provides a direct way to construct the min and max date with boost::gregorian::date(boost::date_time::min_date_time) boost::gregorian::date(boost::date_time::max_date_time) PS7, Line 160: g buffer. The size of the buffer should be a > Yes, that's true, some days can have an extra second: https://en.wikipedia. It looks like boost date_time doesn't support leap seconds: https://groups.google.com/forum/#!topic/boost-list/i_0h_amkk-c "Leap second support never got added to the library -- most applications can ignore them. In principle, of course, you can use the library to help you do the calculations. Basically they are like leap years except that they are irregular -- so you have to use a collection to perform the adjustment instead of an algorithm." I think we should allow for one leap second per day just so we don't regress anything where the data is potentially valid. It would be good to add some testing of this (maybe as a follow) to make sure we don't crash, even if we do return bogus results. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#9). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 150 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/9 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#9). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 150 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/9 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#9). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 152 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/9 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#9). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 152 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/9 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 8: (4 comments) http://gerrit.cloudera.org:8080/#/c/4968/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: ErrorMsg* msg = NULL; > see comment below. this pointer is never freed. what's the objection to pu Ok, I agree, it's better to put the logic inside ValidateSlot(). Line 484: // This point is reached if slot validation succeeds, but slot conversion fails. > but don't we need to validate the converted value? i.e. the timestamp might Convert slot should never return an invalid value. I tried it manually and it works correctly. Line 603: *msg = new ErrorMsg(TErrorCode::PARQUET_TIMESTAMP_OUT_OF_RANGE, > this memory is leaked. This is not a problem any more. http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: PS7, Line 160: g buffer. The size of the buffer should be a > What I'm asking is: don't some days legitimately have more than NUMBER_OF_N Yes, that's true, some days can have an extra second: https://en.wikipedia.org/wiki/Leap_second So far there hasn't been a case where a day has 2 leap seconds, but it's not impossible in the future. If we do this, the way seconds are printed to screen would be incorrect, Impala would print "24:00:00" instead of "23:59:60". Also, we would need to be able to cast string as timestamp that looks like "23:59:60" if we want to support leap seconds. I'm not really sure what's the right solution here. Should we just increase the constant by 1? By the way, even Oracle doesn't handle leap seconds properly (the number of seconds can't be greater that 59): http://stackoverflow.com/questions/31136211/how-to-handle-leap-seconds-in-oracle -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Dan Hecht has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 8: (4 comments) http://gerrit.cloudera.org:8080/#/c/4968/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: ErrorMsg* msg = NULL; see comment below. this pointer is never freed. what's the objection to putting the logic inside ValidateSlot() so that the memory allocation can stay automatic? Line 484: // This point is reached if slot validation succeeds, but slot conversion fails. but don't we need to validate the converted value? i.e. the timestamp might be within bounds before conversion, but not after. do we handle that? Line 603: *msg = new ErrorMsg(TErrorCode::PARQUET_TIMESTAMP_OUT_OF_RANGE, this memory is leaked. http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: PS7, Line 160: g buffer. The size of the buffer should be a > The result of a select statement would look like, for example: What I'm asking is: don't some days legitimately have more than NUMBER_OF_NANOSECNODS_IN_24H when leap seconds are in play? -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: ErrorMsg msg; > this has a non-trivial destructor, so we should probably only construct it I added some comments to make the code more clear. It doesn't seem elegant to pass tuple into ValidateSlot, so I fixed the problem with the non-trivial destructor differently. PS7, Line 474: val_ptr > you'll also have to be careful that we pass the right pointer here, dependi I think I'm passing it correctly in the latest patch. http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 147: /// Verifies that the timestamp date falls into a valid range (years 1400..). > update this comment to be clear that we also validate time. Done PS7, Line 156: l > doesn't this have to be "ll" (long long)? Yes, that's true (it's 64 bits). PS7, Line 160: time_.ticks() < NUMBER_OF_NANOSECONDS_IN_24H > I'm slightly worried about how this interacts with leap seconds. Are you s The result of a select statement would look like, for example: 1995-05-05 54:35:31 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#8). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet M testdata/workloads/functional-query/queries/QueryTest/grant_revoke.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 13 files changed, 154 insertions(+), 37 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/8 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Dan Hecht has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: ErrorMsg msg; this has a non-trivial destructor, so we should probably only construct it after checking NeedsValidation(). Also, why are NeedsValidation() and NeedsConversion() mutually exclusive? Actually, doesn't this break --convert_legacy_hive_parquet_utc_timestamps since we'll never take the conversion path for timestamp now? (Do we not have tests for that?) That is, I think both NeedsConversion() and NeedsValidation() can be true, and the code needs to be adjusted for that. Also consider tucking lines 475 to 480 inside ValidateSlot() and that way the ErrorMsg doesn't have to be passed back and forth, and it'll naturally keep the construction after the NeedsValidation() check. PS7, Line 474: val_ptr you'll also have to be careful that we pass the right pointer here, depending on how you address the comment above. http://gerrit.cloudera.org:8080/#/c/4968/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 147: /// Verifies that the timestamp date falls into a valid range (years 1400..). update this comment to be clear that we also validate time. PS7, Line 156: l doesn't this have to be "ll" (long long)? PS7, Line 160: time_.ticks() < NUMBER_OF_NANOSECONDS_IN_24H I'm slightly worried about how this interacts with leap seconds. Are you sure this is always illegal? If it doesn't crash, what behavior did you see (i.e. how do you know this is the correct bounds)? -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4968 to look at the new patch set (#7). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400.. year range and time falls into 00:00:00..23:59:59.9 range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/data/README A testdata/data/out_of_range_timestamp.parquet A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test M testdata/workloads/functional-query/queries/QueryTest/parquet.test M tests/common/impala_test_suite.py M tests/query_test/test_scanners.py 12 files changed, 149 insertions(+), 36 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/7 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: (10 comments) http://gerrit.cloudera.org:8080/#/c/4968/6//COMMIT_MSG Commit Message: PS6, Line 12: did't > typo Done PS6, Line 14: date > year* Done http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 526: parent_->parse_status_ = Status(*msg); > the claim is we are pulling this out to avoid status construction code, but Good point, Done. PS6, Line 598: > nit: extraneous space Done Line 599: if (!src->IsValidDate()) { > UNLIKELY Done, (also changed the order of the if statement). http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: PS6, Line 512: > was this an incorrect comment? I saw a comment in the implementation, so I didn't think it was true any more: "// AssembleCollection() advances child readers, so we don't need to call NextLevels()" I looked at the implementation, but it turns out AssembleCollection() actually calls NextLevels(). I'll put the comment back. PS6, Line 449: void* slot > shouldn't this be 'tuple'. but the fact that this compiled probably means Yeah, I don't think this is used anywhere, so I'll delete it. http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 156: return date_.day_number() >= MIN_DAY_NUMBER && date_.day_number() <= MAX_DAY_NUMBER; > are we sure these are the only invalidate TimestampValue's that may have be Yes, it turns out there are (but they don't cause a crash). I added validation of time_ as well. http://gerrit.cloudera.org:8080/#/c/4968/6/testdata/bad_parquet_data/README File testdata/bad_parquet_data/README: PS6, Line 16: out-of-range-timestamp.parq > not sure what you mean -- aren't you adding a parquet file to test this? co I'm adding a file to testdata/data, this readme is in testdata/bad_parquet_data http://gerrit.cloudera.org:8080/#/c/4968/6/testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test File testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test: PS6, Line 7: Parquet file '$NAM > why is the message repeated? Done -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Dan Hecht has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: (8 comments) http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 526: parent_->parse_status_ = Status(*msg); the claim is we are pulling this out to avoid status construction code, but didn't we already construct a status if we call this function (in LogOrReturnError())? PS6, Line 598: nit: extraneous space Line 599: if (!src->IsValidDate()) { UNLIKELY http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: PS6, Line 512: was this an incorrect comment? PS6, Line 449: void* slot shouldn't this be 'tuple'. but the fact that this compiled probably means that this method isn't actually ever called or defined (i.e. the one in BaseScalarColumnReader), and so this declaration should just be deleted. http://gerrit.cloudera.org:8080/#/c/4968/6/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 156: return date_.day_number() >= MIN_DAY_NUMBER && date_.day_number() <= MAX_DAY_NUMBER; are we sure these are the only invalidate TimestampValue's that may have been copied in? Are there any invalid time_ values? http://gerrit.cloudera.org:8080/#/c/4968/6/testdata/bad_parquet_data/README File testdata/bad_parquet_data/README: PS6, Line 16: out-of-range-timestamp.parq > The parquet file is no longer in this folder, so I will remove this comment not sure what you mean -- aren't you adding a parquet file to test this? could you just make whatever change you are proposing. http://gerrit.cloudera.org:8080/#/c/4968/6/testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test File testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test: PS6, Line 7: Parquet file '$NAM why is the message repeated? -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/6//COMMIT_MSG Commit Message: PS6, Line 14: date year* -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Laszlo Gaal has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/6//COMMIT_MSG Commit Message: PS6, Line 12: did't typo -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Laszlo Gaal Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/6/testdata/bad_parquet_data/README File testdata/bad_parquet_data/README: PS6, Line 16: out-of-range-timestamp.parq The parquet file is no longer in this folder, so I will remove this comment. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Alex Behm has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/4968/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 445: /// Writes the next value into the next slot in the *tuple using pool if necessary. > ... into the destination slot in tuple ... Done Line 480: } else if (UNLIKELY(NeedsConversion() && > Is it possible that an otherwise invalid value becomes valid after conversi No, because the input to conversion must be a valid date. http://gerrit.cloudera.org:8080/#/c/4968/5/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 512: /// the next slot in *tuple. > ... into the appropriate destination slot in 'tuple'. Done http://gerrit.cloudera.org:8080/#/c/4968/5/testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test File testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test: Line 5: DROP TABLE IF EXISTS invalid_parquet_timestamp; > setting up the table is already done in the .py file, right? Done -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#6). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we did't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the timestamp falls into 1400.. date range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/bad_parquet_data/README A testdata/data/out_of_range_timestamp.parquet A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M tests/query_test/test_scanners.py 9 files changed, 133 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/6 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Alex Behm has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 5: (4 comments) Almost ready for +1 http://gerrit.cloudera.org:8080/#/c/4968/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 445: /// Writes the next value into the next slot in the *tuple using pool if necessary. ... into the destination slot in tuple ... (it's not the "next" slot, it is always the same slot for a particular column reader) Line 480: } else if (UNLIKELY(NeedsConversion() && Is it possible that an otherwise invalid value becomes valid after conversion? http://gerrit.cloudera.org:8080/#/c/4968/5/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 512: /// the next slot in *tuple. ... into the appropriate destination slot in 'tuple'. (it's not the "next" slot, it is always the same slot for a particular column reader) http://gerrit.cloudera.org:8080/#/c/4968/5/testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test File testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test: Line 5: DROP TABLE IF EXISTS invalid_parquet_timestamp; setting up the table is already done in the .py file, right? -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#5). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we did't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the timestamp falls into 1400.. date range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/bad_parquet_data/README A testdata/data/out_of_range_timestamp.parquet A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test M tests/query_test/test_scanners.py 9 files changed, 136 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/5 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 4: (15 comments) http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 77: > I'll remove the empty lines I accidentally added here and in other files. Done Line 233: // TODO: consider adding validation for other types, such as DECIMAL. > Let's file a JIRA instead. You don't need to identify whether there is a bu Done Line 388: ReadSlot(tuple, pool); > single line if it fits Done Line 475: if (NeedsValidation() && !ValidateSlot(reinterpret_cast(slot))) { > The API seems a little clunky because we call a generic ValidateSlot() but Added ErrorMsg* to ValidateSlot. Line 503: DCHECK(!needs_conversion_); > DCHECK(false)? This DCHECK should be removed completely. Line 513: /// Sets the slot to NULL if the slot value is invalid, e.g., due to being > Update comment, doesn't set anything Done Line 515: bool ValidateSlot(T* src) { > const function Done Line 520: /// Pull out slow-path Status construction code from ReadRepetitionLevel()/ > remove the last part "from ReadRepetitionLevel..." I don't think that's acc Done Line 1084: void* slot = tuple->GetSlot(tuple_offset_); > just inline into L1087 Done http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 511: /// Recursively reads from children_ to assemble a single CollectionValue into > Update comment to reflect API change Done http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 22: > remove blank lines Done Line 150: inline bool IsValidDate() const { > nice :) http://gerrit.cloudera.org:8080/#/c/4968/4/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 312:"The valid date range is 1400..."), > mention the full range with year-month-day to make it clearer Done http://gerrit.cloudera.org:8080/#/c/4968/4/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: Line 382: hadoop fs -put -f ${IMPALA_HOME}/testdata/bad_parquet_data/out-of-range-timestamp.parq \ > Let's avoid doing this and instead create a test table on-the-fly in the te Done http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 248: def test_zero_rows(self, vector, unique_database): > it would be nice to have a new test test_timestamp_out_of_range that looks Done -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Alex Behm has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 4: (14 comments) http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 233: // TODO: consider adding validation for other types, such as DECIMAL. Let's file a JIRA instead. You don't need to identify whether there is a bug or not, the JIRA can say that we should investigate the DECIMAL case (and link to this JIRA). Line 388: ReadSlot(tuple, pool); single line if it fits Line 475: if (NeedsValidation() && !ValidateSlot(reinterpret_cast(slot))) { The API seems a little clunky because we call a generic ValidateSlot() but then set a type-specific error message. How would you generalize this for DECIMAL? Imo it'll make the code easier to follow without baking in the TIMESTAMP assumption here. Maybe ValidateSlot() could take an ErrorMsg* as output param which would be set when returning false. Or perhaps you have another idea. Line 503: DCHECK(!needs_conversion_); DCHECK(false)? Line 513: /// Sets the slot to NULL if the slot value is invalid, e.g., due to being Update comment, doesn't set anything Line 515: bool ValidateSlot(T* src) { const function Line 520: /// Pull out slow-path Status construction code from ReadRepetitionLevel()/ remove the last part "from ReadRepetitionLevel..." I don't think that's accurate anymore Line 1084: void* slot = tuple->GetSlot(tuple_offset_); just inline into L1087 http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 511: /// Recursively reads from children_ to assemble a single CollectionValue into Update comment to reflect API change http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 22: remove blank lines Line 150: inline bool IsValidDate() const { nice http://gerrit.cloudera.org:8080/#/c/4968/4/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 312:"The valid date range is 1400..."), mention the full range with year-month-day to make it clearer http://gerrit.cloudera.org:8080/#/c/4968/4/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: Line 382: hadoop fs -put -f ${IMPALA_HOME}/testdata/bad_parquet_data/out-of-range-timestamp.parq \ Let's avoid doing this and instead create a test table on-the-fly in the test. There's no need to store this in the snapshot and having the loading separate from the actual test can cause confusion. http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 248: def test_zero_rows(self, vector, unique_database): it would be nice to have a new test test_timestamp_out_of_range that looks like this one the test_corrput_files doesn't really follow best practices imo, and let's not fix that in this patch, so adding a new test seems easiest/cleanest also it's arguable whether these bad timestamp files are really corrupt, the values are certainly legal INT96, but the interpretation as a timestamp is the problem -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 44: #from tests.util.get_parquet_metadata import get_parquet_metadata I'll remove this. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4363: Add Parquet timestamp validation .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 77: I'll remove the empty lines I accidentally added here and in other files. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#4). Change subject: IMPALA-4363: Add Parquet timestamp validation .. IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we did't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the timestamp falls into 1400.. date range as we are scanning Parquet. Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/bad_parquet_data/README A testdata/bad_parquet_data/out-of-range-timestamp.parq M testdata/bin/create-load-data.sh M testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test M tests/query_test/test_scanners.py 10 files changed, 120 insertions(+), 14 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/4 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky