[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16278 to look at the new patch set (#2). Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. IMPALA-10005: Fix Snappy decompression for non-block filesystems Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED type compression in the backend. However, for non-block filesystems, the frontend is incorrectly passing THdfsCompression::SNAPPY instead. On debug builds, this leads to a DCHECK when trying to read Snappy-compressed text. On release builds, it fails to decompress the data. This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED for Snappy-compressed text. This reworks query_test/test_compressed_formats.py to provide better coverage: - Changed the RC and Seq test cases to verify that the file extension doesn't matter. Added Avro to this case as well. - Fixed the text case to use appropriate extensions (fixing IMPALA-9004) - Changed the utility function so it doesn't use Hive. This allows it to be enabled on non-HDFS filesystems like S3. - Changed the test to use unique_database and allow parallel execution. - Changed the test to run in the core job, so it now has coverage on the usual S3 test configuration. It is reasonably quick (1-2 minutes) and runs in parallel. Testing: - Exhaustive job - Core s3 job - Changed the frontend to force it to use the code for non-block filesystems (i.e. the TFileSplitGeneratorSpec code) and verified that it is now able to read Snappy-compressed text. Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac --- M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M tests/query_test/test_compressed_formats.py 2 files changed, 132 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/16278/2 -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6787/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 04 Aug 2020 15:42:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6222/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 04 Aug 2020 16:38:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6222/ -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Tue, 04 Aug 2020 21:41:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6226/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Wed, 05 Aug 2020 00:44:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Wed, 05 Aug 2020 05:58:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: (3 comments) mostly questions http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@9 PS2, Line 9: Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED is this true of all engines? do all engines use SNAPPY_BLOCKED for snappy-compressed text? I guess, put another way, if we write snappy-compressed text via Hive, can Impala still read it after this change? http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@10 PS2, Line 10: for non-block filesystems why does this only happen for non-block filesystems? http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@24 PS2, Line 24: Changed the utility function so it doesn't use Hive. adding coverage for S3 is nice, but do we lose any inter-operability coverage here between Hive and Impala? -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 06 Aug 2020 19:30:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@9 PS2, Line 9: Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED > is this true of all engines? do all engines use SNAPPY_BLOCKED for snappy-c Under the covers, we were already using SNAPPY_BLOCKED before. All our Snappy text test tables are written by Hive, and Hive uses SNAPPY_BLOCKED. Our text scanner has an assert that verifies that we don't pass THdfsCompression::SNAPPY into it. In the old code, the way this worked is that we translated SNAPPY to THdfsCompression::SNAPPY_BLOCKED when we converted to thrift here: https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java#L79 The non-block filesystems don't go through that Thrift codepath. They go through a flatbuffers codepath, which inadvertently didn't do the conversion from SNAPPY to SNAPPY_BLOCKED. https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java#L93 Long story short: the behavior should be the same as before (except non-block is fixed) http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@10 PS2, Line 10: for non-block filesystems > why does this only happen for non-block filesystems? Rolled the answer to this into the other comment. http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@24 PS2, Line 24: Changed the utility function so it doesn't use Hive. > adding coverage for S3 is nice, but do we lose any inter-operability covera The Hive statements are "create table like" statements and "drop table" statements, which are purely metadata. The compression codec is not part of the metadata. I don't think we lose any coverage that we don't have covered by other tests. The data that we are using was already written by Hive during dataload (i.e. we aren't writing data with Hive in this test). -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 06 Aug 2020 20:02:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Code-Review+2 Thanks for the explanations. LGTM. -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 06 Aug 2020 20:11:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. Patch Set 2: Thanks for the review! -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 2 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Thu, 06 Aug 2020 20:12:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems
Joe McDonnell has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16278 ) Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems .. IMPALA-10005: Fix Snappy decompression for non-block filesystems Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED type compression in the backend. However, for non-block filesystems, the frontend is incorrectly passing THdfsCompression::SNAPPY instead. On debug builds, this leads to a DCHECK when trying to read Snappy-compressed text. On release builds, it fails to decompress the data. This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED for Snappy-compressed text. This reworks query_test/test_compressed_formats.py to provide better coverage: - Changed the RC and Seq test cases to verify that the file extension doesn't matter. Added Avro to this case as well. - Fixed the text case to use appropriate extensions (fixing IMPALA-9004) - Changed the utility function so it doesn't use Hive. This allows it to be enabled on non-HDFS filesystems like S3. - Changed the test to use unique_database and allow parallel execution. - Changed the test to run in the core job, so it now has coverage on the usual S3 test configuration. It is reasonably quick (1-2 minutes) and runs in parallel. Testing: - Exhaustive job - Core s3 job - Changed the frontend to force it to use the code for non-block filesystems (i.e. the TFileSplitGeneratorSpec code) and verified that it is now able to read Snappy-compressed text. Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Reviewed-on: http://gerrit.cloudera.org:8080/16278 Tested-by: Impala Public Jenkins Reviewed-by: Sahil Takiar --- M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java M tests/query_test/test_compressed_formats.py 2 files changed, 132 insertions(+), 84 deletions(-) Approvals: Impala Public Jenkins: Verified Sahil Takiar: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/16278 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac Gerrit-Change-Number: 16278 Gerrit-PatchSet: 3 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar