[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 21: This change did not cherrypick successfully into branch 2.x. To resolve this, please do the cherry-pick manually and submit it to Gerrit at refs/for/2.x or add an exception to the branch 2.x copy of bin/ignored_commits.json. Thanks, your friendly bot at https://jenkins.impala.io/job/cherrypick-2.x-and-test/609/ . -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 21 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Mon, 11 Jun 2018 07:57:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Reviewed-on: http://gerrit.cloudera.org:8080/8523 Reviewed-by: Vuk Ercegovac Tested-by: Impala Public Jenkins --- M CMakeLists.txt M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 22 files changed, 677 insertions(+), 247 deletions(-) Approvals: Vuk Ercegovac: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 21 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 20: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 20 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 08 Jun 2018 20:28:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/34/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 20 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 08 Jun 2018 16:56:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/34/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 20 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 08 Jun 2018 16:56:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/32/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 20 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 08 Jun 2018 16:35:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 19: rebased (clean). re-ran s3 tests (pass). -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 19 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 08 Jun 2018 06:43:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 18: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 18 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 23:09:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 18: latest change explicitly adds a dep for flat buffers for the backend (in the same place that the be depends on thrift and protos). afaict, this should have been in there for this change-- not sure why some of the builds didn't need it. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 18 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 21:13:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#18). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M CMakeLists.txt M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 22 files changed, 677 insertions(+), 247 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/18 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 18 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 17: Odd that this happens with gvo, but not for internal builds. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 17:40:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 17: hmm. looks like some dependency order for generating flat buffers. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 16:02:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 17: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2546/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 10:00:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 17: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2546/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 06:36:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 17: Code-Review+2 rebase and fix minor conflict. carrying dan's +2 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 06:29:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#17). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 676 insertions(+), 246 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/17 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 17 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 16: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 16 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 24 May 2018 00:13:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#16). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 676 insertions(+), 246 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/16 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 16 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 15: (3 comments) http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@783 PS15, Line 783: boolean checkMissingDiskIds = FileSystemUtil.supportsStorageIds(partitionFs); : boolean supportsBlocks = FileSystemUtil.supportsStorageIds(partitionFs); > these are equivalent. how about getting rid of the checkMissingDiskIds, and see reply to last comment for naming. http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@812 PS15, Line 812: result.first > was changing the polarity of this a bug fix? yes. there was a bug here, as well as double-counting total_bytes from a prev. merge, and several npe's from that latest thrift refactor (java thrift list is null, not empty) http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@914 PS15, Line 914: checkMissingDiskIds && !fileDesc.getIsEc() > this was confusing because when I read the name "checkMissingDiskIds" I tho renamed to: fsHasBlocks -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 15 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 23 May 2018 20:44:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 15: (3 comments) http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@783 PS15, Line 783: boolean checkMissingDiskIds = FileSystemUtil.supportsStorageIds(partitionFs); : boolean supportsBlocks = FileSystemUtil.supportsStorageIds(partitionFs); these are equivalent. how about getting rid of the checkMissingDiskIds, and renaming supportsBlocks to fsSupportsBlocks? http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@812 PS15, Line 812: result.first was changing the polarity of this a bug fix? http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@914 PS15, Line 914: checkMissingDiskIds && !fileDesc.getIsEc() this was confusing because when I read the name "checkMissingDiskIds" I thought: why doesn't the setting of that variable take into account EC? How about renaming that parameter to: fsHasBlocks or fsHasDiskIds or something to indicate that's a filesystem level property? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 15 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 23 May 2018 15:37:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#15). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 677 insertions(+), 245 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/15 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 15 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 14: Next change handles the merge. It was straightforward; the ec files are handled as regular hdfs files, so do not go through the prev. synthetic block generation. For testing, I re-ran core and s3 tests. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 14 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 23 May 2018 05:39:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 14: Looks like you'll have conflicts with this: https://gerrit.cloudera.org/#/c/10413/ -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 14 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 22 May 2018 16:29:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 14: running more tests... -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 14 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Sat, 19 May 2018 20:26:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 14 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Sat, 19 May 2018 03:02:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#14). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 676 insertions(+), 241 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/14 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 14 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 13: (4 comments) http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc@302 PS13, Line 302: for (const TScanRangeLocationList& range : entry.second.concrete_range) { : expanded_locations.push_back(range); : } > that could be just expanded_locations.insert(concrete_range.begin(), concre Done http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift File common/thrift/Planner.thrift: http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift@108 PS13, Line 108: concrete_range > plural since the field is a list: concrete_ranges, split_specs Done http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java File fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java@224 PS13, Line 224: scanRangeSpecs_.getSplit_specSize() > this seems wrong - a single spec may result in multiple hosts, no? removed. this class never adds specs, but added a precondition check in init in the event that's changed for some reason. http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1126 PS13, Line 1126: scanRangeSpecs_.getSplit_specSize(); > shouldn't that do some calculation based on file size and blocks size and i good catch. needed to change other places that directly use the size of the spec list. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 13 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Sat, 19 May 2018 01:39:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 13: (4 comments) Thanks, that looks simpler and clearer now. Just some minor things. http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc@302 PS13, Line 302: for (const TScanRangeLocationList& range : entry.second.concrete_range) { : expanded_locations.push_back(range); : } that could be just expanded_locations.insert(concrete_range.begin(), concrete_range.end())? http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift File common/thrift/Planner.thrift: http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift@108 PS13, Line 108: concrete_range plural since the field is a list: concrete_ranges, split_specs http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java File fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java@224 PS13, Line 224: scanRangeSpecs_.getSplit_specSize() this seems wrong - a single spec may result in multiple hosts, no? Though I guess for hbase this won't be set so in practice doesn't matter. But should we instead just assert that the split_spec list size is 0? http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1126 PS13, Line 1126: scanRangeSpecs_.getSplit_specSize(); shouldn't that do some calculation based on file size and blocks size and is splittable? i.e. after your change we'll get a different number for numRemoteRanges when running on S3, right? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 13 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 18 May 2018 23:09:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#13). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 665 insertions(+), 240 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/13 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 13 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift File common/thrift/Frontend.thrift: http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift@375 PS12, Line 375: per_node_scan_ranges > Thanks, the new thrift structures make more sense to me. Maybe it's simpler yeah, it lets me revert to upfront expansion as I had couple of changes back (but without mixed concrete/spec) and get rid of that iterator. http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift@202 PS12, Line 202: THdfsFileSplitGeneratorSpec > update Done -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 18 May 2018 21:24:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift File common/thrift/Frontend.thrift: http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift@375 PS12, Line 375: per_node_scan_ranges Thanks, the new thrift structures make more sense to me. Maybe it's simpler by moving the "union" up one level. i.e. make this: map> Then the backend can be simplified: the "ConcreteList" is already in the format you need, and the GeneratorSpecList can be converted to a ConcreteList in one go that you pass to the scheduler. So then no need for the iterator thingy. What do you think? http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift@202 PS12, Line 202: THdfsFileSplitGeneratorSpec update -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 17 May 2018 21:48:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13 PS12, Line 13: includes S3, ADLS, and local fs. > Does this change the dependency on fs.s3a.block.size? This change does not attempt to change the scan ranges that are produced for these file systems. Currently, we rely on the filestatus blocksize (see L177 of HdfsPartition) when synthesizing blocks to store in the catalog. This patch shifts that synthesis to the scheduler, so that its generated for each use instead of stored in memory. The block parameter used for this synthesis is set in this change on L784 of HdfsScanNode. Instead of filestatus blocksize, it uses the filesystem's default block size, which I think is the same thing for these file-systems (at least, from what I could tell from hadoop.fs.FileSystem). -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 16 May 2018 06:04:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Mostafa Mokhtar has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13 PS12, Line 13: includes S3, ADLS, and local fs. > there are todo's in this change to extend to hdfs. pls see the comment in P Does this change the dependency on fs.s3a.block.size? Having to rely on that parameter for generate even scan ranges is error prone. https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_s3.html#s3_performance -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 15 May 2018 23:15:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13 PS12, Line 13: includes S3, ADLS, and local fs. > What about erasure coded HDFS data? there are todo's in this change to extend to hdfs. pls see the comment in PlanNodes.thrift. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 15 May 2018 17:35:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Mostafa Mokhtar has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13 PS12, Line 13: includes S3, ADLS, and local fs. What about erasure coded HDFS data? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 15 May 2018 17:26:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#12). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/Frontend.thrift M common/thrift/PlanNodes.thrift M common/thrift/Planner.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java M fe/src/main/java/org/apache/impala/planner/ScanNode.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java 21 files changed, 718 insertions(+), 233 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/12 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 12 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (1 comment) getting back to this change... the upcoming update includes the idea to better separate the split specs vs. concrete scan ranges. rebase changes caught this in the middle so the update also includes rebase changes (only affected HdfsScanNode.java). http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228 PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec hdfs_file_split_generator_spec > will try it out. The additional changes are relatively mechanical. Since this design explicitly does not allow these specs to propagate further in the backend, I agree that we're better off going this route. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 10 May 2018 21:42:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: I got blocked behind some lib-cache stuff, but this is next on my list. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 29 Mar 2018 15:57:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: Are we going to move forward with this patch? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 29 Mar 2018 15:54:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG@34 PS11, Line 34: - all core tests pass when configured with s3 > Do the tests validate that an "even" number of bytes is assigned to each ba That block size is used in the specs here, HdfsScanNode.java L755. The division into scan ranges should be the same as before. As before, each of these scan ranges is "random", e.g., no process affinity, so as long as assignment of remote reads balances bytes per backend, then the result should be balanced. There's a scheduler unit test that tests remote reads and checks that the assignment is balanced. I did some manual queries and saw roughly even assignment. That said, I don't have a test that looks at bytes-per-host and checks that they're even. Will look at what other tests we have for this and what can be added. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 29 Mar 2018 01:49:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Mostafa Mokhtar has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG@34 PS11, Line 34: - all core tests pass when configured with s3 Do the tests validate that an "even" number of bytes is assigned to each backend? For S3 scans Impala relies fs.s3a.block.size to ensure scan ranges are evenly distributed.a -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 28 Mar 2018 06:02:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc@323 PS11, Line 323: ScanRangeIter range_iter(entry.second); > sorry if this was already discussed, but why not just transform the list he idea was to not make multiple passes. common case (for now) is to not explode. the current scheme does the least extra work over what we have now. http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228 PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec hdfs_file_split_generator_spec > Right, but what I'm thinking is replace listhttp://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 08 Mar 2018 00:03:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc@323 PS11, Line 323: ScanRangeIter range_iter(entry.second); sorry if this was already discussed, but why not just transform the list here all at once and pass the exploded list of ranges into ComputeScanRangeAssignment()? Is the intention to optimize memory use? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 07 Mar 2018 23:58:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228 PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec hdfs_file_split_generator_spec > we currently put all scan ranges from all partitions into the same list. pa Right, but what I'm thinking is replace list with list where that thing is basically a union: // Needs better name struct TSplitConcreteOrGenSpec { // Only set for concrete range. optional Planner.TScanRangeLocationList; // Only set for generated spec optional THdfsFileSplitGeneratorSpec } After all, once we've traversed in TScanRangeLocationList we're now talking about physical locations and such, so that stuff seems like it belongs on the "concrete" side of the divide. Maybe I'm still not seeing what you mean by "least damage" though. "Least damage" in what sense? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 07 Mar 2018 23:30:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@206 PS11, Line 206: Hdfs > I wonder if we should just drop Hdfs from this name given that this is used yeah, I just extended the current naming. its a good point to change it so I'll do that since I don't see a good reason now to carry along the current naming. http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228 PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec hdfs_file_split_generator_spec > Isn't it that the frontend generates either a list of (concrete) scan range we currently put all scan ranges from all partitions into the same list. partitions may differ in their file system, so from the hdfs scan node's perspective, it needs to create a list that can have a mix of concrete and generated ranges. we'd like to handle both cases uniformly (see L205) so doing it this way in this change does the least damage to the part that is not changed (concrete hdfs scan ranges). -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 07 Mar 2018 23:15:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 11: (2 comments) Had a question about the thrift, I will look at the backend side too. http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@206 PS11, Line 206: Hdfs I wonder if we should just drop Hdfs from this name given that this is used for other "filesystem" like storage (i.e. S3, ADLS). Ok to ignore if you prefer to keep it for some reason (maybe the reason to keep it is consistency with e.g. HdfsScanNode, which also handles S3/ADLS since they go through the same FileSystem & libhdfs access path). http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228 PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec hdfs_file_split_generator_spec Isn't it that the frontend generates either a list of (concrete) scan ranges, or it just gives a list of FileSplitGeneratorSpec? i.e. I would have expected this to appear as the alternative to Planner.TScanRangeLocationList since the that is the "concrete" scan range list that this thing is going to effectively explode into. i.e. it seems clearer to me to leave TScanRange as the "concrete" range that the coordinator will generate using this new "spec". So what's the reasoning for putting this here instead of higher up in the thrift plan tree? -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 07 Mar 2018 22:34:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Dimitris Tsirogiannis, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#11). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 13 files changed, 641 insertions(+), 191 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/11 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 9: (14 comments) main changes: (1) minimized changes to the main scheduler method with an iterator, (2) added backend tests http://gerrit.cloudera.org:8080/#/c/8523/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/9//COMMIT_MSG@30 PS9, Line 30: Testing: > Do we have a test that covers mixed queries? Have you considered adding a t thx for the suggestions. added tests to scheduler-test.cc. there are end-to-end tests for testing multiple file systems in the same table/query (see testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test) http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.h File be/src/scheduling/scheduler.h: http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.h@450 PS9, Line 450: instances > nit: instance Done http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@257 PS9, Line 257: int num_ranges = 0; > Have you considered generating the scan ranges here and passing them into C I tried something like this, but it created two pases so I backed it out. After discussing, so that we avoid additional complexity to the main ComputeRangeAssignment method, I wrapped the original scan ranges in an iterator that generates new scan ranges or returns the original ones. http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@504 PS9, Line 504: // expanded_locations. 'spec' is assumed to *not* have block location information. > missing rename? removed http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@504 PS9, Line 504: 'spec' is assumed to *not* have block location information. > It doesn't have a member to do so. Is this a stale comment? removed. this refers to longer range plan where hdfs files (with block info) will also be handled. the todo in the thrift spec covers this. http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@505 PS9, Line 505: HDFS > This comment seems a bit misleading, since S3 etc are not HDFS filesystems. remvoed http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@516 PS9, Line 516: long scan_range_offset = 0; > Any reason to not use int64_t here? Done http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@518 PS9, Line 518: long scan_range_length = std::min(spec.max_block_size, fb_desc->length()); > I'm not sure it can happen, but if spec.max_block_size == 0, the loop below good point. added a check for this in FE as well as here. http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@582 PS9, Line 582: Defer > "Defer" read to me like if something is being done immediately. Can you cla Done http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@603 PS9, Line 603: *num_ranges += 1; > If you move this count to L676 right after the call to RecordScanRangeAssig got rid of this part of the api. the number of scan ranges is now accessed via the iterator. http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@681 PS9, Line 681: remote_scan_range_locations.push_back(); > nit: use vector<>::insert() instead of a loop here. removed http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@686 PS9, Line 686: // Allow exec_at_coord to take precedence over selecting a remote host. > This should only happen for generated scan ranges. Can you add a DCHECK to removed http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift@204 PS9, Line 204: // -1 means not-splittable > nit: add newlines above comments. Done http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift@205 PS9, Line 205: 2: required i64 max_block_size > Could this ever be infinity? I feel it might be clearer to have a boolean f Done -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 9 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Mon, 12 Feb 2018
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Lars Volker, Dimitris Tsirogiannis, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#10). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 13 files changed, 638 insertions(+), 191 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/10 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 10 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/8523/8/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/8/common/thrift/PlanNodes.thrift@212 PS8, Line 212: 3: optional binary kudu_scan_token we discussed in person, but since we'll eventually want to unify all these cases (the frontend can send blocks and locations), let's add a comment to that effect and perhaps reference the JIRA that will track the remaining work. That way, readers of the code can have a hint as to the direction we want things to move (and that currently we're in a transitional state that isn't the end goal). -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Tue, 30 Jan 2018 18:24:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 8: I probably won't have a chance to look through this in detail, but I think Lars should have a look at the scheduler changes - he's the most familiar with that code. -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 26 Jan 2018 18:20:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 8: Code-Review+1 carry +1 from dimitris -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 24 Jan 2018 18:29:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#8). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 11 files changed, 399 insertions(+), 179 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/8 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 7: (6 comments) in addition to the comments, decided to move that util function to its own file. http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509 PS7, Line 509: vector* expanded_locations, : int* num_generated) { > Indentation is off Done http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h File be/src/util/compress.h: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@136 PS7, Line 136: flat buffer > nit: FlatBuffer so that it is clear what this thing represents Done http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@139 PS7, Line 139: > nit: indentation Done http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc File be/src/util/compress.cc: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc@325 PS7, Line 325: THdfsCompression::type* thrift_compression) { > Your indentation is off is several places Done http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@154 PS7, Line 154: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@738 PS7, Line 738: supportsBlocks > in the other file you're using hasBlockInfo. I think I prefer this one bett Done -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 7 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Mon, 22 Jan 2018 23:55:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 7: Code-Review+1 (7 comments) Mostly minor comments but I am happy with the FE changes. You should ask someone to take another look at BE as well. http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509 PS7, Line 509: expanded_locations Maybe rename to 'generated_scan_ranges'? http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509 PS7, Line 509: vector* expanded_locations, : int* num_generated) { Indentation is off http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h File be/src/util/compress.h: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@136 PS7, Line 136: flat buffer nit: FlatBuffer so that it is clear what this thing represents http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@139 PS7, Line 139: nit: indentation http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc File be/src/util/compress.cc: http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc@325 PS7, Line 325: THdfsCompression::type* thrift_compression) { Your indentation is off is several places http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@154 PS7, Line 154: nit: extra space http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@738 PS7, Line 738: supportsBlocks in the other file you're using hasBlockInfo. I think I prefer this one better :) -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 7 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Mon, 22 Jan 2018 19:50:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#7). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/compress.cc M be/src/util/compress.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 10 files changed, 344 insertions(+), 180 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/7 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 7 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#6). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 8 files changed, 332 insertions(+), 183 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/6 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 6 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#5). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/service/client-request-state.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 9 files changed, 333 insertions(+), 183 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/5 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 5 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24 PS4, Line 24: datastructures > nit: data structures Done http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260 PS4, Line 260: // Map of plan node to scan ranges that are computed from request_. If a plan node is : // not present, then it has no associated additional scan ranges. : // Populated by the ComputeScanRanges. : std::mapcomputed_ranges_; > How is this related to PerNodeScanRanges (L41)? backing out this refactor. http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258 PS4, Line 258: // Combine actual and computed ranges from the schedule. : const vector* scan_ranges = : const vector* computed_ranges = : schedule->GetComputedRanges(entry.first); : vector combined_ranges; : if (computed_ranges != nullptr) { : for (const auto& range : entry.second) { : if (!range.scan_range.__isset.hdfs_file_split_spec) { : combined_ranges.push_back(range); : } : } : for (const auto& range : (*schedule->GetComputedRanges(entry.first))) { : combined_ranges.push_back(range); : } : scan_ranges = _ranges; : } > I think my previous comment made things worse, sorry about that. I think it no problems. added back here. expansion is done per spec in a separate helper function below. http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199 PS4, Line 199: 3: required i64 partition_id > Comment what this is referencing. Done http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209 PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec > Add a comment either here or at the top to describe when THdfsFileSplit vs Lets go with THdfsFileSplitGeneratorSpec http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126 PS4, Line 126: fileFormat' > nit: missing quote Done http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712 PS4, Line 712: fileDesc.getNumFileBlocks() > What will happen if this is a zero-byte file (no blocks). Can't we prune th changed it to be more explicit (based on the fs). http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743 PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName()) > Do we need to do this for every file? Can't we do this once per partition? good point, lifted that to be per partition. http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761 PS4, Line 761: disk ids are expected in blocks, :* checkMissingDiskIds should be set to true. If disk ids are checked, > "checkMissingDiskIds is true, " Done -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 4 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Thu, 18 Jan 2018 05:47:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24 PS4, Line 24: datastructures nit: data structures http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260 PS4, Line 260: // Map of plan node to scan ranges that are computed from request_. If a plan node is : // not present, then it has no associated additional scan ranges. : // Populated by the ComputeScanRanges. : std::mapcomputed_ranges_; How is this related to PerNodeScanRanges (L41)? http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258 PS4, Line 258: // Combine actual and computed ranges from the schedule. : const vector* scan_ranges = : const vector* computed_ranges = : schedule->GetComputedRanges(entry.first); : vector combined_ranges; : if (computed_ranges != nullptr) { : for (const auto& range : entry.second) { : if (!range.scan_range.__isset.hdfs_file_split_spec) { : combined_ranges.push_back(range); : } : } : for (const auto& range : (*schedule->GetComputedRanges(entry.first))) { : combined_ranges.push_back(range); : } : scan_ranges = _ranges; : } I think my previous comment made things worse, sorry about that. I think it's fine to move the scan range generation here. In that way, we will avoid the double iteration over all plan nodes and the additional state (per node computed scan ranges). Just make sure you wrap this into a separate function. http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199 PS4, Line 199: 3: required i64 partition_id Comment what this is referencing. http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209 PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec Add a comment either here or at the top to describe when THdfsFileSplit vs THdfsFileSplitSpec is set. Actually, THdfsFileSplitSpec may not be the best name. How about THdfsFileSplitGenerationSpec or something like that? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126 PS4, Line 126: fileFormat' nit: missing quote http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712 PS4, Line 712: fileDesc.getNumFileBlocks() What will happen if this is a zero-byte file (no blocks). Can't we prune this early on or it never reaches that point? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743 PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName()) Do we need to do this for every file? Can't we do this once per partition? http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761 PS4, Line 761: disk ids are expected in blocks, :* checkMissingDiskIds should be set to true. If disk ids are checked, "checkMissingDiskIds is true, " -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 4 Gerrit-Owner: Vuk Ercegovac Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 17 Jan 2018 19:30:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Hello Dimitris Tsirogiannis, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#4). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range datastructures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.cc M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/service/client-request-state.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 9 files changed, 343 insertions(+), 178 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/4 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 4 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc@582 PS3, Line 582: ComputeScanRangeAssignment > This function now does both assignment and generation of scan ranges. Also, I moved it to QueryScheduler... that seems to be the place where data about scheduling is stored (including the request with scan ranges from the FE). Now, the idea is to first generate these ranges (done in client request), then schedule. If there is a better place to store such per request info, let me know and I can easily move it there. The only thing the scheduler knows about all of this right now is to stitch together computed and given ranges-- that's a bit gorpy (L258) and I plan to fix it if this flow/organization looks ok. http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift@195 PS3, Line 195: struct THdfsFileSplitSpec { : 1: required CatalogObjects.THdfsFileDesc file_desc : // -1 means not-splittable : 2: required i64 max_block_size : 3: required i64 partition_id : } > THdfsFileDesc and THdfsFileSplit have quite a bit of intersection wrt the i I did this for two reasons: 1) I expect this Spec to be the only thing that comes out of the FE for hdfs files (unexpanded) and THdfsFileSplit will only be present on the BE (expanded). Currently, I'm doing this just for s3/adls/local, but I'm thinking to apply it to all (not this change) hdfs files. 2) I preferred to use the type-checker to tell me which representation I have (expanded vs. unexpanded). Otherwise, I'd have to peer into THdfsFileSplit to determine this state, which seems more error-prone. http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@131 PS3, Line 131: new int[] {} > I believe you don't have to do that. You can probably pass null and then av Done http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@338 PS3, Line 338: its > typo: it's Done http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@666 PS3, Line 666: private void generateScanRangeSpecs > No intention to be picky but can you move this and tansformBlocksToScanRang Done http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@694 PS3, Line 694:* TScanRangeLocationLists are added scanRanges_. If disk ids are expected in blocks > to Done -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 3 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Fri, 12 Jan 2018 01:19:09 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc@582 PS3, Line 582: ComputeScanRangeAssignment This function now does both assignment and generation of scan ranges. Also, it feels kind of weird to have the generation part in the scheduler. I think it may be better to separate these two concepts and move the scan range generation part out of here. http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift@195 PS3, Line 195: struct THdfsFileSplitSpec { : 1: required CatalogObjects.THdfsFileDesc file_desc : // -1 means not-splittable : 2: required i64 max_block_size : 3: required i64 partition_id : } THdfsFileDesc and THdfsFileSplit have quite a bit of intersection wrt the information they store. I am wondering if it would be simpler to modify THdfsFileSplit to handle both cases and avoid THdfsFileSplitSpec altogether. http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@131 PS3, Line 131: new int[] {} I believe you don't have to do that. You can probably pass null and then avoid L143 and L150 with couple of checks. http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@338 PS3, Line 338: its typo: it's http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@666 PS3, Line 666: private void generateScanRangeSpecs No intention to be picky but can you move this and tansformBlocksToScanRanges below function computeScanRangeLocations so that the code reads in a top-down fashion? thanks http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@694 PS3, Line 694:* TScanRangeLocationLists are added scanRanges_. If disk ids are expected in blocks to -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 3 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 03 Jan 2018 00:17:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. Patch Set 3: all tests pass with s3 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 3 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Vuk Ercegovac Gerrit-Comment-Date: Wed, 20 Dec 2017 17:19:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls
Vuk Ercegovac has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls .. IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range datastructures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 5 files changed, 294 insertions(+), 166 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/3 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 3 Gerrit-Owner: Vuk ErcegovacGerrit-Reviewer: Vuk Ercegovac
[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls (wip)
Vuk Ercegovac has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/8523 ) Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls (wip) .. IMPALA-5931: Generates scan ranges in planner for s3/adls (wip) Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range datastructures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. The "wip" status is currently to add/run more tests. Testing: - local filesystem tests exercise this code path - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: s3 and adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 5 files changed, 282 insertions(+), 165 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/2 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 2 Gerrit-Owner: Vuk Ercegovac