Hello Lars Volker, Dimitris Tsirogiannis, Tim Armstrong, Dan Hecht, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/8523 to look at the new patch set (#11). Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls ...................................................................... IMPALA-5931: Generates scan ranges in planner for s3/adls Currently, for filesystems that do not include physical block information (e.g., block replica locations, caching), synthetic blocks are generated and stored in the catalog when metadata is loaded. Example file systems for which this is done includes S3, ADLS, and local fs. This change avoids generating these blocks when metadata is loaded. Instead, scan ranges are directly generated from such files by the backend coordinator. Previously, all scan ranges were produced by the planner in HDFSScanNode in the frontend. Now, those files without block information are sent to the coordinator represented by a split specification that determines how the coordinator will create scan ranges to send to executors. This change reduces the space needed in the catalog and reduces the scan range data structures that are passed from the frontend to the backend when planning and coordinating a query. In addition a bug is avoided where non-splittable files were being split anyways to support the query parameter that places a limit on scan ranges. Testing: - added backend scheduler tests - mixed-filesystems test covers tables/queries with multiple fs's. - local fs tests cover the code paths in this change - all core tests pass when configured with s3 - manually tried larger local filesystem tables (tpch) with multiple partitions and observed the same scan ranges. - TODO: adls testing Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 --- M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/util/CMakeLists.txt A be/src/util/flat_buffer.cc A be/src/util/flat_buffer.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 13 files changed, 641 insertions(+), 191 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/11 -- To view, visit http://gerrit.cloudera.org:8080/8523 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5 Gerrit-Change-Number: 8523 Gerrit-PatchSet: 11 Gerrit-Owner: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>