[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 21:

This change did not cherrypick successfully into branch 2.x. To resolve this, 
please do the cherry-pick manually and submit it to Gerrit at refs/for/2.x or 
add an exception to the branch 2.x copy of bin/ignored_commits.json. Thanks, 
your friendly bot at https://jenkins.impala.io/job/cherrypick-2.x-and-test/609/ 
.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 21
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Mon, 11 Jun 2018 07:57:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Reviewed-on: http://gerrit.cloudera.org:8080/8523
Reviewed-by: Vuk Ercegovac 
Tested-by: Impala Public Jenkins 
---
M CMakeLists.txt
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
22 files changed, 677 insertions(+), 247 deletions(-)

Approvals:
  Vuk Ercegovac: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 21
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 20: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 20
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 08 Jun 2018 20:28:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/34/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 20
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 08 Jun 2018 16:56:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/34/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 20
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 08 Jun 2018 16:56:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/32/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 20
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 08 Jun 2018 16:35:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-06-08 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 19:

rebased (clean). re-ran s3 tests (pass).


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 19
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 08 Jun 2018 06:43:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 18: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 18
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 23:09:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 18:

latest change explicitly adds a dep for flat buffers for the backend (in the 
same place that the be depends on thrift and protos). afaict, this should have 
been in there for this change-- not sure why some of the builds didn't need it.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 18
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 21:13:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa 
Mokhtar, Dan Hecht, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#18).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M CMakeLists.txt
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
22 files changed, 677 insertions(+), 247 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/18
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 18
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 17:

Odd that this happens with gvo, but not for internal builds.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 17:40:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 17:

hmm. looks like some dependency order for generating flat buffers.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 16:02:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 17: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2546/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 10:00:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2546/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 06:36:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 17: Code-Review+2

rebase and fix minor conflict.
carrying dan's +2


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 06:29:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-24 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa 
Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#17).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 676 insertions(+), 246 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/17
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 17
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-23 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 16: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 16
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 24 May 2018 00:13:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-23 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa 
Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#16).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 676 insertions(+), 246 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/16
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 16
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-23 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@783
PS15, Line 783:   boolean checkMissingDiskIds = 
FileSystemUtil.supportsStorageIds(partitionFs);
  :   boolean supportsBlocks = 
FileSystemUtil.supportsStorageIds(partitionFs);
> these are equivalent. how about getting rid of the checkMissingDiskIds, and
see reply to last comment for naming.


http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@812
PS15, Line 812: result.first
> was changing the polarity of this a bug fix?
yes. there was a bug here, as well as double-counting total_bytes from a prev. 
merge, and several npe's from that latest thrift refactor (java thrift list is 
null, not empty)


http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@914
PS15, Line 914: checkMissingDiskIds && !fileDesc.getIsEc()
> this was confusing because when I read the name "checkMissingDiskIds" I tho
renamed to: fsHasBlocks



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 15
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 23 May 2018 20:44:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-23 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@783
PS15, Line 783:   boolean checkMissingDiskIds = 
FileSystemUtil.supportsStorageIds(partitionFs);
  :   boolean supportsBlocks = 
FileSystemUtil.supportsStorageIds(partitionFs);
these are equivalent. how about getting rid of the checkMissingDiskIds, and 
renaming supportsBlocks to fsSupportsBlocks?


http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@812
PS15, Line 812: result.first
was changing the polarity of this a bug fix?


http://gerrit.cloudera.org:8080/#/c/8523/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@914
PS15, Line 914: checkMissingDiskIds && !fileDesc.getIsEc()
this was confusing because when I read the name "checkMissingDiskIds" I 
thought: why doesn't the setting of that variable take into account EC?
How about renaming that parameter to: fsHasBlocks or fsHasDiskIds or something 
to indicate that's a filesystem level property?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 15
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 23 May 2018 15:37:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-22 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Tianyi Wang, Dimitris Tsirogiannis, Alex Behm, Mostafa 
Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#15).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 677 insertions(+), 245 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/15
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 15
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-22 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 14:

Next change handles the merge. It was straightforward; the ec files are handled 
as regular hdfs files, so do not go through the prev. synthetic block 
generation.

For testing, I re-ran core and s3 tests.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 14
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 23 May 2018 05:39:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-22 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 14:

Looks like you'll have conflicts with this: 
https://gerrit.cloudera.org/#/c/10413/


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 14
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 22 May 2018 16:29:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-19 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 14:

running more tests...


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 14
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Sat, 19 May 2018 20:26:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 14: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 14
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Sat, 19 May 2018 03:02:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#14).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 676 insertions(+), 241 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/14
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 14
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 13:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc@302
PS13, Line 302: for (const TScanRangeLocationList& range : 
entry.second.concrete_range) {
  :   expanded_locations.push_back(range);
  : }
> that could be just expanded_locations.insert(concrete_range.begin(), concre
Done


http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift
File common/thrift/Planner.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift@108
PS13, Line 108: concrete_range
> plural since the field is a list: concrete_ranges, split_specs
Done


http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
File fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java@224
PS13, Line 224: scanRangeSpecs_.getSplit_specSize()
> this seems wrong - a single spec may result in multiple hosts, no?
removed. this class never adds specs, but added a precondition check in init in 
the event that's changed for some reason.


http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1126
PS13, Line 1126: scanRangeSpecs_.getSplit_specSize();
> shouldn't that do some calculation based on file size and blocks size and i
good catch. needed to change other places that directly use the size of the 
spec list.



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 13
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Sat, 19 May 2018 01:39:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 13:

(4 comments)

Thanks, that looks simpler and clearer now. Just some minor things.

http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/13/be/src/scheduling/scheduler.cc@302
PS13, Line 302: for (const TScanRangeLocationList& range : 
entry.second.concrete_range) {
  :   expanded_locations.push_back(range);
  : }
that could be just expanded_locations.insert(concrete_range.begin(), 
concrete_range.end())?


http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift
File common/thrift/Planner.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/13/common/thrift/Planner.thrift@108
PS13, Line 108: concrete_range
plural since the field is a list: concrete_ranges, split_specs


http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
File fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java@224
PS13, Line 224: scanRangeSpecs_.getSplit_specSize()
this seems wrong - a single spec may result in multiple hosts, no?
Though I guess for hbase this won't be set so in practice doesn't matter. But 
should we instead just assert that the split_spec list size is 0?


http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1126
PS13, Line 1126: scanRangeSpecs_.getSplit_specSize();
shouldn't that do some calculation based on file size and blocks size and is 
splittable? i.e. after your change we'll get a different number for 
numRemoteRanges when running on S3, right?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 13
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 18 May 2018 23:09:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#13).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 665 insertions(+), 240 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/13
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 13
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-18 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift
File common/thrift/Frontend.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift@375
PS12, Line 375:   per_node_scan_ranges
> Thanks, the new thrift structures make more sense to me. Maybe it's simpler
yeah, it lets me revert to upfront expansion as I had couple of changes back 
(but without mixed concrete/spec) and get rid of that iterator.


http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift@202
PS12, Line 202: THdfsFileSplitGeneratorSpec
> update
Done



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 18 May 2018 21:24:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-17 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift
File common/thrift/Frontend.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/Frontend.thrift@375
PS12, Line 375:   per_node_scan_ranges
Thanks, the new thrift structures make more sense to me. Maybe it's simpler by 
moving the "union" up one level. i.e. make this:

map>

Then the backend can be simplified: the "ConcreteList" is already in the format 
you need, and the GeneratorSpecList can be converted to a ConcreteList in one 
go that you pass to the scheduler. So then no need for the iterator thingy.

What do you think?


http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/12/common/thrift/PlanNodes.thrift@202
PS12, Line 202: THdfsFileSplitGeneratorSpec
update



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 17 May 2018 21:48:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-16 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13
PS12, Line 13: includes S3, ADLS, and local fs.
> Does this change the dependency on fs.s3a.block.size?
This change does not attempt to change the scan ranges that are produced for 
these file systems. Currently, we rely on the filestatus blocksize (see L177 of 
HdfsPartition) when synthesizing blocks to store in the catalog. This patch 
shifts that synthesis to the scheduler, so that its generated for each use 
instead of stored in memory. The block parameter used for this synthesis is set 
in this change on L784 of HdfsScanNode. Instead of filestatus blocksize, it 
uses the filesystem's default block size, which I think is the same thing for 
these file-systems (at least, from what I could tell from hadoop.fs.FileSystem).



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 16 May 2018 06:04:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-15 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13
PS12, Line 13: includes S3, ADLS, and local fs.
> there are todo's in this change to extend to hdfs. pls see the comment in P
Does this change the dependency on fs.s3a.block.size?
Having to rely on that parameter for generate even scan ranges is error prone.

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_s3.html#s3_performance



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 15 May 2018 23:15:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-15 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13
PS12, Line 13: includes S3, ADLS, and local fs.
> What about erasure coded HDFS data?
there are todo's in this change to extend to hdfs. pls see the comment in 
PlanNodes.thrift.



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 15 May 2018 17:35:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-15 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/12//COMMIT_MSG@13
PS12, Line 13: includes S3, ADLS, and local fs.
What about erasure coded HDFS data?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 15 May 2018 17:26:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-10 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Dimitris Tsirogiannis, Alex Behm, Mostafa Mokhtar, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#12).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/Frontend.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Planner.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/DataSourceScanNode.java
M fe/src/main/java/org/apache/impala/planner/HBaseScanNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M fe/src/main/java/org/apache/impala/planner/ScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/planner/PlannerTestBase.java
21 files changed, 718 insertions(+), 233 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/12
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 12
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-05-10 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(1 comment)

getting back to this change... the upcoming update includes the idea to better 
separate the split specs vs. concrete scan ranges. rebase changes caught this 
in the middle so the update also includes rebase changes (only affected 
HdfsScanNode.java).

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228
PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec 
hdfs_file_split_generator_spec
> will try it out.
The additional changes are relatively mechanical. Since this design explicitly 
does not allow these specs to propagate further in the backend, I agree that 
we're better off going this route.



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 10 May 2018 21:42:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-29 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

I got blocked behind some lib-cache stuff, but this is next on my list.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 29 Mar 2018 15:57:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-29 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

Are we going to move forward with this patch?


-- 
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 29 Mar 2018 15:54:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-28 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG@34
PS11, Line 34: - all core tests pass when configured with s3
> Do the tests validate that an "even" number of bytes is assigned to each ba
That block size is used in the specs here, HdfsScanNode.java L755. The division 
into scan ranges should be the same as before. As before, each of these scan 
ranges is "random", e.g., no process affinity, so as long as assignment of 
remote reads balances bytes per backend, then the result should be balanced.
There's a scheduler unit test that tests remote reads and checks that the 
assignment is balanced.
I did some manual queries and saw roughly even assignment.
That said, I don't have a test that looks at bytes-per-host and checks that 
they're even. Will look at what other tests we have for this and what can be 
added.



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 29 Mar 2018 01:49:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-28 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/11//COMMIT_MSG@34
PS11, Line 34: - all core tests pass when configured with s3
Do the tests validate that an "even" number of bytes is assigned to each 
backend?
For S3 scans Impala relies fs.s3a.block.size to ensure scan ranges are evenly 
distributed.a



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 28 Mar 2018 06:02:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-07 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc@323
PS11, Line 323:   ScanRangeIter range_iter(entry.second);
> sorry if this was already discussed, but why not just transform the list he
idea was to not make multiple passes. common case (for now) is to not explode. 
the current scheme does the least extra work over what we have now.


http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228
PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec 
hdfs_file_split_generator_spec
> Right, but what I'm thinking is replace listhttp://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 08 Mar 2018 00:03:59 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-07 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/11/be/src/scheduling/scheduler.cc@323
PS11, Line 323:   ScanRangeIter range_iter(entry.second);
sorry if this was already discussed, but why not just transform the list here 
all at once and pass the exploded list of ranges into 
ComputeScanRangeAssignment()? Is the intention to optimize memory use?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 07 Mar 2018 23:58:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-07 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228
PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec 
hdfs_file_split_generator_spec
> we currently put all scan ranges from all partitions into the same list. pa
Right, but what I'm thinking is replace list 
with list where that thing is basically a union:

// Needs better name
struct TSplitConcreteOrGenSpec {
   // Only set for concrete range.
   optional Planner.TScanRangeLocationList;
   // Only set for generated spec
   optional THdfsFileSplitGeneratorSpec
}

After all, once we've traversed in TScanRangeLocationList we're now talking 
about physical locations and such, so that stuff seems like it belongs on the 
"concrete" side of the divide.

Maybe I'm still not seeing what you mean by "least damage" though.  "Least 
damage" in what sense?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 07 Mar 2018 23:30:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-07 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@206
PS11, Line 206: Hdfs
> I wonder if we should just drop Hdfs from this name given that this is used
yeah, I just extended the current naming. its a good point to change it so I'll 
do that since I don't see a good reason now to carry along the current naming.


http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228
PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec 
hdfs_file_split_generator_spec
> Isn't it that the frontend generates either a list of (concrete) scan range
we currently put all scan ranges from all partitions into the same list. 
partitions may differ in their file system, so from the hdfs scan node's 
perspective, it needs to create a list that can have a mix of concrete and 
generated ranges.

we'd like to handle both cases uniformly (see L205) so doing it this way in 
this change does the least damage to the part that is not changed (concrete 
hdfs scan ranges).



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 07 Mar 2018 23:15:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-03-07 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 11:

(2 comments)

Had a question about the thrift, I will look at the backend side too.

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@206
PS11, Line 206: Hdfs
I wonder if we should just drop Hdfs from this name given that this is used for 
other "filesystem" like storage (i.e. S3, ADLS). Ok to ignore if you prefer to 
keep it for some reason (maybe the reason to keep it is consistency with e.g. 
HdfsScanNode, which also handles S3/ADLS since they go through the same 
FileSystem & libhdfs access path).


http://gerrit.cloudera.org:8080/#/c/8523/11/common/thrift/PlanNodes.thrift@228
PS11, Line 228: 4: optional THdfsFileSplitGeneratorSpec 
hdfs_file_split_generator_spec
Isn't it that the frontend generates either a list of (concrete) scan ranges, 
or it just gives a list of FileSplitGeneratorSpec?  i.e. I would have expected 
this to appear as the alternative to Planner.TScanRangeLocationList since the 
that is the "concrete" scan range list that this thing is going to effectively 
explode into.
i.e. it seems clearer to me to leave TScanRange as the "concrete" range that 
the coordinator will generate using this new "spec".
So what's the reasoning for putting this here instead of higher up in the 
thrift plan tree?



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 07 Mar 2018 22:34:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-02-19 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Dimitris Tsirogiannis, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#11).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
13 files changed, 641 insertions(+), 191 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/11
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-02-12 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 9:

(14 comments)

main changes: (1) minimized changes to the main scheduler method with an 
iterator, (2) added backend tests

http://gerrit.cloudera.org:8080/#/c/8523/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/9//COMMIT_MSG@30
PS9, Line 30: Testing:
> Do we have a test that covers mixed queries? Have you considered adding a t
thx for the suggestions. added tests to scheduler-test.cc. there are end-to-end 
tests for testing multiple file systems in the same table/query (see 
testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test)


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.h
File be/src/scheduling/scheduler.h:

http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.h@450
PS9, Line 450: instances
> nit: instance
Done


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@257
PS9, Line 257:   int num_ranges = 0;
> Have you considered generating the scan ranges here and passing them into C
I tried something like this, but it created two pases so I backed it out. After 
discussing, so that we avoid additional complexity to the main 
ComputeRangeAssignment method, I wrapped the original scan ranges in an 
iterator that generates new scan ranges or returns the original ones.


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@504
PS9, Line 504: // expanded_locations. 'spec' is assumed to *not* have block 
location information.
> missing rename?
removed


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@504
PS9, Line 504: 'spec' is assumed to *not* have block location information.
> It doesn't have a member to do so. Is this a stale comment?
removed. this refers to longer range plan where hdfs files (with block info) 
will also be handled. the todo in the thrift spec covers this.


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@505
PS9, Line 505: HDFS
> This comment seems a bit misleading, since S3 etc are not HDFS filesystems.
remvoed


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@516
PS9, Line 516:   long scan_range_offset = 0;
> Any reason to not use int64_t here?
Done


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@518
PS9, Line 518:   long scan_range_length = std::min(spec.max_block_size, 
fb_desc->length());
> I'm not sure it can happen, but if spec.max_block_size == 0, the loop below
good point. added a check for this in FE as well as here.


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@582
PS9, Line 582: Defer
> "Defer" read to me like if something is being done immediately. Can you cla
Done


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@603
PS9, Line 603: *num_ranges += 1;
> If you move this count to L676 right after the call to RecordScanRangeAssig
got rid of this part of the api. the number of scan ranges is now accessed via 
the iterator.


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@681
PS9, Line 681: remote_scan_range_locations.push_back();
> nit: use vector<>::insert() instead of a loop here.
removed


http://gerrit.cloudera.org:8080/#/c/8523/9/be/src/scheduling/scheduler.cc@686
PS9, Line 686: // Allow exec_at_coord to take precedence over selecting a 
remote host.
> This should only happen for generated scan ranges. Can you add a DCHECK to 
removed


http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift@204
PS9, Line 204:   // -1 means not-splittable
> nit: add newlines above comments.
Done


http://gerrit.cloudera.org:8080/#/c/8523/9/common/thrift/PlanNodes.thrift@205
PS9, Line 205:   2: required i64 max_block_size
> Could this ever be infinity? I feel it might be clearer to have a boolean f
Done



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 9
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Mon, 12 Feb 2018 

[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-02-12 Thread Vuk Ercegovac (Code Review)
Hello Lars Volker, Dimitris Tsirogiannis, Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#10).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- added backend scheduler tests
- mixed-filesystems test covers tables/queries with multiple fs's.
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
13 files changed, 638 insertions(+), 191 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/10
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-30 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8523/8/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/8/common/thrift/PlanNodes.thrift@212
PS8, Line 212:   3: optional binary kudu_scan_token
we discussed in person, but since we'll eventually want to unify all these 
cases (the frontend can send blocks and locations), let's add a comment to that 
effect and perhaps reference the JIRA that will track the remaining work. That 
way, readers of the code can have a hint as to the direction we want things to 
move (and that currently we're in a transitional state that isn't the end goal).



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Tue, 30 Jan 2018 18:24:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 8:

I probably won't have a chance to look through this in detail, but I think Lars 
should have a look at the scheduler changes - he's the most familiar with that 
code.


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 26 Jan 2018 18:20:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-24 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 8: Code-Review+1

carry +1 from dimitris


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 24 Jan 2018 18:29:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-22 Thread Vuk Ercegovac (Code Review)
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#8).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/CMakeLists.txt
A be/src/util/flat_buffer.cc
A be/src/util/flat_buffer.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 399 insertions(+), 179 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/8
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-22 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 7:

(6 comments)

in addition to the comments, decided to move that util function to its own file.

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509
PS7, Line 509:   vector* 
expanded_locations,
 :  int* num_generated) {
> Indentation is off
Done


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h
File be/src/util/compress.h:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@136
PS7, Line 136: flat buffer
> nit: FlatBuffer so that it is clear what this thing represents
Done


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@139
PS7, Line 139:  
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc
File be/src/util/compress.cc:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc@325
PS7, Line 325:  THdfsCompression::type* 
thrift_compression) {
> Your indentation is off is several places
Done


http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@154
PS7, Line 154:
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@738
PS7, Line 738: supportsBlocks
> in the other file you're using hasBlockInfo. I think I prefer this one bett
Done



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 7
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Mon, 22 Jan 2018 23:55:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-22 Thread Dimitris Tsirogiannis (Code Review)
Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 7: Code-Review+1

(7 comments)

Mostly minor comments but I am happy with the FE changes. You should ask 
someone to take another look at BE as well.

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509
PS7, Line 509: expanded_locations
Maybe rename to 'generated_scan_ranges'?


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/scheduling/scheduler.cc@509
PS7, Line 509:   vector* 
expanded_locations,
 :  int* num_generated) {
Indentation is off


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h
File be/src/util/compress.h:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@136
PS7, Line 136: flat buffer
nit: FlatBuffer so that it is clear what this thing represents


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.h@139
PS7, Line 139:  
nit: indentation


http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc
File be/src/util/compress.cc:

http://gerrit.cloudera.org:8080/#/c/8523/7/be/src/util/compress.cc@325
PS7, Line 325:  THdfsCompression::type* 
thrift_compression) {
Your indentation is off is several places


http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@154
PS7, Line 154:
nit: extra space


http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/7/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@738
PS7, Line 738: supportsBlocks
in the other file you're using hasBlockInfo. I think I prefer this one better :)



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 7
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Mon, 22 Jan 2018 19:50:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-21 Thread Vuk Ercegovac (Code Review)
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#7).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/util/compress.cc
M be/src/util/compress.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
10 files changed, 344 insertions(+), 180 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/7
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 7
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-17 Thread Vuk Ercegovac (Code Review)
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#6).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
8 files changed, 332 insertions(+), 183 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/6
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 6
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-17 Thread Vuk Ercegovac (Code Review)
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#5).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range data structures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/service/client-request-state.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
9 files changed, 333 insertions(+), 183 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/5
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 5
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-17 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24
PS4, Line 24: datastructures
> nit: data structures
Done


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h
File be/src/scheduling/query-schedule.h:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260
PS4, Line 260: // Map of plan node to scan ranges that are computed from 
request_. If a plan node is
 :   // not present, then it has no associated additional scan 
ranges.
 :   // Populated by the ComputeScanRanges.
 :   std::map 
computed_ranges_;
> How is this related to PerNodeScanRanges (L41)?
backing out this refactor.


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258
PS4, Line 258: // Combine actual and computed ranges from the schedule.
 :   const vector* scan_ranges = 

 :   const vector* computed_ranges =
 :   schedule->GetComputedRanges(entry.first);
 :   vector combined_ranges;
 :   if (computed_ranges != nullptr) {
 : for (const auto& range : entry.second) {
 :   if (!range.scan_range.__isset.hdfs_file_split_spec) {
 : combined_ranges.push_back(range);
 :   }
 : }
 : for (const auto& range : 
(*schedule->GetComputedRanges(entry.first))) {
 :   combined_ranges.push_back(range);
 : }
 : scan_ranges = _ranges;
 :   }
> I think my previous comment made things worse, sorry about that. I think it
no problems. added back here. expansion is done per spec in a separate helper 
function below.


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199
PS4, Line 199: 3: required i64 partition_id
> Comment what this is referencing.
Done


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209
PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec
> Add a comment either here or at the top to describe when THdfsFileSplit vs
Lets go with THdfsFileSplitGeneratorSpec


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126
PS4, Line 126: fileFormat'
> nit: missing quote
Done


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712
PS4, Line 712: fileDesc.getNumFileBlocks()
> What will happen if this is a zero-byte file (no blocks). Can't we prune th
changed it to be more explicit (based on the fs).


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743
PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName())
> Do we need to do this for every file? Can't we do this once per partition?
good point, lifted that to be per partition.


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761
PS4, Line 761: disk ids are expected in blocks,
 :* checkMissingDiskIds should be set to true. If disk ids are 
checked,
> "checkMissingDiskIds is true, "
Done



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Thu, 18 Jan 2018 05:47:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-17 Thread Dimitris Tsirogiannis (Code Review)
Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8523/4//COMMIT_MSG@24
PS4, Line 24: datastructures
nit: data structures


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h
File be/src/scheduling/query-schedule.h:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/query-schedule.h@260
PS4, Line 260: // Map of plan node to scan ranges that are computed from 
request_. If a plan node is
 :   // not present, then it has no associated additional scan 
ranges.
 :   // Populated by the ComputeScanRanges.
 :   std::map 
computed_ranges_;
How is this related to PerNodeScanRanges (L41)?


http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/4/be/src/scheduling/scheduler.cc@258
PS4, Line 258: // Combine actual and computed ranges from the schedule.
 :   const vector* scan_ranges = 

 :   const vector* computed_ranges =
 :   schedule->GetComputedRanges(entry.first);
 :   vector combined_ranges;
 :   if (computed_ranges != nullptr) {
 : for (const auto& range : entry.second) {
 :   if (!range.scan_range.__isset.hdfs_file_split_spec) {
 : combined_ranges.push_back(range);
 :   }
 : }
 : for (const auto& range : 
(*schedule->GetComputedRanges(entry.first))) {
 :   combined_ranges.push_back(range);
 : }
 : scan_ranges = _ranges;
 :   }
I think my previous comment made things worse, sorry about that. I think it's 
fine to move the scan range generation here. In that way, we will avoid the 
double iteration over all plan nodes and the additional state (per node 
computed scan ranges). Just make sure you wrap this into a separate function.


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@199
PS4, Line 199: 3: required i64 partition_id
Comment what this is referencing.


http://gerrit.cloudera.org:8080/#/c/8523/4/common/thrift/PlanNodes.thrift@209
PS4, Line 209: 4: optional THdfsFileSplitSpec hdfs_file_split_spec
Add a comment either here or at the top to describe when THdfsFileSplit vs 
THdfsFileSplitSpec is set. Actually, THdfsFileSplitSpec may not be the best 
name. How about THdfsFileSplitGenerationSpec or something like that?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@126
PS4, Line 126: fileFormat'
nit: missing quote


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@712
PS4, Line 712: fileDesc.getNumFileBlocks()
What will happen if this is a zero-byte file (no blocks). Can't we prune this 
early on or it never reaches that point?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@743
PS4, Line 743: partitionFs.getDefaultBlockSize(new Path(fileDesc.getFileName())
Do we need to do this for every file? Can't we do this once per partition?


http://gerrit.cloudera.org:8080/#/c/8523/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@761
PS4, Line 761: disk ids are expected in blocks,
 :* checkMissingDiskIds should be set to true. If disk ids are 
checked,
"checkMissingDiskIds is true, "



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 17 Jan 2018 19:30:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-11 Thread Vuk Ercegovac (Code Review)
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8523

to look at the new patch set (#4).

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range datastructures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/query-schedule.cc
M be/src/scheduling/query-schedule.h
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M be/src/service/client-request-state.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
9 files changed, 343 insertions(+), 178 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/4
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-11 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc@582
PS3, Line 582: ComputeScanRangeAssignment
> This function now does both assignment and generation of scan ranges. Also,
I moved it to QueryScheduler... that seems to be the place where data about 
scheduling is stored (including the request with scan ranges from the FE). Now, 
the idea is to first generate these ranges (done in client request), then 
schedule. If there is a better place to store such per request info, let me 
know and I can easily move it there.
The only thing the scheduler knows about all of this right now is to stitch 
together computed and given ranges-- that's a bit gorpy (L258) and I plan to 
fix it if this flow/organization looks ok.


http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift@195
PS3, Line 195: struct THdfsFileSplitSpec {
 :   1: required CatalogObjects.THdfsFileDesc file_desc
 :   // -1 means not-splittable
 :   2: required i64 max_block_size
 :   3: required i64 partition_id
 : }
> THdfsFileDesc and THdfsFileSplit have quite a bit of intersection wrt the i
I did this for two reasons:
1) I expect this Spec to be the only thing that comes out of the FE for hdfs 
files (unexpanded) and THdfsFileSplit will only be present on the BE 
(expanded). Currently, I'm doing this just for s3/adls/local, but I'm thinking 
to apply it to all (not this change) hdfs files.
2) I preferred to use the type-checker to tell me which representation I have 
(expanded vs. unexpanded). Otherwise, I'd have to peer into THdfsFileSplit to 
determine this state, which seems more error-prone.


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@131
PS3, Line 131: new int[] {}
> I believe you don't have to do that. You can probably pass null and then av
Done


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@338
PS3, Line 338: its
> typo: it's
Done


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@666
PS3, Line 666: private void generateScanRangeSpecs
> No intention to be picky but can you move this and tansformBlocksToScanRang
Done


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@694
PS3, Line 694:* TScanRangeLocationLists are added scanRanges_. If disk ids 
are expected in blocks
> to
Done



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Fri, 12 Jan 2018 01:19:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2018-01-02 Thread Dimitris Tsirogiannis (Code Review)
Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/8523/3/be/src/scheduling/scheduler.cc@582
PS3, Line 582: ComputeScanRangeAssignment
This function now does both assignment and generation of scan ranges. Also, it 
feels kind of weird to have the generation part in the scheduler. I think it 
may be better to separate these two concepts and move the scan range generation 
part out of here.


http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8523/3/common/thrift/PlanNodes.thrift@195
PS3, Line 195: struct THdfsFileSplitSpec {
 :   1: required CatalogObjects.THdfsFileDesc file_desc
 :   // -1 means not-splittable
 :   2: required i64 max_block_size
 :   3: required i64 partition_id
 : }
THdfsFileDesc and THdfsFileSplit have quite a bit of intersection wrt the 
information they store. I am wondering if it would be simpler to modify 
THdfsFileSplit to handle both cases and avoid THdfsFileSplitSpec altogether.


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@131
PS3, Line 131: new int[] {}
I believe you don't have to do that. You can probably pass null and then avoid 
L143 and L150 with couple of checks.


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@338
PS3, Line 338: its
typo: it's


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@666
PS3, Line 666: private void generateScanRangeSpecs
No intention to be picky but can you move this and tansformBlocksToScanRanges 
below function computeScanRangeLocations so that the code reads in a top-down 
fashion? thanks


http://gerrit.cloudera.org:8080/#/c/8523/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@694
PS3, Line 694:* TScanRangeLocationLists are added scanRanges_. If disk ids 
are expected in blocks
to



--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Dimitris Tsirogiannis 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 03 Jan 2018 00:17:59 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2017-12-20 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..


Patch Set 3:

all tests pass with s3


--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Vuk Ercegovac 
Gerrit-Comment-Date: Wed, 20 Dec 2017 17:19:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls

2017-12-20 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls
..

IMPALA-5931: Generates scan ranges in planner for s3/adls

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range datastructures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

Testing:
- local fs tests cover the code paths in this change
- all core tests pass when configured with s3
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
- TODO: adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
5 files changed, 294 insertions(+), 166 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/3
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac 
Gerrit-Reviewer: Vuk Ercegovac 


[Impala-ASF-CR] IMPALA-5931: Generates scan ranges in planner for s3/adls (wip)

2017-12-13 Thread Vuk Ercegovac (Code Review)
Vuk Ercegovac has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/8523 )

Change subject: IMPALA-5931: Generates scan ranges in planner for s3/adls (wip)
..

IMPALA-5931: Generates scan ranges in planner for s3/adls (wip)

Currently, for filesystems that do not include physical
block information (e.g., block replica locations, caching),
synthetic blocks are generated and stored in the catalog
when metadata is loaded. Example file systems for which this is done
includes S3, ADLS, and local fs.

This change avoids generating these blocks when metadata is loaded.
Instead, scan ranges are directly generated from such files by the
backend coordinator. Previously, all scan ranges were produced by
the planner in HDFSScanNode in the frontend. Now, those files without
block information are sent to the coordinator represented by a split
specification that determines how the coordinator will create scan ranges
to send to executors.

This change reduces the space needed in the catalog and reduces the
scan range datastructures that are passed from the frontend to the
backend when planning and coordinating a query.
In addition a bug is avoided where non-splittable files were being
split anyways to support the query parameter that places a limit on
scan ranges.

The "wip" status is currently to add/run more tests.

Testing:
- local filesystem tests exercise this code path
- manually tried larger local filesystem tables (tpch) with multiple
  partitions and observed the same scan ranges.
  - TODO: s3 and adls testing

Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
---
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
5 files changed, 282 insertions(+), 165 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/8523/2
--
To view, visit http://gerrit.cloudera.org:8080/8523
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I326065adbb2f7e632814113aae85cb51ca4779a5
Gerrit-Change-Number: 8523
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac