[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-27 Thread Joe McDonnell (Code Review)
Joe McDonnell has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Reviewed-on: http://gerrit.cloudera.org:8080/13545
Reviewed-by: Joe McDonnell 
Tested-by: Impala Public Jenkins 
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 332 insertions(+), 55 deletions(-)

Approvals:
  Joe McDonnell: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 14
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 13: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 27 Jun 2019 04:24:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-26 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 13: Code-Review+2

Rebased, carry +2


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4554/ 
DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 26 Jun 2019 03:38:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-25 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 12: Code-Review+2

Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/ 
DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-24 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 25 Jun 2019 00:24:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 11: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 21 Jun 2019 01:22:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 11: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 21 Jun 2019 00:36:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4513/ 
DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 19:53:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3699/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 10
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 17:42:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 9:

(4 comments)

Thanks for the review

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47
PS9, Line 47:
> nit: double space
Done


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18
PS9, Line 18: #include 
> Is this one needed?
Oh, good point, this is leftover from an approach I ended up not using. Removed.


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228
PS9, Line 228:   cluster.AddHosts(num_data_nodes, false, true);
> nit: use AddRemoteCluster()?
Switched this over. Documented that CreateRemoteCluster() puts the Impala nodes 
first, so the indices for the AddSingleBlockTable() call needed to change.


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376
PS9, Line 376:   cluster.AddHosts(num_impala_nodes, true, false);
> nit: AddRemoteCluster()?
Good catch, these statements shouldn't be here, as this uses 
CreateRemoteCluster().



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 17:02:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-20 Thread Joe McDonnell (Code Review)
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#10).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 332 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/10
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 10
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 9: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 06:41:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3689/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:36:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3690/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:31:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 9: Code-Review+1

(4 comments)

Thanks for the changes. I had few more nits but otherwise the change looks good 
to me.

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47
PS9, Line 47:
nit: double space


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18
PS9, Line 18: #include 
Is this one needed?


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228
PS9, Line 228:   cluster.AddHosts(num_data_nodes, false, true);
nit: use AddRemoteCluster()?


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376
PS9, Line 376:   cluster.AddHosts(num_impala_nodes, true, false);
nit: AddRemoteCluster()?



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:21:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4503/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:15:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
> Thx. I think I'd prefer to spell out the size of the cluster by calling Cre
Removed the CreateStandardRemoteCluster and changed tests to call 
CreateRemoteCluster directly.



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:14:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#9).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 329 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/9
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: /// Helper function to verify that two things are treated as 
distinct for consistent
> Moved this to a static function on Cluster. I added a CreateStandardRemoteC
Thx. I think I'd prefer to spell out the size of the cluster by calling 
CreateRemoteCluster(50, 30), similar to how we have several calls to 
cluster.AddHosts(3, true, true) through out this file. This makes it easier for 
the reader to see the cluster size without having to check another file.



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:02:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24
PS7, Line 24: see
> nit: sees?
Done


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h
File be/src/scheduling/scheduler-test-util.h:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95
PS7, Line 95: PARTITIONED_SIMPLE_NAMES
> PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES?
Changed these to PARTITIONED_SINGLE_FILENAME and PARTITIONED_UNIQUE_FILENAMES.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295
PS7, Line 295: // Encoding the table name and index in the file helps 
debugging.
> Maybe include one or two examples of a path in the comment in each case? Th
Added examples for each section


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353
PS7, Line 353:   GetBlockPaths(table_name, true, spec_idx, naming_policy, 
&relative_path,
> One of the parameters could be returned by value, e.g. the partition_id
I think I prefer returning multiple things all via args rather than having one 
via return value, so I'm going to leave this.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
> This could be a member of Cluster, e.g. static Cluster CreateRemoteCluster(
Moved this to a static function on Cluster. I added a 
CreateStandardRemoteCluster for the 50 impala node, 3 data node configuration.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305
PS7, Line 305:
> Can you add brief comments above each test describing what to do (see other
Added comments for the new tests


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314
PS7, Line 314: Schema schema(cluster);
> These loops could benefit from adding a SCOPED_TRACE for the naming policy
Added SCOPED_TRACE for the naming policy.


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195
PS7, Line 195: partition's path
> For S3, this could be a bucket? Does it matter whether it uses the whole pa
The partition_id is something that Impala assigns relatively arbitrarily and it 
is not really stable across time. For example, it is not stable across an 
"invalidate metadata" for a table. So, it does need to be the path.


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196
PS7, Line 196: consistent
> "consistent" here mean "given the same input, must return the same hash", r
Right



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 20 Jun 2019 00:55:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#8).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 330 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/8
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 23:42:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24
PS7, Line 24: see
nit: sees?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h
File be/src/scheduling/scheduler-test-util.h:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95
PS7, Line 95: PARTITIONED_SIMPLE_NAMES
PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295
PS7, Line 295: // Encoding the table name and index in the file helps 
debugging.
Maybe include one or two examples of a path in the comment in each case? That 
might make the intent easier to follow.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353
PS7, Line 353:   GetBlockPaths(table_name, true, spec_idx, naming_policy, 
&relative_path,
One of the parameters could be returned by value, e.g. the partition_id


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
This could be a member of Cluster, e.g. static Cluster 
CreateRemoteCluster(num_impala_nodes, num_data_nodes).


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305
PS7, Line 305:
Can you add brief comments above each test describing what to do (see other 
tests in this file for examples)?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314
PS7, Line 314: Schema schema(cluster);
These loops could benefit from adding a SCOPED_TRACE for the naming policy 
(https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#adding-traces-to-assertions).


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195
PS7, Line 195: partition's path
For S3, this could be a bucket? Does it matter whether it uses the whole path 
or just the partition id? Should we just call it "partition_hash"?


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196
PS7, Line 196: consistent
"consistent" here mean "given the same input, must return the same hash", right?



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 22:29:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3680/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:43:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4499/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:11:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267
PS6, Line 267: // Set of Impala hosts
> Extra space indentation
Done


http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282
PS6, Line 282: /// overlapping by chance is extremely low, so this test should 
only fail if the
> Can we control how the RNG used for the random replica selection is seeded?
Good point. I was thinking the probability of overlap should be small enough 
not to matter (~10^-18 or so), but then I noticed that we do srand(0) in 
SchedulerTest constructor:
https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler-test.cc#L31
The constructor runs for each test, so I guess these are deterministic.

Changed this comment to indicate this is deterministic (which is fine, because 
these tests don't benefit much from actual randomness).


http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195
PS6, Line 195:   // hash of the partition's path
> I think it would be useful to document how the hash should be computed (sin
Added a comment about using Java String.hashCode() and that it needs to be 
consistent across processes/machines.


http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951
PS6, Line 951: partition.getLocation().hashCode());
> This was surprisingly subtle. I remembered that hashCode() was not guarante
Good point, I added a comment in the thrift file.



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:03:11 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Joe McDonnell (Code Review)
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#7).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 281 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/7
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 6:

(4 comments)

Tests look great, thanks for adding them. I had a few minor requests but can +2 
after you've addressed those.

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267
PS6, Line 267: // Set of Impala hosts
Extra space indentation


http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282
PS6, Line 282: /// overlapping by chance is extremely low, so this test should 
only fail if the
Can we control how the RNG used for the random replica selection is seeded? 
E.g. call srand() with a seed that we control before each run. Otherwise it 
might be hard to reproduce if we have occasional collisions (e.g. RNG isn't as 
good as we think and has more collisions than we'd expect).

This isn't a blocker if it turns out to be tricky, but it's a nice-to-have.


http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195
PS6, Line 195:   // hash of the partition's path
I think it would be useful to document how the hash should be computed (since 
we rely on it being computed the same way). I.e. that it's the standard Java 
String.hashCode()


http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951
PS6, Line 951: partition.getLocation().hashCode());
This was surprisingly subtle. I remembered that hashCode() was not guaranteed 
to be consistent across JVMs: 
https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html.
 But it appears that the String hashCode is actually documented as being 
stable: 
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#hashCode--

I think it is worth documenting, but might be best to document in the thrift 
spec instead of here.



--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 16:48:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 6: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 05:17:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 03:06:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3674/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 19 Jun 2019 00:30:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4496/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 18 Jun 2019 23:57:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Joe McDonnell (Code Review)
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#6).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 277 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/6
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3670/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 18 Jun 2019 22:16:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 18 Jun 2019 21:43:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

2019-06-18 Thread Joe McDonnell (Code Review)
Hello Tim Armstrong, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#4).

Change subject: IMPALA-8630: Hash the full path when calculating consistent 
remote placement
..

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
6 files changed, 276 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/4
--
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon