[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling sees blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Reviewed-on: http://gerrit.cloudera.org:8080/13545 Reviewed-by: Joe McDonnell Tested-by: Impala Public Jenkins --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 332 insertions(+), 55 deletions(-) Approvals: Joe McDonnell: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 14 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 13: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 13 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 27 Jun 2019 04:24:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 13: Code-Review+2 Rebased, carry +2 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 13 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 13: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4554/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 13 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 12: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/ -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 12 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 26 Jun 2019 03:38:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 12: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 12 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 12 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 11 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 25 Jun 2019 00:24:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 11: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 11 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 21 Jun 2019 01:22:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 11: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 11 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 21 Jun 2019 00:36:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 11: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4513/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 11 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 19:53:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3699/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 10 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 17:42:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 9: (4 comments) Thanks for the review http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc File be/src/scheduling/scheduler-test-util.cc: http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47 PS9, Line 47: > nit: double space Done http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18 PS9, Line 18: #include > Is this one needed? Oh, good point, this is leftover from an approach I ended up not using. Removed. http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228 PS9, Line 228: cluster.AddHosts(num_data_nodes, false, true); > nit: use AddRemoteCluster()? Switched this over. Documented that CreateRemoteCluster() puts the Impala nodes first, so the indices for the AddSingleBlockTable() call needed to change. http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376 PS9, Line 376: cluster.AddHosts(num_impala_nodes, true, false); > nit: AddRemoteCluster()? Good catch, these statements shouldn't be here, as this uses CreateRemoteCluster(). -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 17:02:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#10). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling sees blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 332 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/10 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 10 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 9: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 06:41:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3689/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 8 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:36:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3690/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:31:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 9: Code-Review+1 (4 comments) Thanks for the changes. I had few more nits but otherwise the change looks good to me. http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc File be/src/scheduling/scheduler-test-util.cc: http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47 PS9, Line 47: nit: double space http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18 PS9, Line 18: #include Is this one needed? http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228 PS9, Line 228: cluster.AddHosts(num_data_nodes, false, true); nit: use AddRemoteCluster()? http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376 PS9, Line 376: cluster.AddHosts(num_impala_nodes, true, false); nit: AddRemoteCluster()? -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:21:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 9: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4503/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:15:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264 PS7, Line 264: void InitRemoteCluster(Cluster *cluster) { > Thx. I think I'd prefer to spell out the size of the cluster by calling Cre Removed the CreateStandardRemoteCluster and changed tests to call CreateRemoteCluster directly. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:14:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#9). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling sees blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 329 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/9 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 9 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264 PS7, Line 264: /// Helper function to verify that two things are treated as distinct for consistent > Moved this to a static function on Cluster. I added a CreateStandardRemoteC Thx. I think I'd prefer to spell out the size of the cluster by calling CreateRemoteCluster(50, 30), similar to how we have several calls to cluster.AddHosts(3, true, true) through out this file. This makes it easier for the reader to see the cluster size without having to check another file. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 8 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 01:02:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: (9 comments) http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24 PS7, Line 24: see > nit: sees? Done http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h File be/src/scheduling/scheduler-test-util.h: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95 PS7, Line 95: PARTITIONED_SIMPLE_NAMES > PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES? Changed these to PARTITIONED_SINGLE_FILENAME and PARTITIONED_UNIQUE_FILENAMES. http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc File be/src/scheduling/scheduler-test-util.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295 PS7, Line 295: // Encoding the table name and index in the file helps debugging. > Maybe include one or two examples of a path in the comment in each case? Th Added examples for each section http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353 PS7, Line 353: GetBlockPaths(table_name, true, spec_idx, naming_policy, &relative_path, > One of the parameters could be returned by value, e.g. the partition_id I think I prefer returning multiple things all via args rather than having one via return value, so I'm going to leave this. http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264 PS7, Line 264: void InitRemoteCluster(Cluster *cluster) { > This could be a member of Cluster, e.g. static Cluster CreateRemoteCluster( Moved this to a static function on Cluster. I added a CreateStandardRemoteCluster for the 50 impala node, 3 data node configuration. http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305 PS7, Line 305: > Can you add brief comments above each test describing what to do (see other Added comments for the new tests http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314 PS7, Line 314: Schema schema(cluster); > These loops could benefit from adding a SCOPED_TRACE for the naming policy Added SCOPED_TRACE for the naming policy. http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195 PS7, Line 195: partition's path > For S3, this could be a bucket? Does it matter whether it uses the whole pa The partition_id is something that Impala assigns relatively arbitrarily and it is not really stable across time. For example, it is not stable across an "invalidate metadata" for a table. So, it does need to be the path. http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196 PS7, Line 196: consistent > "consistent" here mean "given the same input, must return the same hash", r Right -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 20 Jun 2019 00:55:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#8). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling sees blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 330 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/8 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 8 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 23:42:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: (9 comments) http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24 PS7, Line 24: see nit: sees? http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h File be/src/scheduling/scheduler-test-util.h: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95 PS7, Line 95: PARTITIONED_SIMPLE_NAMES PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES? http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc File be/src/scheduling/scheduler-test-util.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295 PS7, Line 295: // Encoding the table name and index in the file helps debugging. Maybe include one or two examples of a path in the comment in each case? That might make the intent easier to follow. http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353 PS7, Line 353: GetBlockPaths(table_name, true, spec_idx, naming_policy, &relative_path, One of the parameters could be returned by value, e.g. the partition_id http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264 PS7, Line 264: void InitRemoteCluster(Cluster *cluster) { This could be a member of Cluster, e.g. static Cluster CreateRemoteCluster(num_impala_nodes, num_data_nodes). http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305 PS7, Line 305: Can you add brief comments above each test describing what to do (see other tests in this file for examples)? http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314 PS7, Line 314: Schema schema(cluster); These loops could benefit from adding a SCOPED_TRACE for the naming policy (https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#adding-traces-to-assertions). http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195 PS7, Line 195: partition's path For S3, this could be a bucket? Does it matter whether it uses the whole path or just the partition id? Should we just call it "partition_hash"? http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196 PS7, Line 196: consistent "consistent" here mean "given the same input, must return the same hash", right? -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 22:29:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3680/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 18:43:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 7: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4499/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 18:11:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267 PS6, Line 267: // Set of Impala hosts > Extra space indentation Done http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282 PS6, Line 282: /// overlapping by chance is extremely low, so this test should only fail if the > Can we control how the RNG used for the random replica selection is seeded? Good point. I was thinking the probability of overlap should be small enough not to matter (~10^-18 or so), but then I noticed that we do srand(0) in SchedulerTest constructor: https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler-test.cc#L31 The constructor runs for each test, so I guess these are deterministic. Changed this comment to indicate this is deterministic (which is fine, because these tests don't benefit much from actual randomness). http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195 PS6, Line 195: // hash of the partition's path > I think it would be useful to document how the hash should be computed (sin Added a comment about using Java String.hashCode() and that it needs to be consistent across processes/machines. http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951 PS6, Line 951: partition.getLocation().hashCode()); > This was surprisingly subtle. I remembered that hashCode() was not guarante Good point, I added a comment in the thrift file. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 18:03:11 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#7). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling see blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 281 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/7 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 7 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 6: (4 comments) Tests look great, thanks for adding them. I had a few minor requests but can +2 after you've addressed those. http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc File be/src/scheduling/scheduler-test.cc: http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267 PS6, Line 267: // Set of Impala hosts Extra space indentation http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282 PS6, Line 282: /// overlapping by chance is extremely low, so this test should only fail if the Can we control how the RNG used for the random replica selection is seeded? E.g. call srand() with a seed that we control before each run. Otherwise it might be hard to reproduce if we have occasional collisions (e.g. RNG isn't as good as we think and has more collisions than we'd expect). This isn't a blocker if it turns out to be tricky, but it's a nice-to-have. http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195 PS6, Line 195: // hash of the partition's path I think it would be useful to document how the hash should be computed (since we rely on it being computed the same way). I.e. that it's the standard Java String.hashCode() http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951 PS6, Line 951: partition.getLocation().hashCode()); This was surprisingly subtle. I remembered that hashCode() was not guaranteed to be consistent across JVMs: https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html. But it appears that the String hashCode is actually documented as being stable: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#hashCode-- I think it is worth documenting, but might be best to document in the thrift spec instead of here. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 16:48:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 05:17:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 5: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/ -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 5 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 03:06:58 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3674/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 19 Jun 2019 00:30:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 6: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4496/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 18 Jun 2019 23:57:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#6). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling see blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/ExplainTest.java 7 files changed, 277 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/6 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 6 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3670/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 4 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 18 Jun 2019 22:16:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 ) Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. Patch Set 5: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 5 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 18 Jun 2019 21:43:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement
Hello Tim Armstrong, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/13545 to look at the new patch set (#4). Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement .. IMPALA-8630: Hash the full path when calculating consistent remote placement Consistent remote placement currently uses the relative filename within a partition for the consistent hash. If the relative filenames for different partitions have a simple naming scheme, then multiple partitions may have files of the same name. This is true for some tables written by Hive (e.g. in our minicluster the tpcds.store_sales has this problem). This can lead to unbalanced placement of remote ranges. This adds a partition_path_hash to the THdfsFileSplit and THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of the partition information). The scheduler hashes this in addition to the relative path. Testing: - Added several new scheduler tests that verify the consistent remote scheduling see blocks with different relative paths, partition paths, or offsets as distinct. - Ran core tests Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b --- M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler-test.cc M be/src/scheduling/scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java 6 files changed, 276 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/4 -- To view, visit http://gerrit.cloudera.org:8080/13545 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b Gerrit-Change-Number: 13545 Gerrit-PatchSet: 4 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon