[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#13). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator-ir.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/grouping-aggregator.h M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 12 files changed, 812 insertions(+), 86 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/13 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 13 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5832/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 12 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 20 Apr 2020 19:13:36 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 12: New change in Patch Set 12: - Instead of measuring fill ratio as percentage, count the number of filled bucket instead (counter FilledBuckets). - Add InsertTravel counter to measure how much travel involved in insertion path. Consequently, number of travel in lookup-only can be computed as (Travel - InsertTravel). - Try to speedup bucket rebalancing process by displacing buckets altogether instead of swapping one-by-one. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 12 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 20 Apr 2020 18:36:28 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#12). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator-ir.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/grouping-aggregator.h M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 12 files changed, 832 insertions(+), 92 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/12 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 12 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 11: Patch Set 11 add improvement for insertion into Robin-Hood hash table. If an insertion target a bucket that is an empty, do insertion in-place, bypassing temporary bucket and swap routine. This reduce unnecessary number of bucket swaps. Based on Patch Set 11, I have run initial performance measurement against TPC-DS with scale 60 GB over 3 nodes. The baseline is TPC-DS under quadratic probing hash table. Among 91 queries, 34 queries perform better in Robin-Hood than Quadratic, 57 others show degradation. The breakdown are the following: - 1 query improve by 18.69% - 3 queries improve between 11.29% to 14.21% - 8 queries improve between 5.90% to 9.93% - 22 queries improve between 0.02% to 4.14% - 42 queries degrade 0% to -4.86% - 8 queries degrade between -5.51% to -8.50% - 6 queries degrade between -10.49% to -14.89% - 1 queries degrade between -23.50% The improvement and degradations mostly still within tens and hundreds of miliseconds, so I suspect larger test scale will be more representative. Meanwhile, I will study the query profile from my runs and look if we can find some explanation about this result. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 11 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 06 Apr 2020 17:31:09 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5708/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 11 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 06 Apr 2020 17:25:53 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#11). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 793 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/11 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 11 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5691/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 10 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 02 Apr 2020 15:27:55 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 10: (5 comments) Patch Set 10 adds several things: - Fix bug where we forgot to swap distance (PSL) variable as we swap bucket. - Create a separate ProbeRobinHood function for robin-hood algorithm. - Add some new profile counters to measure the cost of rebalancing. Below, I also add some of my own comment indicating several of DCHECK hacks that I'm not fully understand, but works. Patch set 10 pass core tests with this hacks, but we should definitely revisit them later to fully verify their correctness. http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92 PS9, Line 92: (LIKELY(step < nu > In case of FindBuildRowBucket, it is quite tricky, because the caller of Fi I went ahead by creating separate ProbeRobinHood function to solve this. http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132 PS9, Line 132: if we got > You are correct, thanks for pointing this bug! Will fix this in my next sub Done http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc@1316 PS10, Line 1316: DCHECK_EQ(replaced_constants.robin_hood, 0); I'm not sure if the replacement count here should be 0. But looking at priors DCHECKs, it is probably right. http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc@1393 PS10, Line 1393: DCHECK_REPLACE_COUNT(replaced, 2); In Patch Set 10, I create a second Probe function special for Robin-Hood algorithm. This additions seems to add a second call-site to function "Equals". Changing this DCHECK and other related checks to 2 pass me from errors. http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-node.cc@1543 PS10, Line 1543: DCHECK(replaced == 1 || replaced == 2 || replaced == 3 || replaced == 4) << replaced; Similarly, since I'm adding a new Probe function, I think I should change this DCHECK, but I'm not sure what is the right check. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 10 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 02 Apr 2020 15:17:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#10). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 777 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/10 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 10 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92 PS9, Line 92: return bucket_idx > Agree that it becomes too much overloaded just to fit in robin-hood case. M In case of FindBuildRowBucket, it is quite tricky, because the caller of FindBuildRowBucket is the one who decide what to do next given the returned value "found". If found equals true, the caller might continue to call GetTuple or UpdateTuple. If found equals false, the caller might call SetTuple next, which essentially an insert operation. In linear/quadratic mode, the iterator should point to an empty bucket before SetTuple is called. But in robin-hood, I hack it such that it point to the non-empty bucket that should be evicted if SetTuple is called. I can see that this is where the contract of return value of Probe becomes inconsistent, depending of what is the probe algorithm being used. Maybe the cleanest way is to create separate Probe function, say ProbeRobinHood, and use it when robin-hood algorithm is used. The downside is we will have a little bit code duplication between the two probe functions. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 02 Apr 2020 04:13:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92 PS9, Line 92: return bucket_idx > I understand this is the short-cut case for insert, and returning the index Agree that it becomes too much overloaded just to fit in robin-hood case. Maybe it is better to return the swap target bucket as an additional return parameter next to parameter found? http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132 PS9, Line 132: my_distance > Should my_distance be set to the BucketDistance of the new bucket in the ca You are correct, thanks for pointing this bug! Will fix this in my next submission. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 02 Apr 2020 02:23:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92 PS9, Line 92: return bucket_idx I understand this is the short-cut case for insert, and returning the index of a non-empty bucket might work here when called during insert, but it may not follow the contract described for the Probe() return value (see the comment in hash-table.h). Calls to Probe during lookup won't necessarily check the value of "found" and might not expect a return that's not one of (a) a match, (b) an empty bucket, or (c) BUCKET_NOT FOUND. For example not clear whether call from FindBuildRowBucket would expect a non-empty bucket as an indication of probe failure. I'm not sure what the best solution is given the need to return a swap target bucket somehow in the insert case, but it seems like we should do something to make the contract for the return value clearer and less overloaded. http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132 PS9, Line 132: my_distance Should my_distance be set to the BucketDistance of the new bucket in the case where we swap? -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 02 Apr 2020 01:41:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5665/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Mar 2020 18:45:38 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 9: (2 comments) Hi David, thank you for your feedback. Below is summary of Patch Set 6 to 9. Patch Set 6: - Create FLAGS_enable_robin_hood to enable Robin Hood hash table. - Create backend tests for Robin Hood hash table. Patch Set 7: - Fix clang-tidy warnings about tmp_bucket_ struct initialization. Patch Set 8: - Add MaxTravel profile counter Patch Set 9: - Add improvement to update probe stats just once prior to Probe return. http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG@8 PS7, Line 8: > I suggest we add an additional profile counter to capture the max travel di Done http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h@97 PS8, Line 97: } > Minor optimization suggestion - you should be able to avoid recalculating t Done -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Mar 2020 18:20:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#9). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 691 insertions(+), 82 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/9 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h@97 PS8, Line 97: ht_ctx->max_travel_length_ = std::max(step, ht_ctx->max_travel_length_); Minor optimization suggestion - you should be able to avoid recalculating the max on every iteration of the loop and update it just once per probe call prior to the returns. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 8 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Mar 2020 17:16:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5657/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 8 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Mar 2020 17:12:29 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#8). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 675 insertions(+), 80 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/8 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 8 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG@8 PS7, Line 8: I suggest we add an additional profile counter to capture the max travel distance for any single call to probe. This will help verify that robin-hood is in fact reducing the probe length variance as expected. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 7 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Mar 2020 23:52:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5647/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 7 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Mar 2020 21:08:51 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#7). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 665 insertions(+), 80 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/7 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 7 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 6: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/5643/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Mar 2020 16:33:01 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#6). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Add backend test for Robin Hood hash table in hash-table-test.cc - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table-test.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 10 files changed, 664 insertions(+), 80 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/6 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 6 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 5: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/5632/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Mar 2020 05:18:39 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 5: (1 comment) Patch Set 5 submitted. What is different from Patch Set 4 is that now we can do single pass insert both in HashTable::Insert and HashTable::Iterator::SetTuple. Right now, this patch is hijacking the FLAGS_enable_quadratic_probing to enable robin hood mode rather than the quadratic probing mode. I plan to create separate flag for robin hood mode in next iteration, so that we can do performance comparison between linear, quadratic, and robin hood linear probe. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: If a hash table is configured as a robin hood hash table, the new > I'm still thinking how to do single pass insert. From what I understand, we This is now addressed in Patch Set 5. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 28 Mar 2020 04:48:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#5). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. If a hash table is configured as a robin hood hash table, the new element insertion will be buffered in a temporary bucket. This temporary bucket will then matched against existing bucket elements, swapped with a "rich" bucket, and continue doing so until it swap with an empty bucket. The PSL (probe sequence length) invariant is maintained in robin hood hash table. This allow us to add short-circuit in the Probe function to immediately returns when it finds out that the PSL of currently visited bucket is smaller/richer than the key that is being looked up, indicating that the key does not exist in the table. Instead of continue probing until next empty bucket is found, Probe can immediately the return the index of recently visited richer bucket to caller along with the not found flag. In turn, the caller use this returned index to specify in which index the new element should be inserted to. Testing: - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exec/partitioned-hash-join-builder.cc M be/src/exec/partitioned-hash-join-node.cc M be/src/exprs/scalar-expr.h 9 files changed, 560 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/5 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact. > I've been looking at this. It looks like short-circuiting lookup only possi Agree that supporting quadratic probing + robin hood probably doesn't make sense. I think if we get linear probing + robin hood working well we could consider just switching to that (since quadratic probing exists mainly to solve some issues with linear probing and clusters of values). The other alternative would be to support two code paths, but it's preferable not to have both ways if it's not needed. IMPALA-2470 has context on the problem and how to reproduce it. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 26 Mar 2020 20:59:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 4: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/5601/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 26 Mar 2020 01:37:30 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 4: (6 comments) Patch set 4 submitted. This patch address some of comments from patch set 3, and add initial microbenchmarking code. After some exploration, I come to conclusion that quadratic probing does not fit well with robin hood hashing. We might be better off by dropping quadratic probing entirely and commit on robin hood linear probing only. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: bucket rebalancing after every insert. > You told me before about short-circuiting the lookup but I didn't quite get I've been looking at this. It looks like short-circuiting lookup only possible in linear probing mode because it exploit that linear probing behavior. Therefore, we need to drop quadratic probing mode first before we can fully implement all improvement that is possible in Robin Hood, linear probing. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the > Also I think when probing for lookup we might consider trying the "start in I'm still thinking how to do single pass insert. From what I understand, we should tweak the insertion pattern that is using Iterator. For example, FindProbeRow(), FindBuildRowBucket() should commit early whether they indeed plan to do insert or just testing existence. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115 PS3, Line 115: new_distance <= target_d > Ah, I see what you mean. Somehow I was under impression that element that i Done http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: // swapped back into temporary location at index curr_idx. > Sorry, this example is bad, because the final balanced bucked should be [-1 This check, unfortunately, need to stay because it is possible to swap with empty bucket in the case of quadratic probing. We should iron this out if we decide to drop quadratic probing. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: } else { > Wonder if we should consider storing the bucket distance (calculated during Patch Set 4 address O(1) distance measurement for linear probing. For quadratic probing, because the probe steps follow Triangular number pattern, we supposedly can do ~O(1) by doing its inverse function https://en.wikipedia.org/wiki/Triangular_number#Triangular_roots_and_tests_for_triangular_numbers However, I'm leaning towards dropping quadratic probing entirely, so I keep it as loop for quadratic probing. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: DCHECK(node_ != NULL); > > And hash-table suppose to be thread-safe for read access. Add rebalancing at the end of Iterator::SetTuple and it pass core tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 26 Mar 2020 01:25:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Hello David Rorke, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15511 to look at the new patch set (#4). Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This patch is intended to be the first step towards full robin hood hash table implementation by doing bucket rebalancing after every insert. Instead of proactively swapping elements during insertion, the algorithm does the rebalancing by first letting the new element probe an empty bucket and claim it through PrepareBucketForInsert(). After the bucket at index curr_idx claimed, the rebalancing algorithm in function Rebalance() walk through the probe sequence again to find richer or empty bucket to swap with bucket[curr_idx]. The search and swapping continue until it swap with empty bucket or does not find any richer bucket along the probe sequence. It effectively treat bucket[curr_idx] as temporary variable during the rebalancing process. Rebalance() return the new index of the recently claimed bucket after rebalancing process done. The decision to not swap elements proactively is because the probe function is used for both insertion and lookup-only. Keeping probing routine decoupled from insertion routine allow us to invoke rebalance only after bucket claim, not on lookup-only. This patch also maintain compatibility with quadratic probing. However, it block us from leveraging some robin hood technique that only possible with linear probing such as lookup short-circuit (fast lookup on nonexistent keys). Testing: - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/hash-table-benchmark.cc M be/src/exec/hash-table.cc M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h M be/src/exprs/scalar-expr.h 6 files changed, 610 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/4 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: if (curr_bucket->filled) { > I just remember that it is possible to swap with an empty bucket, especiall Sorry, this example is bad, because the final balanced bucked should be [-1, 1, 1, 2] A better example maybe is table with size 8, quadratic probe, with incoming element 1, 1, 1, 1, 7, 7. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Mar 2020 01:50:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the > Yes, my intuition is also that doing the additional work to create a separa Also I think when probing for lookup we might consider trying the "start in the middle / smart search" approach described here: https://programming.guide/robin-hood-hashing.html The tradeoff is it might be less cache friendly. But it might be best to explore this as a follow on change later just to keep the scope of the initial change manageable. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Mar 2020 01:43:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the > Having separate Probe() function that is special for insert might remove th Yes, my intuition is also that doing the additional work to create a separate single pass probe for insert and also making sure we maintain the distance invariant in all cases are both worthwhile. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: inline int64_t HashTable::BucketDistance( > I'll see if I can put O(1) distance measurement, at least for the linear pr Wonder if we should consider storing the bucket distance (calculated during insert/rebalance) in the bucket itself. This might work against the goals of IMPALA-7635 but with a carefully packed memory layout maybe we could still squeeze in a byte sized value for the distance. Seems intuitively like a byte should be enough to store most distances, and could fall back to calculating the distance via probe in rare cases where the distance won't fit in a byte. Of course if you can figure out a good enough O(1) approach that doesn't require any storage that might be simpler. Regarding linear vs quadratic I'd also hope that robin-hood doesn't require quadratic. Maybe we should just do some benchmarking and see if there's a clear winner between different probing algorithms and if so go with that. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Mar 2020 01:36:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: if (curr_bucket->filled) { > Good point, I must have miss this while I refine my code. Will change this I just remember that it is possible to swap with an empty bucket, especially in quadratic_probing mode. Consider hash-table with buckets size 4, quadratic probing, and incoming elements 1, 1, 2 in that order. Let say -1 represent empty bucket and hash of an element is equal to that element itself. Initially the buckets look like this: [-1, -1, -1, -1] After 1 and 1 inserted to the table, buckets will look like this [-1, 1, 1, -1] Now, when we want to insert 2 to table, this algorithm will put 2 at the last empty bucket in the probe sequence, that is index 3. [-1, 1, 1, 2] Then, we rebalance by swapping 2 at index 3 with 1 at index 2 [-1, 1, 2, 1] Now, the element 1 at index 3 is misplaced, because if we follow quadratic_probing, the appropriate probe sequence for 1 should be 1, 2, 0 (that is (2 + 2) mod 4), 3, and so on. So we need to swap 1 at index 3 with the empty bucket at index 0. [1, 1, 2, -1] At this point, bucket index 3 is an empty bucket and the rebalancing finish. This is kind of counter intuitive, because most literature show that robin-hood should pair with linear probing. And in linear probing, this case will not happen. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 25 Mar 2020 01:28:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: table_->PrepareBucketForInsert(bucket_idx_, hash); > And hash-table suppose to be thread-safe for read access. Yeah, it definitely needs to be thread-safe if muliontiple threads are reading from it, but mutations (like SetTuple()) do not need to be thread-safe. We can also document SetTuple() as invalidating other iterators, cause we don't depend on that either. SetTuple() is only used by the hash aggregator - see be/src/exec/grouping-aggregator-ir.cc. The algorithm is basically this: it = ht->FindBucket(...); if (found in hash table) { // Merge into the existing intermediate tuple UpdateTuple(it, input_row) } else { // Try to construct a new intermediate tuple new_tuple = TryConstructTuple() if (tuple construction failed due to OOM) { SpillRow(input_row) } else { it.SetTuple(new_tuple) UpdateTuple(it, input_row) } } Instead of FindBucket()/SetTuple() you could do Find()/Insert() but that would probe the hash table twice. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Mar 2020 21:42:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (6 comments) Hi Tim, thanks for your valuable feedbacks! http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact. > It's worth noting that this isn't the full robin-hood hashing approach, bec You told me before about short-circuiting the lookup but I didn't quite get it then. Now that you put it that way, I understand better what you mean. Yes, we can do the short-circuit lookup if we maintain the distance invariant. Right now, the invariant is not maintained because insert using iterator does not trigger rebalance. I'll check the code again and see if I can do something about it. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the > I guess we can revisit this to see what the impact is, it seems like it pro Having separate Probe() function that is special for insert might remove the redundant second pass. I'll see if this is possible. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115 PS3, Line 115: target_distance == 0 || > Why do we need this condition? My understanding was that you always want to Ah, I see what you mean. Somehow I was under impression that element that is already perfectly placed should not be evicted. Will remove this condition. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: if (curr_bucket->filled) { > Is it possible for this to be false? Why would there be an empty bucket *be Good point, I must have miss this while I refine my code. Will change this with DCHECK. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: inline int64_t HashTable::BucketDistance( > I guess calculating this for linear probing is relatively straightforward, I'll see if I can put O(1) distance measurement, at least for the linear probing. I did tried that before, and it failed PlannerTests. But that probably my implementation that still incorrect. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: table_->PrepareBucketForInsert(bucket_idx_, hash); > You could also rebalance there, I think, to benefit the agg. Or combine the If we rebalance here, it will reorder elements in array buckets_, which in turn breaking the order of iterator. And hash-table suppose to be thread-safe for read access. I'm worried if there are two iterator going where one do read only and the other do write through this SetTuple() function, the later iterator will break the former as well. Please correct me if my assumption is wrong, or if there is some guarantee that such case is impossible. I think I haven't fully understand the use case of this function. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Mar 2020 19:03:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact. It's worth noting that this isn't the full robin-hood hashing approach, because we're not short-circuiting the lookup based on the invariant (i.e. once the max distance of a hash value you'd seen on the probe exceeds your current distance, you know the key cannot be in the hash table). I think this may be a problem if you do benchmarks where most of the lookups are not present in the hash table. This might be significant for some workloads. But I guess for hash joins, this will give most of the benefit, if we assume that missing values are mostly filtered out by bloom filters in the scan. So if we see a big impact, we can look at optimising it further. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the I guess we can revisit this to see what the impact is, it seems like it probably means that insert is somewhat more expensive because of the two passes. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115 PS3, Line 115: target_distance == 0 || Why do we need this condition? My understanding was that you always want to maintain the invariant that the richer key gets bumped. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: if (curr_bucket->filled) { Is it possible for this to be false? Why would there be an empty bucket *before* the current position of the key (given that we don't support deletion from the hash table). I.e. Maybe this should be a DCHECK to enforce the invariant. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: inline int64_t HashTable::BucketDistance( I guess calculating this for linear probing is relatively straightforward, but not sure if there's a good way to calculate it for quadratic probing (quadratic probing + robin hood hashing might not be a great combination). We did the quadratic probing because of a tendency for the CRC-based hash to cluster. I don't know if robin-hood hashing is enough to counter-act that tendency. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: table_->PrepareBucketForInsert(bucket_idx_, hash); You could also rebalance there, I think, to benefit the agg. Or combine the rebalancing with PrepareBucketForInsert -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Mar 2020 16:23:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 ) Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/5557/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Sun, 22 Mar 2020 18:13:32 + Gerrit-HasComments: No
[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.
Riza Suminto has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15511 Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. .. WIP IMPALA-9434: Implement Robin Hood Hash Table. Robin hood hashing reduces the variances of probe lengths by continually rebalancing elements. This commit implement robin hood hashing behavior into exec/hash-table.h with small modification to the robin hood algorithm to keep probe function of hash-table intact. Instead of proactively swapping elements during insertion, the algorithm does the rebalancing by first letting the new element probe an empty bucket and claim it through PrepareBucketForInsert(). After the bucket at index curr_idx claimed, the rebalancing algorithm in function Rebalance() walk through the probe sequence again to find richer or empty bucket to swap with bucket[curr_idx]. The search and swapping continue until it swap with empty bucket or does not find any richer bucket along the probe sequence. It effectively treat bucket[curr_idx] as temporary variable during the rebalancing process. Rebalance() return the new index of the recently claimed bucket after rebalancing process done. The decision to not swap elements proactively is because the probe function is used for both insertion and lookup-only. Keeping probing routine decoupled from insertion routine allow us to invoke rebalance only after bucket claim, not on lookup-only. Insert into an already filed bucket (duplicates), or insert using hash-table iterator does not trigger rebalance. Testing: - Pass core tests Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 --- M be/src/exec/hash-table.h M be/src/exec/hash-table.inline.h 2 files changed, 94 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/3 -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Riza Suminto