Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 )
Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. ...................................................................... Patch Set 4: (6 comments) Patch set 4 submitted. This patch address some of comments from patch set 3, and add initial microbenchmarking code. After some exploration, I come to conclusion that quadratic probing does not fit well with robin hood hashing. We might be better off by dropping quadratic probing entirely and commit on robin hood linear probing only. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: bucket rebalancing after every insert. > You told me before about short-circuiting the lookup but I didn't quite get I've been looking at this. It looks like short-circuiting lookup only possible in linear probing mode because it exploit that linear probing behavior. Therefore, we need to drop quadratic probing mode first before we can fully implement all improvement that is possible in Robin Hood, linear probing. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the > Also I think when probing for lookup we might consider trying the "start in I'm still thinking how to do single pass insert. From what I understand, we should tweak the insertion pattern that is using Iterator. For example, FindProbeRow(), FindBuildRowBucket() should commit early whether they indeed plan to do insert or just testing existence. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115 PS3, Line 115: new_distance <= target_d > Ah, I see what you mean. Somehow I was under impression that element that i Done http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: // swapped back into temporary location at index curr_idx. > Sorry, this example is bad, because the final balanced bucked should be [-1 This check, unfortunately, need to stay because it is possible to swap with empty bucket in the case of quadratic probing. We should iron this out if we decide to drop quadratic probing. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: } else { > Wonder if we should consider storing the bucket distance (calculated during Patch Set 4 address O(1) distance measurement for linear probing. For quadratic probing, because the probe steps follow Triangular number pattern, we supposedly can do ~O(1) by doing its inverse function https://en.wikipedia.org/wiki/Triangular_number#Triangular_roots_and_tests_for_triangular_numbers However, I'm leaning towards dropping quadratic probing entirely, so I keep it as loop for quadratic probing. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: DCHECK(node_ != NULL); > > And hash-table suppose to be thread-safe for read access. Add rebalancing at the end of Iterator::SetTuple and it pass core tests. -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Thu, 26 Mar 2020 01:25:58 +0000 Gerrit-HasComments: Yes