Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17592 )
Change subject: IMPALA-7635: Reducing HashTable size by packing it's buckets efficiently. ...................................................................... Patch Set 10: (43 comments) http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@9 PS9, Line 9: HashTable implementation in Impala comprises of contiguous array > nit. in Impala Done http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@10 PS9, Line 10: ns either data or pointer to : linked list of du > nit. "contains either data or a pointer to linked" Done http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@15 PS9, Line 15: bool matched; : DuplicateNode* next; : HtData htdata; : }; : : struct Bucket { : bool filled; : bool matched; : bool hasDuplicates; : uint32_t hash; : union { : HtData htdata; : DuplicateNode* duplicates; : } bucketData; : }; : > nit. The commit message is nice with many details. I wonder if it can be si It would be easier for reader to look at struct to digest the new and old size and where size reduction is coming from. http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@38 PS9, Line 38: e mos > nit. Intel Done http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@53 PS9, Line 53: sures siz > separately Done http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@58 PS9, Line 58: class to store a pointer and 7-bit tag together in 64 bit integer. > nit: a template class Done http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@59 PS9, Line 59: ownership of the pointer and w > nit. "tags together in 64-bit integers". i have instead changed 'pointers' to singular to avoid impression of containing multiple pointers at once. http://gerrit.cloudera.org:8080/#/c/17592/9//COMMIT_MSG@71 PS9, Line 71: med after building the Table. > nit. use lower cases. Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc File be/src/benchmarks/hash-table-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@47 PS9, Line 47: Hash Table Build: Function iters/ms 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile > nit. Formatting. The space under "HashT able Build" is wasted. Suggest to b This is common format for all benchmarks and this output is generated by common benchmark code for all other benchmarks. Moreover line is not wasted - it is to accommodate '(relative)' below. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@50 PS9, Line 50: 65536_100 > nit. Add a comment on the meaning of these two numbers. Line 40 outlines it already. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@254 PS9, Line 254: fo > I think we should return bool here to indicate whether there are any troubl I have converted it into CHECK now. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@261 PS9, Line 261: > nit. This can be defined outside the for loop. Status passed to 'insert' is not overwritten in all cases, hence a new one needs to be passed always. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@267 PS9, Line 267: // Clear old > nit. can we use name space htbenchmark here? They are now nested into htbenchmark namespace itself. There is separate namespace within htbenchmark for build and probe benchmarks to organise the methods they exclusively need. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/benchmarks/hash-table-benchmark.cc@279 PS9, Line 279: > nit. We should check the return status from Build(). its not needed now as Build has CHECK. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-ir.cc File be/src/exec/grouping-aggregator-ir.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-ir.cc@113 PS9, Line 113: e wil > nit: I wonder if instead of true/false we had an enum, then the code could Sure, Done. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-ir.cc@240 PS9, Line 240: This is called from ProcessBatchStreaming() so the rows are not aggregated. : Hash > nit: we could keep the old 'else if' and formatting so it is easier to spot Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-ir.cc@245 PS9, Line 245: se > nit: +2 indent Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-partition.cc File be/src/exec/grouping-aggregator-partition.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/grouping-aggregator-partition.cc@126 PS9, Line 126: ion = > nit: can we use an enum here as well? Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h File be/src/exec/hash-table.h: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@661 PS9, Line 661: > nit: this definition is not needed Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@689 PS9, Line 689: > nit: its Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@692 PS9, Line 692: tes whether > seems like it's not true anymore right. removed it. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@694 PS9, Line 694: it is ef > nit. this does not match the code at line 704. right. changed the comment. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@698 PS9, Line 698: > Tag bit 1? yes. changed it. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@713 PS9, Line 713: r > I wonder if why this argument is needed. Is it true that all hash tables ha No, hash tables don't have this tag. This tag is used to only set or update bucket data where client knows if data is tagged or not. it is required to reduce time complexity. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@723 PS9, Line 723: / If struct Bucket is > nit: this definition is not needed. Removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.h@732 PS9, Line 732: bucket_data.htdata.tuple = bd.GetTuple<TAGGED>(); : return buck > Seems like it's not true anymore Right, it's removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.cc File be/src/exec/hash-table.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/exec/hash-table.cc@579 PS9, Line 579: Status hash_allocation_status = > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/benchmark.cc File be/src/util/benchmark.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/benchmark.cc@143 PS9, Line 143: if (fn != NULL) fn(benchmarks_[0].args); > nit. May run Clang-format. {} is not needed if the entire IF statement is i Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/benchmark.cc@174 PS9, Line 174: if (fn != NULL) fn(benchmarks_[i].args); > nit. same as above. Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc File be/src/util/tagged-ptr-test.cc: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@27 PS9, Line 27: > nit const std::string& Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@42 PS9, Line 42: > nit. const std::string& Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@59 PS9, Line 59: > nit. I wonder if we use OWN=true version in BE. If so, you may test it well We use OWN=false in backend too. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@92 PS9, Line 92: EXPECT_FALSE(ptr.IsTagBitSet<i>()); : > nit. May include setting tag bit 3, 4,5 and 6. Have added above. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@97 PS9, Line 97: ch > nit. Please use 0x60 here to indicate which bits are set better. Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@102 PS9, Line 102: tS > nit same as above. Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr-test.cc@144 PS9, Line 144: > nit remove the space. Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h File be/src/util/tagged-ptr.h: http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@47 PS9, Line 47: /// allocation/deallocation of the object to which pointer stored points to should be > Instead of macros these could be template variables and template functions: Makes sense. This simplifies things. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@54 PS9, Line 54: > nit. Unset Done http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@46 PS9, Line 46: /// Derived class would be needed to define what each tag bit would mean or when : /// allocation/deallocation of the object to which pointer stored points to should be : /// the responsibility of client and not 'TaggedPtr'( check HashTable::TaggedBucketData). : : template <class T, bool OWNS = true> : class TaggedPtr { : public: : TaggedPtr() = default; : : template <class... Args> : static TaggedPtr make_tagptr(Args&&... args) { : T* ptr = new T(std::forward<Args>(args)...); : return TaggedPtr(ptr); : } : > nit. May run ClangFormat on the entire macro to align '\' nicely. Code Removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@99 PS9, Line 99: return data_ & DATA_MASK<bit>; > nit. tag bits 0-6. Code Removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@100 PS9, Line 100: } : : ALWAYS_INLINE bool IsNull() { return GetPtr() == 0; } : > nit. This can be moved to where the macro is defined (e.g. line 46). Code Removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@104 PS9, Line 104: ALWAYS_INLINE int GetTag() const { return data_ >> 57; } : : ALWAYS_INLINE T* GetPtr() const { return (T*)(data_ & MASK_0_56_BITS); } : : T& operator*() const noexcept { return *GetPtr(); } : > nit. remove? Code Removed. http://gerrit.cloudera.org:8080/#/c/17592/9/be/src/util/tagged-ptr.h@137 PS9, Line 137: > This can be made a constant to avoid compute ~ every time SetPtr() is calle Done -- To view, visit http://gerrit.cloudera.org:8080/17592 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b Gerrit-Change-Number: 17592 Gerrit-PatchSet: 10 Gerrit-Owner: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 18 Aug 2021 14:09:18 +0000 Gerrit-HasComments: Yes