Dan Hecht has posted comments on this change. Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder ......................................................................
Patch Set 14: (19 comments) Flushing out some more comments, and will move onto the later patch set. One thing that's not clear to me yet is what the right separation between node and sink is -- they are currently very tightly coupled which means they don't really benefit from being separated into separate classes. Is the goal of this change solely to be able to expose the DataSink interface, or is it to provide some separation as well? Do we see them becoming less coupled over time? http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: PS14, Line 45: build row partitions is this the same as the "hash partitions" referred to in the next paragraph. let's be consistent in terminology. Line 86: is it expected that the renaming public functions would only be used by PHJN? If so, would be good to state that. PS14, Line 90: Acquire ownership Doesn't the ownership pass from this class to the exec node? In which case this method wouldn't acquire ownership. So let's come up with better terminology and naming. PS14, Line 95: Called after probing of the partitions : /// is done. The partitions are not closed or destroyed, since they may be spilled or : /// may contain unmatched build rows for certain join modes (e.g. right outer join). The fact that we need some much explanation kinda makes this seem like it might not be the right abstraction. Could we instead just clear the vector when creating the next set of partitions (it doesn't look like we delete partitions until either Close() or Reset() anyway), so why do we need to proactively clearly the hash_partitions_ vector before starting he next iteration? Line 106: } nit: missing blank line PS14, Line 109: When this call returns Is this stuff by this method or the caller? i.e. if this method closes input_partition, then say that. If it doesn't then either say "after" or don't say it at all (why is it relevant to this method documentation?). PS14, Line 112: so that the probe phase has enough buffers preallocated : /// execute successfully. slightly garbled. seems like a word is missing or something. PS14, Line 117: /// Iterates over all the partitions in 'hash_partitions_' and returns the largest : /// number of build rows. rather than say how and referring to private member, and leaving it to the reader to assume that "largest number of build rows" is talking about a per-partition thing, let's say something like: Returns the maximum number of build rows of a hash partition (or whatever terminology you choose in the class comment). Line 123: bool HashTableStoresNulls() const; why is this here rather than the join node? especially given that the join node owns the HashTableCtx. on a related note, I'm not really sure how the HashTable context is going to work out once this is driven like a real sync. Will we still have a parent_ link to the exec node living in a different fragment? Line 128: inline const std::vector<bool>& is_not_distinct_from() const { and then should this be here? is it to be closer to being able to break the parent_ link at some point and make the builder able to build the hash table without any info from parent_? PS14, Line 182: partition side of this partition not sure what this is saying. maybe just "for this build partition"? Line 188: /// false. maybe clarify that an error is not returned in that case. or clarify what cases are errors. etc. PS14, Line 195: unpin_all_build why "build"? shouldn't it be unpin_all_blocks? or even better would be to name it according to what the algorithm cares about. which I think is whether or not we might have more build rows to append, right? PS14, Line 234: build or probe does this care about the probe side partitioning? Line 237: /// 'null_aware_probe_partition_' and 'null_probe_rows_'. Also explain what this is computing, rather than just what the computation is. PS14, Line 244: all hash partitions for partitioning level this is kinda misleading. maybe qualify with ... for the current input, or the "current set of hash partitions for partition level...". i.e. this isn't all the hash partitions that are created for a particular level. PS14, Line 252: in into PS14, Line 256: in memory what does this mean? is it trying to say it fits in the current (pinned) block? http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.h File be/src/exec/partitioned-hash-join-node.h: PS14, Line 410: we then iterate over all the probe rows > Clarified that it's the partition's. Most of the text was just copied verba Thanks. From the perspective of the reviewer, the comment is already part of the patch (the code moved in non-trivial ways), so the reviewer has to read the comment. Since the reviewers need to read these comments, we might as well improve the comments now when we can make meaningful improvements since it's a relatively cheap way to improve readability (doesn't lead to additional debugging, testing, etc). -- To view, visit http://gerrit.cloudera.org:8080/3873 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes