Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/17983 )
Change subject: IMPALA-10920: Zipping unnest for arrays ...................................................................... Patch Set 5: (43 comments) http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h File be/src/exec/unnest-node.h: http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@40 PS2, Line 40: SlotRefs > If they are always SlotRefs, can't the type of the vector be std::vector<Sl This vector is passed to ScalarExpr::Create() that populates it, but that Create() function receives a vector of ScalarExpr pointers so have to keep them as they are now. http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@42 PS2, Line 42: 'UnnestNode::coll_values_', but instead manually retrieve the slot values to support : /// projection (see cla > I think it should be mentioned that 'coll_values_' and the class comment ar Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@46 PS2, Line 46: UnnestPlanNode. > 'coll_expr_evals_' has been removed. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@46 PS2, Line 46: t in : /// InitCollExpr(). > It seems no longer to be true that they are set in Prepare() and set to NUL Right, these are set to null in a different class not here. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@50 PS2, Line 50: Set in InitCollE > It seems no longer to be the case. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@109 PS2, Line 109: . > Nit: belong. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@116 PS2, Line 116: doesn > Nit: doesn't. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@126 PS2, Line 126: These slots are always set to NULL in : /// Open() as a simple projection. > Does not seem to be true anymore. In fact it's partially true. These still are projected to NULL but not set in Prepare. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@131 PS2, Line 131: > Does not seem to be true. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.h@144 PS2, Line 144: length > typo Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc File be/src/exec/unnest-node.cc: http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc@60 PS2, Line 60: DCHECK(coll_expr->IsSlotRef( > This DCHECK is superfluous because slot_ref==null iff coll_expr==nullptr, a Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc@90 PS2, Line 90: num_collections_counter_(nullptr) { > nit: usually the constant is the second argument. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc@143 PS2, Line 143: if (num_tuples > 0) { > nit: here and at line 146: using std::max would be easier to read IMO Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc@174 PS2, Line 174: > Nit: should be collections. Done http://gerrit.cloudera.org:8080/#/c/17983/2/be/src/exec/unnest-node.cc@218 PS2, Line 218: (*coll_slot_descs_)[coll_idx]->children_tuple_de > Can't we use nullptr everywhere? My understanding is that nullptr as tuple As I remember I had to create the tuple in the row batch and set it to null explicitly otherwise there is a point somewhere later where Impala crashes. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/cup/sql-parser.cup@315 PS2, Line 315: KW_UNNEST > We should add to the comment that we have added a new keyword, making this Mentioned it in the commit message. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/cup/sql-parser.cup@3030 PS2, Line 3030: Providing multiple UNNEST() in the FROM clause is not supported. > It this something that we want to support later? It depends on what semantics you want to achieve with multiple unnests. In case you want to do zipping unnest between the arrays in the same unnest but do joining unnest with the ones on different unnest then this might be feasible, but a bit hard to understand. And anyway zipping and joining unnest is not allowed together atm. I'm not really convinced we want to support this later as I see no added value. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/cup/sql-parser.cup@3091 PS2, Line 3091: > What if the user provides more aliases than paths? Indeed, thx :) Added some tests as well to cover this. Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@574 PS2, Line 574: tableRefsFromUnnestExpr > Nit: Should end in an underscore (_). Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@706 PS2, Line 706: return aliasMap_.containsKey(uniqueAlias); : } > Nit: 'return aliasMap_.containsKey(uniqueAlias)' would be simpler. Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@966 PS2, Line 966: Expr; > Isn't this a bug at this point if the type is not a SlotRef? Shouldn't we a I found a use case where collExpr is not a SlotRef. I used to have a Precondition here but that broke. Frankly, I don't remember what the issue was exactly, but for instance this query would break the Precondition: select arr1.item from unnest(functional_parquet.complextypes_arrays.arr1) where arr1.item < 2; http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@389 PS2, Line 389: typed columns because " > Does this make it necessary to re-analyse 'fromClause_'? Yes, a re-analysis is required. In PS3 I did a change that forces re-analysis in this case to fix a known bug. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java File fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@31 PS2, Line 31: represent > Could we have a class comment saying that this is for zipping unnests in th Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@36 PS2, Line 36: > Don't we have to call removeItemFromPath() here? What if 'path' already con The path has to point to an array and not to it's items. I path contains an "item" already that means that the underlying data structure is an array<array<something>> so it's fin to add another "item" at the end so that the path will point to the item of the inner array. Added a test to cover an error when "unnest(arr1.item)" and arr1.item not an array. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@42 PS2, Line 42: > Can we actually unnest non-top-level arrays? Even if not, I think this may Agreed. Meanwhile I removed the support for arr.item where it is not an array itself. I think it happened in PS3. About nested arrays: It is not supported atm to unnest a nested array as arr.item. I'm not sure we want to do that, or should we open a jira to implement later. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@68 PS2, Line 68: // 'rawPath_' points to an array and we need a SlotRef to refer to the item of the > Have we ensured that 'rawPath_' doesn't already contain "item"? See comment In resolveAndVerifyRawPath() we check that the given path points to an array. This is enough to check in my opinion. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@142 PS2, Line 142: Preconditions.checkNotNull(rawPath_); > Why don't we just remove the last element from the list? Do we want to keep It's fine to remove the item. I'm not sure why I've done it this way. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@146 PS2, Line 146: } : : @Override : p > Can't we just inherit this method without writing it out? Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/analysis/UnnestExpr.java@154 PS2, Line 154: } : } : : > Can't we just inherit this method without writing it out? Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/PlanNode.java File fe/src/main/java/org/apache/impala/planner/PlanNode.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/PlanNode.java@164 PS2, Line 164: for (CollectionTableRef collRef : tblRefs) { : tupleIds_.add(collRef.getDesc().getId()); : tblRefIds_.add(collRef.getDesc().getId()); : } > Couldn't we to this in the UnnestNode's constructor instead and avoid addin The reason I do it here is that I have to pass a List of CollectionTableRef to UnnestNode, but then I won't be able to extract any tuple IDs from that list before calling the PlanNode's constructor as the "super(...)" call has to be the first instruction in a constructor. I could pass all these IDs to the UnnestNode constructor in a separate parameter but that seemed redundant for me as they are already present in the CollectionTableRefs. http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1047 PS2, Line 1047: the > nit: extra "the" Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1060 PS2, Line 1060: refs.removeAll(reducedCollectionRefs); > Dumb question: is it sure that 'refs' doesn't contain duplicates? Can it be You can't provide the same collection multiple times in the unnest so here you won't end up having duplicates. select arr1.item from functional_parquet.arr_tbl t, UNNEST(t.arr1, t.arr1); AnalysisException: Duplicate table alias: 'arr1' When giving in select list, you can give the same array multiple times but it doesn't cause any trouble because only a single CollectionTableRef is created for them in the background: select unnest(arr1), unnest(arr1) from functional_parquet.arr_tbl; +------+------+ | item | item | +------+------+ | 1 | 1 | | 2 | 2 | +------+------+ http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@2175 PS2, Line 2175: ning of the gi > Could you document what 'collectionRefs' is used for? Done http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/UnnestNode.java File fe/src/main/java/org/apache/impala/planner/UnnestNode.java: http://gerrit.cloudera.org:8080/#/c/17983/2/fe/src/main/java/org/apache/impala/planner/UnnestNode.java@74 PS2, Line 74: super.init(analyzer); : conjuncts_ = orderConjunctsByCost(conjuncts_); : > This doesn't seem true anymore, right? I guess it isn't :) Comment removed. http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/data/README@662 PS2, Line 662: bles > nit: double "to" Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/data/README@662 PS2, Line 662: The purpose of introducing these tables > Nit: "The purpose OF introducing THESE tableS IS to ..." Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test File testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test: http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test@91 PS2, Line 91: the first one has some null > The string array doesn't have NULL items in this test. Is it ok? The comment is misleading. Only arr1 has nulls. Fixed the comment. http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test@183 PS2, Line 183: ==== > Would it add value if we also tested the combination (NULL, empty)? >From the UnnestNode's perspective having a NULL or empty array as an input is >the same. I think it is fine not to cover that. http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test@222 PS2, Line 222: doesn't wo > Nit: doesn't work Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-from-clause.test@311 PS2, Line 311: : > Nit: a colon (:) would be better in the error message. Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-select-list.test File testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-select-list.test: http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-select-list.test@55 PS2, Line 55: > Nit: q Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-select-list.test@76 PS2, Line 76: omp > Nit: delete 'an'. Done http://gerrit.cloudera.org:8080/#/c/17983/2/testdata/workloads/functional-query/queries/QueryTest/zipping-unnest-in-select-list.test@138 PS2, Line 138: ---- QUERY : # Struct type is given for the unnest operator. > The difference from the previous test is only the array in the FROM clause. I wanted to differentiate between these two use cases: 1: unnesting the same array both in the select list and in the from clause 2: have an unnest both in the select list and from clause but not for the same arrays. -- To view, visit http://gerrit.cloudera.org:8080/17983 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic58ff6579ecff03962e7a8698edfbe0684ce6cf7 Gerrit-Change-Number: 17983 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Nov 2021 14:48:12 +0000 Gerrit-HasComments: Yes