zabetak commented on code in PR #4864:
URL: https://github.com/apache/hive/pull/4864#discussion_r1394239888
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
+ */
+ public static <T> boolean hasMoreOperatorsThan(Operator<?> start, Class<T>
opClazz, int limit) {
Review Comment:
Method name, Javadoc, and implementation are not fully in sync.
* The fact that a `ReduceSinkOperator` stops the traversal is only noticeable by looking into the implementation of the method; what happens if someone passes `ReduceSinkOperator.class` as `opClazz`?
* The fact that we only look at the ancestors, not the descendants, is also only noticeable by looking into the implementation.
* "More...Than" implies that if we have 2 matching operators and the limit is set to 2 the return value should be `false`; however, the Javadoc and the test cases say otherwise.
In addition, many other methods in this class traverse the operator graph in a similar fashion, which makes me think we could reuse some of that logic. How about introducing a more general method such as the following and building on top of it?
```java
public static Stream<Operator<?>> ancestors(Operator<?> start, Predicate<Operator<?>> stopCondition)
```
Or maybe adding a new `iterateParents` variant like:
```java
public static void iterateParents(Operator<?> start, Predicate<Operator<?>> stopCondition, Consumer<Operator<?>> action);
```
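To make the idea a bit more concrete, here is a rough, untested sketch of what such a traversal could look like; the method name, the `Stream`-based shape, and the `java.util` imports are just my suggestion, nothing that exists in `OperatorUtils` today:
```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Collects all ancestors of 'start' breadth-first, cutting a branch as soon as the
// stop condition matches; the visited set guards against diamonds/cycles in the DAG.
public static Stream<Operator<?>> ancestors(Operator<?> start, Predicate<Operator<?>> stopCondition) {
  Set<Operator<?>> visited = new HashSet<>();
  Deque<Operator<?>> queue = new ArrayDeque<>();
  queue.add(start);
  Stream.Builder<Operator<?>> result = Stream.builder();
  while (!queue.isEmpty()) {
    Operator<?> op = queue.poll();
    if (!visited.add(op) || stopCondition.test(op)) {
      continue;
    }
    result.add(op);
    if (op.getParentOperators() != null) {
      queue.addAll(op.getParentOperators());
    }
  }
  return result.build();
}
```
With something like that in place, the new check would essentially reduce to a count over the stream, e.g.:
```java
// "strictly more than limit" semantics; stop walking a branch at the first ReduceSinkOperator
boolean exceeded = ancestors(start, ReduceSinkOperator.class::isInstance)
    .filter(opClazz::isInstance)
    .count() > limit;
```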
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java:
##########
@@ -755,6 +757,20 @@ private boolean checkConvertJoinSMBJoin(JoinOperator
joinOp, OptimizeTezProcCont
LOG.debug("External table {} found in join and also could not provide
statistics - disabling SMB join.", sb);
return false;
}
+ // GBY operators buffers one record. These are processed when
ReduceRecordSources flushes the operator tree
+ // when end of record stream reached. If the tree has more than two GBY
operators CommonMergeJoinOperator can
+ // not process all buffered records.
+ // HIVE-27788
+ if (parentOp.getParentOperators() != null) {
+ for (Operator<?> grandParent : parentOp.getParentOperators()) {
Review Comment:
Why are we starting from the `grandParent` and not from the `parent`?
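For comparison, if the intention is just to inspect the whole branch feeding the join, a variant starting at the parent itself could look like the following; this is purely illustrative, and `GBY_LIMIT` is a placeholder, not the value used in this PR:
```java
// Hypothetical variant: start the upstream scan at the parent, so a GBY sitting
// directly above the join is counted as well. GBY_LIMIT is a placeholder constant.
if (OperatorUtils.hasMoreOperatorsThan(parentOp, GroupByOperator.class, GBY_LIMIT)) {
  return false;
}
```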
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/TestOperatorUtils.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.plan.CommonMergeJoinDesc;
+import org.apache.hadoop.hive.ql.plan.FilterDesc;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.LimitDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceSinkDesc;
+import org.apache.hadoop.hive.ql.plan.SelectDesc;
+import org.apache.hadoop.hive.ql.plan.TableScanDesc;
+import org.junit.jupiter.api.Test;
+
+import static java.util.Arrays.asList;
+import static java.util.Collections.singletonList;
+import static org.junit.jupiter.api.Assertions.*;
+
+class TestOperatorUtils {
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenLimitIs0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 0));
Review Comment:
This result seems weird since the graph has no GroupByOperator; how can we say that we exceeded the limit?
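To spell out what I would expect from the name for this graph (no GBY at all in the branch):
```java
// With no GroupByOperator in the branch, zero matches is not "more than" zero,
// so with the current naming I would expect this assertion to hold instead:
assertFalse(OperatorUtils.hasMoreOperatorsThan(filter, GroupByOperator.class, 0));
```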
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
+ */
+ public static <T> boolean hasMoreOperatorsThan(Operator<?> start, Class<T>
opClazz, int limit) {
+ int count = limit;
+ if (count <= 0) {
Review Comment:
According to this, the method will always return true when the limit is zero (or negative), no matter what the other inputs are. Is this intended? It basically claims that there are more than zero operators of the specified class in the graph without even checking the graph, which doesn't seem accurate.
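Just to illustrate what I mean, an untested sketch that only answers after actually walking the graph could look roughly like this, keeping the stop-at-`ReduceSinkOperator` behaviour of the current implementation and using strict "more than" semantics (the usual `java.util` collection imports are assumed):
```java
public static boolean hasMoreOperatorsThan(Operator<?> start, Class<? extends Operator<?>> opClazz, int limit) {
  int count = 0;
  Set<Operator<?>> visited = new HashSet<>();
  Deque<Operator<?>> queue = new ArrayDeque<>();
  queue.add(start);
  while (!queue.isEmpty()) {
    Operator<?> op = queue.poll();
    if (!visited.add(op) || op instanceof ReduceSinkOperator) {
      continue;  // already seen, or reached a reduce sink boundary
    }
    if (opClazz.isInstance(op) && ++count > limit) {
      return true;  // strictly more than 'limit' matching operators found
    }
    if (op.getParentOperators() != null) {
      queue.addAll(op.getParentOperators());
    }
  }
  return false;
}
```
With this shape, a zero (or negative) limit simply means "return true as soon as a single match is found", which I think is closer to what the name promises.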
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/TestOperatorUtils.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.plan.CommonMergeJoinDesc;
+import org.apache.hadoop.hive.ql.plan.FilterDesc;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.LimitDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceSinkDesc;
+import org.apache.hadoop.hive.ql.plan.SelectDesc;
+import org.apache.hadoop.hive.ql.plan.TableScanDesc;
+import org.junit.jupiter.api.Test;
+
+import static java.util.Arrays.asList;
+import static java.util.Collections.singletonList;
+import static org.junit.jupiter.api.Assertions.*;
+
+class TestOperatorUtils {
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenLimitIs0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 0));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsFalseWhenNoGBYInBranchAndLimitIsMoreThan0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsFalseWhenNumberOfGBYIsLessThanLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> gby = OperatorFactory.get(context, GroupByDesc.class);
+ select.setParentOperators(singletonList(gby));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby.setParentOperators(singletonList(rs));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenNumberOfGBYIsEqualsWithLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ gby1.setParentOperators(singletonList(select));
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ select.setParentOperators(singletonList(filter));
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby2));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(gby1, GroupByOperator.class,
2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsFalseWhenNumberOfGBYIsEqualsWithLimitButHasAnRSInTheMiddle()
{
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ gby1.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ rs.setParentOperators(singletonList(gby2));
+ Operator<?> ts = OperatorFactory.get(context, TableScanDesc.class);
+ gby2.setParentOperators(singletonList(ts));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(gby1,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsTrueWhenBranchHasJoinAndNumberOfGBYIsEqualsWithLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> join = OperatorFactory.get(context, CommonMergeJoinDesc.class);
+ gby1.setParentOperators(singletonList(join));
+
+ // Branch #1 has the second GBY
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby2));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs));
+
+ // Branch #2
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ Operator<?> rs2 = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs2));
+
+ join.setParentOperators(asList(filter, select));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(gby1, GroupByOperator.class,
2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsFalseWhenBranchHasJoinAndBothJoinBranchesHasLessGBYThanLimit()
{
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> join = OperatorFactory.get(context, CommonMergeJoinDesc.class);
+
+ // Branch #1 has the second GBY
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby1));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby1.setParentOperators(singletonList(rs));
+
+ // Branch #2
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ select.setParentOperators(singletonList(gby2));
+ Operator<?> rs2 = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs2));
+
+ join.setParentOperators(asList(filter, select));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(join,
GroupByOperator.class, 2));
Review Comment:
This result is also a bit surprising.
```
RS - GBY2 - FIL - \
JOIN
RS - GBY1 - FIL - /
```
The JOIN tree has two GBY operators, so why should this method return false?
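To make the expectation explicit: counting over the whole upstream graph of the JOIN gives two GBY operators, so under the "count equals limit returns true" semantics used elsewhere in this test class I would have expected something like:
```java
// Two GBY operators in total upstream of the join (one per branch); with the
// "count == limit => true" semantics used in the other tests, this should arguably hold:
assertTrue(OperatorUtils.hasMoreOperatorsThan(join, GroupByOperator.class, 2));
```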
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
Review Comment:
typo: of -> if
##########
ql/src/test/queries/clientpositive/auto_sortmerge_join_17.q:
##########
@@ -0,0 +1,20 @@
+CREATE TABLE tbl1_n5(key int, value string) CLUSTERED BY (key) SORTED BY (key)
INTO 2 BUCKETS;
+
+insert into tbl1_n5(key, value)
+values
+(0, 'val_0'),
+(2, 'val_2'),
+(9, 'val_9');
+
+explain
+SELECT t1.key from
+(SELECT key , row_number() over(partition by key order by value desc) as rk
from tbl1_n5) t1
+join
+( SELECT key,count(distinct value) as cp_count from tbl1_n5 group by key) t2
+on t1.key = t2.key where rk = 1;
+
+SELECT t1.key from
+(SELECT key , row_number() over(partition by key order by value desc) as rk
from tbl1_n5) t1
Review Comment:
As mentioned in the JIRA ticket, it's worth clarifying whether the PTF is necessary to trigger the problem. If not, we should minimize the repro.
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
Review Comment:
The type variable `T` seems redundant; the method does not really need to be generic.
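For example, the signature could simply be the following (just a suggestion; the bounded wildcard is optional):
```java
// Non-generic alternative; Class<? extends Operator<?>> additionally restricts callers to operator classes.
public static boolean hasMoreOperatorsThan(Operator<?> start, Class<? extends Operator<?>> opClazz, int limit)
```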
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]