zabetak commented on code in PR #4864:
URL: https://github.com/apache/hive/pull/4864#discussion_r1394239888
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
+ */
+ public static <T> boolean hasMoreOperatorsThan(Operator<?> start, Class<T>
opClazz, int limit) {
Review Comment:
Method name, Javadoc, and implementation are not fully in sync.
* The fact that a `ReduceSinkOperator` stops the traversal is only noticeable by looking into the implementation of the method; what happens if someone passes `ReduceSinkOperator.class` as `opClazz`?
* The fact that we only look at the ancestors, not the descendants, is also only noticeable by looking into the implementation.
* "More...Than" implies that if we have 2 matching operators and the limit is set to 2 the return value should be `false`; however, the Javadoc and the test cases say otherwise.
In addition, many other methods in this class traverse the operator graph in a similar fashion, which makes me think we could reuse some of that logic. How about introducing a more general method such as the following and building on top of it?
```java
public static Stream<Operator<?>> ancestors(Operator<?> start, Predicate<Operator<?>> stopCondition)
```
Or maybe adding a new `iterateParents` variant like:
```java
public static void iterateParents(Operator<?> start, Predicate<Operator<?>> stopCondition, Consumer<Operator<?>> action);
```
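To make the idea a bit more concrete, here is a rough, untested sketch of what such a traversal could look like; the method name, the `Stream`-based shape, and the `java.util` imports are just my suggestion, nothing that exists in `OperatorUtils` today:
```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Collects all ancestors of 'start' breadth-first, cutting a branch as soon as the
// stop condition matches; the visited set guards against diamonds/cycles in the DAG.
public static Stream<Operator<?>> ancestors(Operator<?> start, Predicate<Operator<?>> stopCondition) {
  Set<Operator<?>> visited = new HashSet<>();
  Deque<Operator<?>> queue = new ArrayDeque<>();
  queue.add(start);
  Stream.Builder<Operator<?>> result = Stream.builder();
  while (!queue.isEmpty()) {
    Operator<?> op = queue.poll();
    if (!visited.add(op) || stopCondition.test(op)) {
      continue;
    }
    result.add(op);
    if (op.getParentOperators() != null) {
      queue.addAll(op.getParentOperators());
    }
  }
  return result.build();
}
```
With something like that in place, the new check would essentially reduce to a count over the stream, e.g.:
```java
// "strictly more than limit" semantics; stop walking a branch at the first ReduceSinkOperator
boolean exceeded = ancestors(start, ReduceSinkOperator.class::isInstance)
    .filter(opClazz::isInstance)
    .count() > limit;
```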
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java:
##########
@@ -755,6 +757,20 @@ private boolean checkConvertJoinSMBJoin(JoinOperator
joinOp, OptimizeTezProcCont
LOG.debug("External table {} found in join and also could not provide
statistics - disabling SMB join.", sb);
return false;
}
+ // GBY operators buffers one record. These are processed when
ReduceRecordSources flushes the operator tree
+ // when end of record stream reached. If the tree has more than two GBY
operators CommonMergeJoinOperator can
+ // not process all buffered records.
+ // HIVE-27788
+ if (parentOp.getParentOperators() != null) {
+ for (Operator<?> grandParent : parentOp.getParentOperators()) {
Review Comment:
Why are we starting from the `grandParent` and not from the `parent`?
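For comparison, if the intention is just to inspect the whole branch feeding the join, a variant starting at the parent itself could look like the following; this is purely illustrative, and `GBY_LIMIT` is a placeholder, not the value used in this PR:
```java
// Hypothetical variant: start the upstream scan at the parent, so a GBY sitting
// directly above the join is counted as well. GBY_LIMIT is a placeholder constant.
if (OperatorUtils.hasMoreOperatorsThan(parentOp, GroupByOperator.class, GBY_LIMIT)) {
  return false;
}
```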
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/TestOperatorUtils.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.plan.CommonMergeJoinDesc;
+import org.apache.hadoop.hive.ql.plan.FilterDesc;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.LimitDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceSinkDesc;
+import org.apache.hadoop.hive.ql.plan.SelectDesc;
+import org.apache.hadoop.hive.ql.plan.TableScanDesc;
+import org.junit.jupiter.api.Test;
+
+import static java.util.Arrays.asList;
+import static java.util.Collections.singletonList;
+import static org.junit.jupiter.api.Assertions.*;
+
+class TestOperatorUtils {
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenLimitIs0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 0));
Review Comment:
This result seems weird since the graph has no GroupByOperator; how can we say that we exceeded the limit?
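To spell out what I would expect from the name for this graph (no GBY at all in the branch):
```java
// With no GroupByOperator in the branch, zero matches is not "more than" zero,
// so with the current naming I would expect this assertion to hold instead:
assertFalse(OperatorUtils.hasMoreOperatorsThan(filter, GroupByOperator.class, 0));
```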
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
+ */
+ public static <T> boolean hasMoreOperatorsThan(Operator<?> start, Class<T>
opClazz, int limit) {
+ int count = limit;
+ if (count <= 0) {
Review Comment:
According to this, the method will always return true when the limit is zero (or negative), no matter what the other inputs are. Is this intended? It basically claims that there are more than zero operators of the specified class in the graph without even checking the graph, which doesn't seem accurate.
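Just to illustrate what I mean, an untested sketch that only answers after actually walking the graph could look roughly like this, keeping the stop-at-`ReduceSinkOperator` behaviour of the current implementation and using strict "more than" semantics (the usual `java.util` collection imports are assumed):
```java
public static boolean hasMoreOperatorsThan(Operator<?> start, Class<? extends Operator<?>> opClazz, int limit) {
  int count = 0;
  Set<Operator<?>> visited = new HashSet<>();
  Deque<Operator<?>> queue = new ArrayDeque<>();
  queue.add(start);
  while (!queue.isEmpty()) {
    Operator<?> op = queue.poll();
    if (!visited.add(op) || op instanceof ReduceSinkOperator) {
      continue;  // already seen, or reached a reduce sink boundary
    }
    if (opClazz.isInstance(op) && ++count > limit) {
      return true;  // strictly more than 'limit' matching operators found
    }
    if (op.getParentOperators() != null) {
      queue.addAll(op.getParentOperators());
    }
  }
  return false;
}
```
With this shape, a zero (or negative) limit simply means "return true as soon as a single match is found", which I think is closer to what the name promises.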
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/TestOperatorUtils.java:
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.plan.CommonMergeJoinDesc;
+import org.apache.hadoop.hive.ql.plan.FilterDesc;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;
+import org.apache.hadoop.hive.ql.plan.LimitDesc;
+import org.apache.hadoop.hive.ql.plan.ReduceSinkDesc;
+import org.apache.hadoop.hive.ql.plan.SelectDesc;
+import org.apache.hadoop.hive.ql.plan.TableScanDesc;
+import org.junit.jupiter.api.Test;
+
+import static java.util.Arrays.asList;
+import static java.util.Collections.singletonList;
+import static org.junit.jupiter.api.Assertions.*;
+
+class TestOperatorUtils {
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenLimitIs0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 0));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsFalseWhenNoGBYInBranchAndLimitIsMoreThan0() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsFalseWhenNumberOfGBYIsLessThanLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> limit = OperatorFactory.get(context, LimitDesc.class);
+ filter.setParentOperators(singletonList(limit));
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ limit.setParentOperators(singletonList(select));
+ Operator<?> gby = OperatorFactory.get(context, GroupByDesc.class);
+ select.setParentOperators(singletonList(gby));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby.setParentOperators(singletonList(rs));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(filter,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void testHasMoreGBYsReturnsTrueWhenNumberOfGBYIsEqualsWithLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ gby1.setParentOperators(singletonList(select));
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ select.setParentOperators(singletonList(filter));
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby2));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(gby1, GroupByOperator.class,
2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsFalseWhenNumberOfGBYIsEqualsWithLimitButHasAnRSInTheMiddle()
{
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ gby1.setParentOperators(singletonList(select));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs));
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ rs.setParentOperators(singletonList(gby2));
+ Operator<?> ts = OperatorFactory.get(context, TableScanDesc.class);
+ gby2.setParentOperators(singletonList(ts));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(gby1,
GroupByOperator.class, 2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsTrueWhenBranchHasJoinAndNumberOfGBYIsEqualsWithLimit() {
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ Operator<?> join = OperatorFactory.get(context, CommonMergeJoinDesc.class);
+ gby1.setParentOperators(singletonList(join));
+
+ // Branch #1 has the second GBY
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby2));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs));
+
+ // Branch #2
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ Operator<?> rs2 = OperatorFactory.get(context, ReduceSinkDesc.class);
+ select.setParentOperators(singletonList(rs2));
+
+ join.setParentOperators(asList(filter, select));
+
+ assertTrue(OperatorUtils.hasMoreOperatorsThan(gby1, GroupByOperator.class,
2));
+ }
+
+ @Test
+ void
testHasMoreGBYsReturnsFalseWhenBranchHasJoinAndBothJoinBranchesHasLessGBYThanLimit()
{
+ CompilationOpContext context = new CompilationOpContext();
+ Operator<?> join = OperatorFactory.get(context, CommonMergeJoinDesc.class);
+
+ // Branch #1 has the second GBY
+ Operator<?> filter = OperatorFactory.get(context, FilterDesc.class);
+ Operator<?> gby1 = OperatorFactory.get(context, GroupByDesc.class);
+ filter.setParentOperators(singletonList(gby1));
+ Operator<?> rs = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby1.setParentOperators(singletonList(rs));
+
+ // Branch #2
+ Operator<?> select = OperatorFactory.get(context, SelectDesc.class);
+ Operator<?> gby2 = OperatorFactory.get(context, GroupByDesc.class);
+ select.setParentOperators(singletonList(gby2));
+ Operator<?> rs2 = OperatorFactory.get(context, ReduceSinkDesc.class);
+ gby2.setParentOperators(singletonList(rs2));
+
+ join.setParentOperators(asList(filter, select));
+
+ assertFalse(OperatorUtils.hasMoreOperatorsThan(join,
GroupByOperator.class, 2));
Review Comment:
This result is also a bit surprising.
```
RS - GBY2 - FIL - \
JOIN
RS - GBY1 - FIL - /
```
The JOIN tree has two GBY operators, so why should this method return false?
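To make the expectation explicit: counting over the whole upstream graph of the JOIN gives two GBY operators, so under the "count equals limit returns true" semantics used elsewhere in this test class I would have expected something like:
```java
// Two GBY operators in total upstream of the join (one per branch); with the
// "count == limit => true" semantics used in the other tests, this should arguably hold:
assertTrue(OperatorUtils.hasMoreOperatorsThan(join, GroupByOperator.class, 2));
```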
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
Review Comment:
typo: of -> if
##########
ql/src/test/queries/clientpositive/auto_sortmerge_join_17.q:
##########
@@ -0,0 +1,20 @@
+CREATE TABLE tbl1_n5(key int, value string) CLUSTERED BY (key) SORTED BY (key)
INTO 2 BUCKETS;
+
+insert into tbl1_n5(key, value)
+values
+(0, 'val_0'),
+(2, 'val_2'),
+(9, 'val_9');
+
+explain
+SELECT t1.key from
+(SELECT key , row_number() over(partition by key order by value desc) as rk
from tbl1_n5) t1
+join
+( SELECT key,count(distinct value) as cp_count from tbl1_n5 group by key) t2
+on t1.key = t2.key where rk = 1;
+
+SELECT t1.key from
+(SELECT key , row_number() over(partition by key order by value desc) as rk
from tbl1_n5) t1
Review Comment:
As mentioned in the JIRA ticket, it's worth clarifying whether the PTF is necessary to trigger the problem. If not, we should minimize the repro.
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java:
##########
@@ -176,6 +176,37 @@ public static <T> Set<T>
findOperatorsUpstreamJoinAccounted(Operator<?> start, C
return found;
}
+ /**
+ * Check whether there are more operators in the specified operator tree
branch than the given limit
+ * @param start root of the operator tree to check
+ * @param opClazz type of operator to track
+ * @param limit maximum allowed number of operator in a branch of the tree
+ * @return true of limit is exceeded false otherwise
+ * @param <T> type of operator to track
Review Comment:
The type variable `T` seems redundant; the method does not really need to be generic.
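For example, the signature could simply be the following (just a suggestion; the bounded wildcard is optional):
```java
// Non-generic alternative; Class<? extends Operator<?>> additionally restricts callers to operator classes.
public static boolean hasMoreOperatorsThan(Operator<?> start, Class<? extends Operator<?>> opClazz, int limit)
```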
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]