Re: [PR] HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge… [hive]

via GitHub Tue, 28 Apr 2026 04:03:36 -0700


thomasrebele commented on code in PR #6443:
URL: https://github.com/apache/hive/pull/6443#discussion_r3153232111



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -102,38 +103,38 @@ private boolean shouldRewrite(ASTNode tree) {
     return rwt;
   }
 
+  private record StatsEligibleColumns(List<String> columnNames, List<String> 
columnTypes) {

Review Comment:
   Instead of creating a new type, could you please use `List<FieldSchema>`, 
which contains both the name and the type of the column?



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -102,38 +103,38 @@ private boolean shouldRewrite(ASTNode tree) {
     return rwt;
   }
 
+  private record StatsEligibleColumns(List<String> columnNames, List<String> 
columnTypes) {
+  }
+
   /**
-   * Get the names of the columns that support column statistics.
+   * Get the names and types of the columns that support column statistics.
    */
-  private static List<String> getColumnNamesSupportingStats(Table tbl) {
+  private static StatsEligibleColumns getStatsEligibleColumns(Table tbl) {
     List<String> colNames = new ArrayList<>();
+    List<String> colTypes = new ArrayList<>();
     for (FieldSchema col : tbl.getCols()) {
       String type = col.getType();
       TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(type);
       boolean isSupported = 
ColumnStatsAutoGatherContext.isColumnSupported(typeInfo.getCategory(), () -> 
typeInfo);
       if (isSupported) {
         colNames.add(col.getName());
+        colTypes.add(col.getType());
       }
     }
-    return colNames;
+    return new StatsEligibleColumns(colNames, colTypes);
   }
 
   private List<String> getColumnName(ASTNode tree) throws SemanticException {

Review Comment:
   I would suggest to rename the function, maybe "getExplictColumnNames", 
though there may be a better name.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -217,27 +218,32 @@ private static String getColTypeOf(Table tbl, String 
partKey) {
     throw new RuntimeException("Unknown partition key : " + partKey);
   }
 
-  protected static List<String> getColumnTypes(Table tbl, List<String> 
colNames) {
+  protected static List<String> getColumnTypesByName(Table tbl, List<String> 
colNames) {
     List<String> colTypes = new ArrayList<>();
     List<FieldSchema> cols = tbl.getCols();
-    List<String> copyColNames = new ArrayList<>(colNames);
-
-    for (String colName : copyColNames) {
-      for (FieldSchema col : cols) {
-        if (colName.equalsIgnoreCase(col.getName())) {
-          String type = col.getType();
-          TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(type);
-          boolean isSupported = 
ColumnStatsAutoGatherContext.isColumnSupported(typeInfo.getCategory(), () -> 
typeInfo);
-          if (!isSupported) {
-            logTypeWarning(colName, type);
-            colNames.remove(colName);
-          } else {
-            colTypes.add(type);
-          }
+    Map<String, String> colTypeMap = new HashMap<>();
+
+    for (FieldSchema col : cols) {
+      colTypeMap.put(col.getName().toLowerCase(), col.getType());
+    }
+
+    List<String> primColNames = new ArrayList<>();
+    for (String colName : colNames) {
+      String type = colTypeMap.get(colName.toLowerCase());
+      if (type != null) {
+        TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(type);
+        boolean isSupported = 
ColumnStatsAutoGatherContext.isColumnSupported(typeInfo.getCategory(), () -> 
typeInfo);
+        if (!isSupported) {
+          logTypeWarning(colName, type);
+        } else {
+          primColNames.add(colName);
+          colTypes.add(type);
         }
       }
     }
 
+    colNames.clear();
+    colNames.addAll(primColNames);

Review Comment:
   Modifying the argument can be avoided when implementing my other comments.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -217,27 +218,32 @@ private static String getColTypeOf(Table tbl, String 
partKey) {
     throw new RuntimeException("Unknown partition key : " + partKey);
   }
 
-  protected static List<String> getColumnTypes(Table tbl, List<String> 
colNames) {
+  protected static List<String> getColumnTypesByName(Table tbl, List<String> 
colNames) {

Review Comment:
   I recommend to refactor getColumnTypesByName to return `List<FieldSchema>`.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -102,38 +103,38 @@ private boolean shouldRewrite(ASTNode tree) {
     return rwt;
   }
 
+  private record StatsEligibleColumns(List<String> columnNames, List<String> 
columnTypes) {
+  }
+
   /**
-   * Get the names of the columns that support column statistics.
+   * Get the names and types of the columns that support column statistics.
    */
-  private static List<String> getColumnNamesSupportingStats(Table tbl) {
+  private static StatsEligibleColumns getStatsEligibleColumns(Table tbl) {
     List<String> colNames = new ArrayList<>();
+    List<String> colTypes = new ArrayList<>();
     for (FieldSchema col : tbl.getCols()) {
       String type = col.getType();
       TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(type);
       boolean isSupported = 
ColumnStatsAutoGatherContext.isColumnSupported(typeInfo.getCategory(), () -> 
typeInfo);
       if (isSupported) {
         colNames.add(col.getName());
+        colTypes.add(col.getType());
       }
     }
-    return colNames;
+    return new StatsEligibleColumns(colNames, colTypes);
   }
 
   private List<String> getColumnName(ASTNode tree) throws SemanticException {
-
-    switch (tree.getChildCount()) {
-    case 2:
-      return getColumnNamesSupportingStats(tbl);
-    case 3:
-      int numCols = tree.getChild(2).getChildCount();
-      List<String> colName = new ArrayList<>(numCols);
-      for (int i = 0; i < numCols; i++) {
-        colName.add(getUnescapedName((ASTNode) tree.getChild(2).getChild(i)));
-      }
-      return colName;
-    default:
+    if (tree.getChildCount() != 3) {
       throw new SemanticException("Internal error. Expected number of children 
of ASTNode to be"
           + " either 2 or 3. Found : " + tree.getChildCount());

Review Comment:
   If we modify the method that way, the expected number of children is 3, so 
the exception message would need to be changed.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java:
##########
@@ -634,7 +640,13 @@ public void analyze(ASTNode ast, Context origCtx) throws 
SemanticException {
      */
     if (shouldRewrite(ast)) {
       tbl = AnalyzeCommandUtils.getTable(ast, this);
-      colNames = getColumnName(ast);
+      StatsEligibleColumns statsCols = null;
+      if (ast.getChildCount() == 2) {
+        statsCols = getStatsEligibleColumns(tbl);
+        colNames = statsCols.columnNames();
+      } else {
+        colNames = getColumnName(ast);
+      }

Review Comment:
   The handling of the AST should stay at once place to avoid code duplication 
here and in #rewriteAST. Maybe a new method `List<FieldSchema> 
getColumns(ASTNode)`. To keep the behavior the same, I would do roughly the 
following:
   
   1. Collect the column names as string using the original method
   2. Verify the names with checkForPartitionColumns and 
validateSpecifiedColumnNames (and removing the calls to these functions in 
ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
   3. Collect the columns as `List<FieldSchema>`
   4. The caller extracts the names (with 
`org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema`) and 
the types (I don't know of an existing function, at least I couldn't find one 
in Utilities).
   
   This approach avoids the need to modify the column names later, and should 
make the code easier to understand. It would be nice (if that optimization does 
not make the code too complex) to optimize the case `ast.getChildCount() == 2`, 
so that step 1 and 3 only collect the columns once.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge… [hive]

Reply via email to