Re: [PR] HIVE-27794: Iceberg: Implement Copy-On-Write for Merge queries [hive]

via GitHub Tue, 21 Nov 2023 00:21:44 -0800


kasakrisz commented on code in PR #4852:
URL: https://github.com/apache/hive/pull/4852#discussion_r1400180944



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/rewrite/MergeRewriter.java:
##########
@@ -217,13 +218,17 @@ public void appendWhenMatchedUpdateClause(MergeStatement.UpdateClause updateClau
     }
 
     protected void addValues(Table targetTable, String targetAlias, Map<String, String> newValues,
-                             List<String> values) {
+                             List<String> values, boolean aliasRhsExpr) {
       for (FieldSchema fieldSchema : targetTable.getCols()) {
+        String value = String.format("%s.%s", targetAlias, HiveUtils.unparseIdentifier(fieldSchema.getName(), conf));
         if (newValues.containsKey(fieldSchema.getName())) {
-          values.add(newValues.get(fieldSchema.getName()));
+          String rhsExp = newValues.get(fieldSchema.getName());
+          if (aliasRhsExpr){
+            rhsExp += String.format(" AS %s", value);
+          }
+          values.add(rhsExp);

Review Comment:
   How about 
   ```
   // MergeRewriter
   protected void addValues(Table targetTable, String targetAlias, Map<String, String> newValues,
                            List<String> values) {
     for (FieldSchema fieldSchema : targetTable.getCols()) {
       String quotedColumnName = String.format("%s.%s", targetAlias,
           HiveUtils.unparseIdentifier(fieldSchema.getName(), conf));
       if (newValues.containsKey(fieldSchema.getName())) {
         values.add(getValue(newValues.get(fieldSchema.getName()), quotedColumnName));
       } else {
         values.add(quotedColumnName);
       }
     }
   }

   // MergeRewriter: by default the new value is used as-is
   protected String getValue(String newValue, String alias) {
     return newValue;
   }
   ```
   ```
   // CopyOnWriteMergeRewriter: append the alias to the new value
   @Override
   protected String getValue(String newValue, String alias) {
     return String.format("%s AS %s", newValue, alias);
   }
   ```
   This way each piece of code lives where it is supposed to be, and the logic 
in `MergeRewriter.addValues` is simpler: the `if (aliasRhsExpr){` branch and one 
method parameter are removed.
   
   OR
   
   If `it won't harm if we add alias in existing implementations`, then why 
not always add it? That would also mean the boolean parameter `aliasRhsExpr` 
can be removed.
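   
   To make the first option concrete, here is a minimal, self-contained sketch 
of the overridable hook. It is only an illustration of the pattern: the class 
names `BaseRewriter` and `CopyOnWriteRewriter`, the plain `List<String>` of 
column names and the hard-coded backtick quoting are assumptions standing in 
for `Table`, `FieldSchema` and `HiveUtils.unparseIdentifier`, not the actual 
Hive code.
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   class ValueHookSketch {

     // Hypothetical stand-in for MergeRewriter.
     static class BaseRewriter {
       List<String> addValues(List<String> columns, String targetAlias, Map<String, String> newValues) {
         List<String> values = new ArrayList<>();
         for (String column : columns) {
           // Assumption: backtick quoting replaces HiveUtils.unparseIdentifier.
           String quotedColumnName = String.format("%s.`%s`", targetAlias, column);
           if (newValues.containsKey(column)) {
             // The subclass decides how the new value is rendered.
             values.add(getValue(newValues.get(column), quotedColumnName));
           } else {
             values.add(quotedColumnName);
           }
         }
         return values;
       }

       // Default: use the new value unchanged.
       protected String getValue(String newValue, String alias) {
         return newValue;
       }
     }

     // Hypothetical stand-in for CopyOnWriteMergeRewriter.
     static class CopyOnWriteRewriter extends BaseRewriter {
       @Override
       protected String getValue(String newValue, String alias) {
         return String.format("%s AS %s", newValue, alias);
       }
     }
   }
   ```
   With this shape `addValues` needs neither the boolean flag nor the branch; 
only the subclass knows that an alias is appended.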
   



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/rewrite/CopyOnWriteUpdateRewriter.java:
##########
@@ -45,40 +45,51 @@ public class CopyOnWriteUpdateRewriter implements Rewriter<UpdateStatement> {
   private final SetClausePatcher setClausePatcher;
 
 
-  public CopyOnWriteUpdateRewriter(HiveConf conf, SqlGeneratorFactory sqlGeneratorFactory,
-                                   COWWithClauseBuilder cowWithClauseBuilder, SetClausePatcher setClausePatcher) {
+  public CopyOnWriteUpdateRewriter(HiveConf conf, SqlGeneratorFactory sqlGeneratorFactory) {
     this.conf = conf;
     this.sqlGeneratorFactory = sqlGeneratorFactory;
-    this.cowWithClauseBuilder = cowWithClauseBuilder;
-    this.setClausePatcher = setClausePatcher;
+    this.cowWithClauseBuilder = new COWWithClauseBuilder();
+    this.setClausePatcher = new SetClausePatcher();
   }
 
   @Override
   public ParseUtils.ReparseResult rewrite(Context context, UpdateStatement updateBlock)
       throws SemanticException {
 
-    Tree wherePredicateNode = updateBlock.getWhereTree().getChild(0);
-    String whereClause = context.getTokenRewriteStream().toString(
-        wherePredicateNode.getTokenStartIndex(), wherePredicateNode.getTokenStopIndex());
     String filePathCol = HiveUtils.unparseIdentifier(VirtualColumn.FILE_PATH.getName(), conf);
-
     MultiInsertSqlGenerator sqlGenerator = sqlGeneratorFactory.createSqlGenerator();
 
-    cowWithClauseBuilder.appendWith(sqlGenerator, filePathCol, whereClause);
-
-    sqlGenerator.append("insert into table ");
+    String whereClause = null;
+    int columnOffset = 0;
+    
+    boolean shouldOverwrite = updateBlock.getWhereTree() == null;

Review Comment:
   1. Sorry, but I don't see a strong connection between implementing COW merge 
and enabling COW support for v1 and v2. Enabling COW support for v1 seems to me 
to be a third patch. :) I would prefer to split this big piece of work into 
separate patches.
   2. Dispatching to the implementations can be moved to 
`CopyOnWriteUpdateRewriter.rewrite`:
   ```
   if (updateBlock.getWhereTree() == null) {
     return new CopyOnInsertOverWriteUpdateRewriter(...).rewrite(...);
   } else {
     return new CopyOnInsertUpdateRewriter(...).rewrite(...);
   }
   ```
   And `CopyOnInsertUpdateRewriter` can extend 
`CopyOnInsertOverWriteUpdateRewriter` or vice versa to reuse the common parts.
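   
   A rough sketch of what that dispatch plus inheritance could look like is 
below. Everything in it is hypothetical: the class names `OverwriteRewriter` 
and `InsertRewriter`, the string-based `rewrite` signature and the SQL 
fragments are placeholders chosen for illustration, not the statements Hive 
actually generates.
   ```java
   class UpdateDispatchSketch {

     // Hypothetical stand-in for CopyOnInsertOverWriteUpdateRewriter (no WHERE clause).
     static class OverwriteRewriter {
       // Common part shared by both variants.
       String rewrite(String table, String selectList, String whereClause) {
         return header(table) + " SELECT " + selectList + " FROM " + table + filter(whereClause);
       }

       protected String header(String table) {
         return "insert overwrite table " + table;
       }

       protected String filter(String whereClause) {
         return "";
       }
     }

     // Hypothetical stand-in for CopyOnInsertUpdateRewriter: overrides only what differs.
     static class InsertRewriter extends OverwriteRewriter {
       @Override
       protected String header(String table) {
         return "insert into table " + table;
       }

       @Override
       protected String filter(String whereClause) {
         return " WHERE " + whereClause;
       }
     }

     // Dispatch on the presence of a WHERE clause, as the entry point would.
     static String rewrite(String table, String selectList, String whereClause) {
       OverwriteRewriter delegate = whereClause == null ? new OverwriteRewriter() : new InsertRewriter();
       return delegate.rewrite(table, selectList, whereClause);
     }
   }
   ```
   The point is only that the null check happens once, at the entry point, and 
each variant stays free of `shouldOverwrite`-style flags.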



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/rewrite/CopyOnWriteMergeRewriter.java:
##########
@@ -0,0 +1,238 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.parse.rewrite;
+
+import com.google.common.base.Strings;
+import com.google.common.collect.Lists;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.Context;
+import org.apache.hadoop.hive.ql.metadata.Hive;
+import org.apache.hadoop.hive.ql.metadata.HiveUtils;
+import org.apache.hadoop.hive.ql.metadata.Table;
+import org.apache.hadoop.hive.ql.metadata.VirtualColumn;
+import org.apache.hadoop.hive.ql.parse.ParseUtils;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.parse.rewrite.sql.COWWithClauseBuilder;
+import org.apache.hadoop.hive.ql.parse.rewrite.sql.MultiInsertSqlGenerator;
+import org.apache.hadoop.hive.ql.parse.rewrite.sql.SqlGeneratorFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Optional;
+import java.util.function.UnaryOperator;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.hadoop.hive.ql.parse.rewrite.sql.SqlGeneratorFactory.TARGET_PREFIX;
+
+public class CopyOnWriteMergeRewriter extends MergeRewriter {
+
+  public CopyOnWriteMergeRewriter(Hive db, HiveConf conf, SqlGeneratorFactory sqlGeneratorFactory) {
+    super(db, conf, sqlGeneratorFactory);
+  }
+
+  @Override
+  public ParseUtils.ReparseResult rewrite(Context ctx, MergeStatement mergeStatement) throws SemanticException {
+
+    setOperation(ctx);
+    MultiInsertSqlGenerator sqlGenerator = sqlGeneratorFactory.createSqlGenerator();
+    handleSource(mergeStatement, sqlGenerator);
+
+    sqlGenerator.append('\n');
+    sqlGenerator.append("INSERT INTO ").appendTargetTableName();
+    sqlGenerator.append('\n');
+    
+    List<MergeStatement.WhenClause> whenClauses = Lists.newArrayList(mergeStatement.getWhenClauses());
+    
+    Optional<String> extraPredicate = whenClauses.stream()
+      .filter(whenClause -> !(whenClause instanceof MergeStatement.InsertClause))
+      .map(MergeStatement.WhenClause::getExtraPredicate)
+      .map(Strings::nullToEmpty)
+      .reduce((p1, p2) -> isNotBlank(p2) ? p1 + " OR " + p2 : p2);
+
+    whenClauses.removeIf(whenClause -> whenClause instanceof MergeStatement.DeleteClause);
+    extraPredicate.ifPresent(p -> whenClauses.add(new MergeStatement.DeleteClause(p, null)));
+
+    MergeStatement.MergeSqlGenerator mergeSqlGenerator = createMergeSqlGenerator(mergeStatement, sqlGenerator);
+
+    for (MergeStatement.WhenClause whenClause : whenClauses) {
+      whenClause.toSql(mergeSqlGenerator);
+    }
+    
+    // TODO: handleCardinalityViolation;
+    
+    ParseUtils.ReparseResult rr = ParseUtils.parseRewrittenQuery(ctx, sqlGenerator.toString());
+    Context rewrittenCtx = rr.rewrittenCtx;
+    setOperation(rewrittenCtx);
+
+    //set dest name mapping on new context; 1st child is TOK_FROM
+    rewrittenCtx.addDestNamePrefix(1, Context.DestClausePrefix.MERGE);
+    return rr;
+  }
+
+  @Override
+  protected CopyOnWriteMergeWhenClauseSqlGenerator createMergeSqlGenerator(
+      MergeStatement mergeStatement, MultiInsertSqlGenerator sqlGenerator) {
+    return new CopyOnWriteMergeWhenClauseSqlGenerator(conf, sqlGenerator, mergeStatement);
+  }
+  
+  private void handleSource(MergeStatement mergeStatement, MultiInsertSqlGenerator sqlGenerator) {
+    boolean hasWhenNotMatchedInsertClause = mergeStatement.hasWhenNotMatchedInsertClause();
+    
+    String sourceName = mergeStatement.getSourceName();
+    String sourceAlias = mergeStatement.getSourceAlias();
+    
+    String targetAlias = mergeStatement.getTargetAlias();
+    String onClauseAsString = replaceColumnRefsWithTargetPrefix(targetAlias, mergeStatement.getOnClauseAsText());
+
+    sqlGenerator.newCteExpr();
+    
+    sqlGenerator.append(sourceName + " AS ( SELECT * FROM\n");
+    sqlGenerator.append("(SELECT ");
+    sqlGenerator.appendAcidSelectColumns(Context.Operation.MERGE);
+    sqlGenerator.appendAllColsOfTargetTable(TARGET_PREFIX);
+    sqlGenerator.append(" FROM ").appendTargetTableName().append(") ");
+    sqlGenerator.append(targetAlias);
+    sqlGenerator.append('\n');
+    sqlGenerator.indent().append(hasWhenNotMatchedInsertClause ? "FULL OUTER JOIN" : "LEFT OUTER JOIN").append("\n");
+    sqlGenerator.indent().append(sourceAlias);
+    sqlGenerator.append('\n');
+    sqlGenerator.indent().append("ON ").append(onClauseAsString);
+    sqlGenerator.append('\n');
+    sqlGenerator.append(")");
+    
+    sqlGenerator.addCteExpr();
+  }
+
+  private static String replaceColumnRefsWithTargetPrefix(String columnRef, String strValue) {
+    return strValue.replaceAll(columnRef + "\\.(`?)", "$1" + TARGET_PREFIX);
+  }
+
+  static class CopyOnWriteMergeWhenClauseSqlGenerator extends MergeRewriter.MergeWhenClauseSqlGenerator {
+
+    private final COWWithClauseBuilder cowWithClauseBuilder;
+
+    CopyOnWriteMergeWhenClauseSqlGenerator(
+      HiveConf conf, MultiInsertSqlGenerator sqlGenerator, MergeStatement mergeStatement) {
+      super(conf, sqlGenerator, mergeStatement);
+      this.cowWithClauseBuilder = new COWWithClauseBuilder();
+    }
+
+    @Override
+    public void appendWhenNotMatchedInsertClause(MergeStatement.InsertClause insertClause) {
+      String targetAlias = mergeStatement.getTargetAlias();
+      
+      if (mergeStatement.getWhenClauses().size() > 1) {
+        sqlGenerator.append("union all\n");
+      }
+      sqlGenerator.append("    -- insert clause\n").append("SELECT ");
+      
+      if (isNotBlank(hintStr)) {
+        sqlGenerator.append(hintStr);
+        hintStr = null;
+      }
+      List<String> values = sqlGenerator.getDeleteValues(Context.Operation.MERGE);
+      values.add(insertClause.getValuesClause());
+      
+      sqlGenerator.append(StringUtils.join(values, ","));
+      sqlGenerator.append("\nFROM " + mergeStatement.getSourceName());
+      sqlGenerator.append("\n   WHERE ");
+      
+      StringBuilder whereClause = new StringBuilder(insertClause.getPredicate());
+      
+      if (insertClause.getExtraPredicate() != null) {
+        //we have WHEN NOT MATCHED AND <boolean expr> THEN INSERT
+        whereClause.append(" AND ").append(insertClause.getExtraPredicate());
+      }
+      sqlGenerator.append(
+          replaceColumnRefsWithTargetPrefix(targetAlias, whereClause.toString()));
+      sqlGenerator.append('\n');
+    }
+
+    @Override
+    public void appendWhenMatchedUpdateClause(MergeStatement.UpdateClause updateClause) {
+      Table targetTable = mergeStatement.getTargetTable();
+      String targetAlias = mergeStatement.getTargetAlias();
+      String onClauseAsString = mergeStatement.getOnClauseAsText();
+
+      UnaryOperator<String> columnRefsFunc = value -> replaceColumnRefsWithTargetPrefix(targetAlias, value);

Review Comment:
   How about
   ```
   private UnaryOperator<String> columnRefReplacer(String targetAlias) {
     return value -> replaceColumnRefsWithTargetPrefix(targetAlias, value);
   }
   ```
   I don't have a strong opinion about this; I'll leave it to you.
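   
   For reference, a small standalone sketch of the factory-method variant and 
how it would be used. The regex is copied from 
`replaceColumnRefsWithTargetPrefix` in the patch; the value of `TARGET_PREFIX` 
here (`tgt__`) and the wrapper class name are assumptions made only so the 
sketch runs on its own.
   ```java
   import java.util.function.UnaryOperator;

   class ColumnRefReplacerSketch {

     // Assumption: placeholder value, the real constant lives in SqlGeneratorFactory.
     static final String TARGET_PREFIX = "tgt__";

     // Same regex as in the patch: "alias." (optionally followed by a backtick)
     // is replaced by the optional backtick plus the target prefix.
     static String replaceColumnRefsWithTargetPrefix(String columnRef, String strValue) {
       return strValue.replaceAll(columnRef + "\\.(`?)", "$1" + TARGET_PREFIX);
     }

     // Factory method: binds the alias once and returns a reusable function.
     static UnaryOperator<String> columnRefReplacer(String targetAlias) {
       return value -> replaceColumnRefsWithTargetPrefix(targetAlias, value);
     }
   }
   ```
   A caller would then write 
`UnaryOperator<String> columnRefsFunc = columnRefReplacer(targetAlias);` and 
apply it to the ON clause and the predicates.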



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


