Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783725463 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 03c759528edb8ce1e0997b5b98921fa6343f1f22 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20535) * 69bed1eb6e45fda4b2380ae16f78004968570781 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783725481 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783724053 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783724077 ## CI report: * cef842213bef83d99e50a8babc2224c9904b22eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20536) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6495][RFC-66] Non-blocking Concurrency Control [hudi]
peanut-chenzhong commented on PR #7907: URL: https://github.com/apache/hudi/pull/7907#issuecomment-1783720492 For this feature, how do we handle failed writing commits? Will they be rolled back by other writing tasks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179496 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static
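The field-selection rule in the quoted `mergeCompleteOrPartialRecords` diff is easier to follow in isolation. Below is a minimal, self-contained sketch of that rule only — plain Java collections stand in for Spark's `InternalRow`/`StructField` types and all names are illustrative, not the actual Hudi implementation: for every field of the merged schema, take the value from the newer (partial) record when it carries that field, otherwise keep the older record's value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: field-by-field partial merge with plain collections.
public class PartialMergeSketch {

  static List<Object> mergeValues(Map<Integer, String> mergedIdToFieldName,
                                  List<Object> oldValues,
                                  List<Object> newPartialValues,
                                  Map<String, Integer> newPartialNameToId) {
    List<Object> merged = new ArrayList<>(mergedIdToFieldName.size());
    for (int fieldId = 0; fieldId < mergedIdToFieldName.size(); fieldId++) {
      String fieldName = mergedIdToFieldName.get(fieldId);
      Integer ordInPartialUpdate = newPartialNameToId.get(fieldName);
      if (ordInPartialUpdate != null) {
        merged.add(newPartialValues.get(ordInPartialUpdate)); // pick from new (partial) record
      } else {
        merged.add(oldValues.get(fieldId));                   // pick from old record
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<Integer, String> mergedSchema = new LinkedHashMap<>();
    mergedSchema.put(0, "id");
    mergedSchema.put(1, "name");
    mergedSchema.put(2, "price");

    Map<String, Integer> partialFields = new LinkedHashMap<>();
    partialFields.put("id", 0);
    partialFields.put("price", 1);

    // Older record has all fields; the partial update carries only id and price.
    System.out.println(mergeValues(
        mergedSchema,
        List.of(1, "widget", 9.99),
        List.of(1, 12.49),
        partialFields)); // [1, widget, 12.49]
  }
}
```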
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783714779 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179372 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179246 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodiePositionBasedFileGroupRecordBuffer.java: ## @@ -84,6 +79,10 @@ public void processDataBlock(HoodieDataBlock dataBlock, Option keySpecO isFullKey = keySpecOpt.get().isFullKey(); } +if (dataBlock.containsPartialUpdates()) { + enablePartialMerging = true; +} Review Comment: The buffer consumes multiple log blocks in the file group. When there are partial updates in one log block, we switch the merging mode and do not switch back going forward to guarantee correct merging results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
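For context, the one-way switch described in this reply can be summarized in a few lines. The sketch below is a self-contained illustration only, with a stand-in `LogBlock` interface instead of Hudi's `HoodieDataBlock` and a simplified buffer class.

```java
// Minimal, self-contained illustration of the one-way merging-mode switch.
interface LogBlock {
  boolean containsPartialUpdates();
}

class RecordBufferSketch {
  private boolean enablePartialMerging = false;

  void processDataBlock(LogBlock block) {
    if (block.containsPartialUpdates()) {
      // Once any block in the file group carries partial updates, stay in
      // partial-merging mode for every later block; never switch back.
      enablePartialMerging = true;
    }
    // ... merge the block's records using the current mode ...
  }

  boolean isPartialMergingEnabled() {
    return enablePartialMerging;
  }
}
```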
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179085 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +48,9 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + default Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +throw new HoodieException("Partial merging logic is not implemented."); Review Comment: Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178992 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java: ## @@ -440,7 +443,9 @@ private static void validateRow(InternalRow data, StructType schema) { // corresponding schema has to be provided as well so that it could be properly // serialized (in case it would need to be) boolean isValid = data == null || data instanceof UnsafeRow -|| schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data)); +|| schema != null && (data instanceof HoodieInternalRow +|| data instanceof GenericInternalRow Review Comment: The line is too long. I've revised the indentation to make it more clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783713220 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) * 00c91e2e52ef18c5880de00c450ad059090efc7d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178539 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java: ## @@ -76,6 +77,45 @@ public Option> merge(HoodieRecord older, Schema oldSc } } + @Override + public Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +ValidationUtils.checkArgument(older.getRecordType() == HoodieRecordType.SPARK); +ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecordType.SPARK); + +if (newer instanceof HoodieSparkRecord) { + HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer; + if (newSparkRecord.isDeleted()) { +// Delete record +return Option.empty(); + } +} else { + if (newer.getData() == null) { +// Delete record +return Option.empty(); + } +} + +if (older instanceof HoodieSparkRecord) { + HoodieSparkRecord oldSparkRecord = (HoodieSparkRecord) older; + if (oldSparkRecord.isDeleted()) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} else { + if (older.getData() == null) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} +if (older.getOrderingValue(oldSchema, props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) { + return Option.of(SparkRecordMergingUtils.mergeCompleteOrPartialRecords( + (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, oldSchema, readerSchema, props)); Review Comment: Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
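The argument-order fix above is about which record acts as the base and which supplies the overriding (possibly partial) fields. The following self-contained sketch is an assumption-laden simplification — a `Rec` record with a plain field map stands in for Hudi records — showing the intent: the record with the higher ordering value wins, so when the older record wins, the two arguments are swapped before merging.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the record with the higher ordering value supplies the
// winning fields; when the older record wins, the arguments are swapped.
final class PartialMergeDirectionSketch {

  record Rec(Map<String, Object> fields, long orderingValue) {}

  static Rec apply(Rec base, Rec winner) {
    Map<String, Object> out = new HashMap<>(base.fields());
    out.putAll(winner.fields()); // winner's (possibly partial) fields override the base
    return new Rec(out, winner.orderingValue());
  }

  static Rec partialMerge(Rec older, Rec newer) {
    return older.orderingValue() > newer.orderingValue()
        ? apply(newer, older)   // older has the higher ordering value and wins
        : apply(older, newer);  // usual case: newer wins
  }
}
```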
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178419 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param older Older {@link HoodieSparkRecord}. + * @param oldSchema Old schema. + * @param newer Newer {@link HoodieSparkRecord}. + * @param newSchema New schema. + * @param props Configuration in {@link TypedProperties}. + * @return The merged record. + */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + TypedProperties props) { +Pair, Pair> mappingSchemaPair = +getCachedMergedSchema(oldSchema, newSchema); +boolean isNewerPartial = isPartial(newSchema, mappingSchemaPair.getRight().getRight()); Review Comment: A new `partialMerge` API is created so that partial merging logic is not called if not necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
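To see why a separate `partialMerge` entry point avoids unnecessary work, here is a hedged sketch of how a caller might dispatch between the two APIs. The types are deliberately reduced to `Object` and the class names are hypothetical; the flag corresponds to the `enablePartialMerging` switch discussed elsewhere in this thread.

```java
// Illustrative dispatch only; real Hudi types are replaced with Object here.
interface MergerSketch {
  Object merge(Object older, Object newer);

  Object partialMerge(Object older, Object newer, Object readerSchema);
}

class MergeDispatchSketch {
  private final MergerSketch merger;
  private boolean enablePartialMerging = false; // flipped once a partial-update block is seen

  MergeDispatchSketch(MergerSketch merger) {
    this.merger = merger;
  }

  Object mergeRecords(Object older, Object newer, Object readerSchema) {
    // Complete records stay on the cheaper merge() path; partialMerge() is only
    // invoked once partial updates have actually been observed.
    return enablePartialMerging
        ? merger.partialMerge(older, newer, readerSchema)
        : merger.merge(older, newer);
  }

  void onPartialUpdatesSeen() {
    enablePartialMerging = true;
  }
}
```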
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178329 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param older Older {@link HoodieSparkRecord}. + * @param oldSchema Old schema. + * @param newer Newer {@link HoodieSparkRecord}. + * @param newSchema New schema. + * @param props Configuration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + TypedProperties props) { +Pair, Pair> mappingSchemaPair = +getCachedMergedSchema(oldSchema, newSchema); +boolean isNewerPartial = isPartial(newSchema, mappingSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mappingSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +if (newPartialNameToIdMapping.containsKey(structField.name())) { + // pick from new + int ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mappingSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mappingSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static Map getCachedFieldIdToFieldMapping(Schema avroSchema) { +return FIELD_ID_TO_FIELD_MAPPING_CACHE.computeIfAbsent(avroSchema, schema -> { + StructType structType = HoodieInternalRowUtil
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178224 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -51,6 +61,28 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader: PartitionedFile => requiredSchema: Schema, conf: Configuration): ClosableIterator[InternalRow] = { val fileInfo = sparkAdapter.getSparkPartitionedFileUtils.createPartitionedFile(partitionValues, filePath, start, length) -new CloseableInternalRowIterator(baseFileReader.apply(fileInfo)) +if (filePath.toString.contains(HoodieLogFile.DELTA_EXTENSION)) { Review Comment: `FsUtils.isLogFile` is fixed by considering `inlinefs://` scheme. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
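The actual fix is not shown in this comment. As a rough illustration of the idea — recognizing log files by their file name rather than by the raw path string, so wrapped schemes such as `inlinefs://` are still handled — the sketch below uses a deliberately simplified check that is not Hudi's real `FSUtils` pattern.

```java
// Simplified, illustrative check only; Hudi's actual log-file detection uses a
// stricter file-name pattern and dedicated handling for the inlinefs:// scheme.
final class LogFileCheckSketch {

  static boolean looksLikeLogFile(String filePath) {
    // Look at the last path segment instead of the whole URI string.
    String name = filePath.substring(filePath.lastIndexOf('/') + 1);
    return name.contains(".log.");
  }

  public static void main(String[] args) {
    System.out.println(looksLikeLogFile("hdfs://warehouse/tbl/.f1-0_20231027.log.1"));   // true
    System.out.println(looksLikeLogFile("hdfs://warehouse/tbl/base_20231027.parquet"));  // false
  }
}
```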
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
CTTY commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1783709069 Hi @boneanxs, I'm working on Spark 3.5 support and noticed 2 new failures after cherry-picking your commit. I tried porting these changes for Spark 3.5 but it didn't help. I'm still looking into the test failures but would appreciate it if you could help take a look as well! My PR: #9717 - Failure 1 ``` 2023-10-28T02:40:26.0783904Z - Test Create/Show/Drop Secondary Index *** FAILED *** 2023-10-28T02:40:26.0785667Z org.apache.spark.sql.AnalysisException: CreateIndex is not supported in this table default.h0. 2023-10-28T02:40:26.0787735Z at org.apache.spark.sql.errors.QueryCompilationErrors$.tableIndexNotSupportedError(QueryCompilationErrors.scala:3259) ``` - Failure 2 ``` T03:03:39.6074356Z - Test Create/Drop/Show/Refresh Index *** FAILED *** 2023-10-28T03:03:39.6077467Z java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.CreateIndex cannot be cast to org.apache.spark.sql.hudi.command.CreateIndexCommand 2023-10-28T03:03:39.6079990Z at org.apache.spark.sql.hudi.command.index.TestIndexSyntax.$anonfun$new$3(TestIndexSyntax.scala:67) 2023-10-28T03:03:39.6081642Z at scala.collection.immutable.List.foreach(List.scala:431) 2023-10-28T03:03:39.6143392Z at org.apache.spark.sql.hudi.command.index.TestIndexSyntax.$anonfun$new$2(TestIndexSyntax.scala:34) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783701937 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) * cef842213bef83d99e50a8babc2224c9904b22eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20536) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6999) Add row writer support to Deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6999: - Labels: pull-request-available (was: ) > Add row writer support to Deltastreamer > --- > > Key: HUDI-6999 > URL: https://issues.apache.org/jira/browse/HUDI-6999 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer > Reporter: sivabalan narayanan > Priority: Major > Labels: pull-request-available > > We have not yet leveraged row writer support in Deltastreamer. We can benefit > from perf improvement if we can integrate it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783700200 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) * cef842213bef83d99e50a8babc2224c9904b22eb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375161773 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public st
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375161288 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public st
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375160946 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodiePositionBasedFileGroupRecordBuffer.java: ## @@ -84,6 +79,10 @@ public void processDataBlock(HoodieDataBlock dataBlock, Option keySpecO isFullKey = keySpecOpt.get().isFullKey(); } +if (dataBlock.containsPartialUpdates()) { + enablePartialMerging = true; +} Review Comment: Is there any chance this flag switches back to false? Can the buffer consume multiple log blocks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6999) Add row writer support to Deltastreamer
sivabalan narayanan created HUDI-6999: - Summary: Add row writer support to Deltastreamer Key: HUDI-6999 URL: https://issues.apache.org/jira/browse/HUDI-6999 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: sivabalan narayanan We have not yet leveraged row writer support in Deltastreamer. We can benefit from perf improvement if we can integrate it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375160672 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +48,9 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + default Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +throw new HoodieException("Partial merging logic is not implemented."); Review Comment: Throw `UnsupportedOperationException` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
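A sketch of the suggestion: use the standard JDK exception for an unimplemented default method. The interface below is a simplified stand-in, not the real `HoodieRecordMerger` signature.

```java
// Illustrative interface only; the real HoodieRecordMerger has richer signatures.
public interface RecordMergerSketch {

  Object merge(Object older, Object newer);

  default Object partialMerge(Object older, Object newer, Object readerSchema) {
    // Signal "not implemented" with the standard JDK exception type.
    throw new UnsupportedOperationException(
        "Partial merging logic is not implemented by " + getClass().getName());
  }
}
```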
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375159973 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java: ## @@ -440,7 +443,9 @@ private static void validateRow(InternalRow data, StructType schema) { // corresponding schema has to be provided as well so that it could be properly // serialized (in case it would need to be) boolean isValid = data == null || data instanceof UnsafeRow -|| schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data)); +|| schema != null && (data instanceof HoodieInternalRow +|| data instanceof GenericInternalRow Review Comment: The original indentation is more clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375159611 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java: ## @@ -76,6 +77,45 @@ public Option> merge(HoodieRecord older, Schema oldSc } } + @Override + public Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +ValidationUtils.checkArgument(older.getRecordType() == HoodieRecordType.SPARK); +ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecordType.SPARK); + +if (newer instanceof HoodieSparkRecord) { + HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer; + if (newSparkRecord.isDeleted()) { +// Delete record +return Option.empty(); + } +} else { + if (newer.getData() == null) { +// Delete record +return Option.empty(); + } +} + +if (older instanceof HoodieSparkRecord) { + HoodieSparkRecord oldSparkRecord = (HoodieSparkRecord) older; + if (oldSparkRecord.isDeleted()) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} else { + if (older.getData() == null) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} +if (older.getOrderingValue(oldSchema, props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) { + return Option.of(SparkRecordMergingUtils.mergeCompleteOrPartialRecords( + (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, oldSchema, readerSchema, props)); Review Comment: We can rename the method back to `mergePartialRecords`now ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783687707 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1783686379 [6997.patch.zip](https://github.com/apache/hudi/files/13194360/6997.patch.zip) Thanks for the contribution; I have reviewed it and created a patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783686118 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 03c759528edb8ce1e0997b5b98921fa6343f1f22 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20535) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783686071 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20534) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783686047 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 6ada390928851770f175657a992b08269bcacb28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783684492 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 5b39c6dc67dbe4273bfcc704f644c925e76f8fa2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20523) * 03c759528edb8ce1e0997b5b98921fa6343f1f22 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375155933 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase + @EnumFieldDescription("Multiple writer can perform write ops on a MOR table with simple bucket index, " + + "and defer conflict resolution to compaction phase.") + NON_BLOCKING_CONCURRENCY_CONTROL; + + public boolean supportsConcurrencyControl() { +return this == OPTIMISTIC_CONCURRENCY_CONTROL || this == NON_BLOCKING_CONCURRENCY_CONTROL; Review Comment: `supportsConcurrencyControl()` -> `supportsMultiWriter()` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
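For readers following the review, a minimal sketch of what the suggested rename could look like, assuming only the three enum values visible in the diff above (SINGLE_WRITER, OPTIMISTIC_CONCURRENCY_CONTROL, NON_BLOCKING_CONCURRENCY_CONTROL); this is an illustration of the review suggestion, not the merged code:
```
// Hypothetical sketch of the rename floated in the comment above.
// With exactly three modes, "supports multi-writer" reduces to "not single writer".
public boolean supportsMultiWriter() {
  return this != SINGLE_WRITER;
}
```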
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375154861 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteCopyOnWrite.java: ## @@ -520,27 +523,39 @@ public void testWriteExactlyOnce() throws Exception { // case1: txn2's time range is involved in txn1 // |--- txn1 ---| // | - txn2 - | - @Test - public void testWriteMultiWriterInvolved() throws Exception { -conf.setString(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()); + @ParameterizedTest + @ValueSource(strings = {"OPTIMISTIC_CONCURRENCY_CONTROL", "NON_BLOCKING_CONCURRENCY_CONTROL"}) Review Comment: You can use enumSource instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
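A hedged sketch of the `@EnumSource` alternative suggested here; the test body and the `conf` field are assumed to stay as in the diff, only the parameter source changes:
```
// Sketch only: same test, parameterized over the enum directly instead of raw strings.
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.EnumSource;

@ParameterizedTest
@EnumSource(value = WriteConcurrencyMode.class,
    names = {"OPTIMISTIC_CONCURRENCY_CONTROL", "NON_BLOCKING_CONCURRENCY_CONTROL"})
public void testWriteMultiWriterInvolved(WriteConcurrencyMode mode) throws Exception {
  // The enum constant arrives typed, so no name-based lookup is needed.
  conf.setString(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), mode.name());
  // ... rest of the test unchanged
}
```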
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375154698 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase + @EnumFieldDescription("Multiple writer can perform write ops on a MOR table with simple bucket index, " Review Comment: "Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor." -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375153787 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase Review Comment: No need to comment 2 times. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783676321 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20534) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375153204 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2638,19 +2638,23 @@ public Integer getWritesFileIdEncoding() { } public boolean needResolveWriteConflict(WriteOperationType operationType) { -if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { - // NB-CC don't need to resolve write conflict except bulk insert operation - return WriteOperationType.BULK_INSERT == operationType || !isNonBlockingConcurrencyControl(); -} else { - // SINGLE_WRITER case don't need to resolve write conflict - return false; +WriteConcurrencyMode mode = getWriteConcurrencyMode(); +boolean needResolveConflict; +switch (mode) { + case SINGLE_WRITER: +needResolveConflict = false; Review Comment: Return directly ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
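For context, a sketch of the "return directly" shape being suggested; the method name and enum values come from the diff above, while the exact per-branch behavior (e.g. bulk insert under NB-CC) is inferred from this thread and should be treated as an assumption:
```
public boolean needResolveWriteConflict(WriteOperationType operationType) {
  switch (getWriteConcurrencyMode()) {
    case SINGLE_WRITER:
      return false;  // single writer: nothing to resolve
    case OPTIMISTIC_CONCURRENCY_CONTROL:
      return true;   // OCC: resolve conflicts at commit time
    case NON_BLOCKING_CONCURRENCY_CONTROL:
      return WriteOperationType.BULK_INSERT == operationType;  // NB-CC defers everything else
    default:
      throw new IllegalStateException("Unexpected write concurrency mode");
  }
}
```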
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783674701 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
nsivabalan commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1375149713 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -584,10 +604,16 @@ private Pair>> fetchFromSourc (SchemaProvider) new DelegatingSchemaProvider(props, hoodieSparkContext.jsc(), dataAndCheckpoint.getSchemaProvider(), new SimpleSchemaProvider(hoodieSparkContext.jsc(), targetSchema, props))) .orElse(dataAndCheckpoint.getSchemaProvider()); +if (rowBulkInsert) { + return new InputBatch(transformed, checkpointStr, schemaProvider); Review Comment: we handle empty commits. wrt key gen, it will be handled in BaseDatasetBulkInsertCommitActionExecutor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
nsivabalan commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1375149768 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -541,6 +558,9 @@ private Pair>> fetchFromSourc checkpointStr = dataAndCheckpoint.getCheckpointForNextBatch(); boolean reconcileSchema = props.getBoolean(DataSourceWriteOptions.RECONCILE_SCHEMA().key()); if (this.userProvidedSchemaProvider != null && this.userProvidedSchemaProvider.getTargetSchema() != null) { +if (rowBulkInsert) { + return new InputBatch(transformed, checkpointStr, this.userProvidedSchemaProvider); Review Comment: Yes. The main purpose of using the row writer is performance, but if we were to do this, it might incur a perf hit. So, in V0, I am punting on error table support for row writer use cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783644465 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783642064 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783639371 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783620053 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition. [hudi]
bvaradar commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1783579700 @majian1998 : The conflict resolution happens within the context of LockManager transaction which should serialize the checking. From what I understand, the conflict resolution strategy needs to handle replace commits. We need to incorporate the checking similar to DeletePartitionUtils.checkForPendingTableServiceActions as part of conflict resolution. Currently, DeletePartitionUtils.checkForPendingTableServiceActions is being called outside of transaction scope causing this. Let me know if this makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] "OutOfMemoryError: Requested array size exceeds VM limit" on data ingestion to MOR table [hudi]
mzheng-plaid commented on issue #9934: URL: https://github.com/apache/hudi/issues/9934#issuecomment-1783573941 Is there a way to configure the pretty printer `JsonUtils.getObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(this); ` to be more compact? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
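On the pretty-printer question, plain Jackson already emits compact JSON when the default writer is used instead of `writerWithDefaultPrettyPrinter()`; whether Hudi exposes a switch for this in `JsonUtils` is a separate question, so the snippet below is only a standalone illustration of the Jackson behavior, with a made-up payload:
```
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CompactJsonDemo {
  public static void main(String[] args) throws JsonProcessingException {
    ObjectMapper mapper = new ObjectMapper();
    // Hypothetical payload standing in for commit metadata.
    Object payload = java.util.Map.of("partitionToWriteStats", java.util.List.of(1, 2, 3));

    // Default writer: single-line, compact output.
    System.out.println(mapper.writeValueAsString(payload));

    // Pretty printer: indented, multi-line output (the form referenced in the question).
    System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(payload));
  }
}
```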
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783420799 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] "OutOfMemoryError: Requested array size exceeds VM limit" on data ingestion to MOR table [hudi]
mzheng-plaid opened a new issue, #9934: URL: https://github.com/apache/hudi/issues/9934 **Describe the problem you faced** We have a MOR table that is ingested to using a Spark Structured Streaming pipeline. We are seeing: ``` py4j.protocol.Py4JJavaError: An error occurred while calling o355.save. : java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.lang.StringCoding.encode(StringCoding.java:350) at java.lang.String.getBytes(String.java:941) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:292) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:243) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:126) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:701) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:345) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1$$Lambda$4086/588517446.apply(Unknown Source) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2384/2044625832.apply(Unknown Source) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2383/299085843.apply(Unknown Source) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2373/595359931.apply(Unknown Source) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:626) at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1362/580263940.apply(Unknown Source) ``` It seems like this is happening on commit (ie. 
it's writing the data successfully) and each time it retries it has to rollback (and each rollback is getting more and more expensive). **To Reproduce** Unclear. **Expected behavior** We are not sure how to recover from this bad state. Is this loading in the `deltacommits` from the timeline and trying to create an array that's too large? Or is this stack trace indicating it's a problem with the current batch (we've tried turning down the batch size with no change) EMR 6.10.1 * Hudi version : 0.12.2-amzn-0 * Spark version : 3.3.1 * Hive version : 3.1.3 * Hadoop version : 3.3.3 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : Spark on Docker **Additional context** The `.hoodie` path is below: ``` PRE .aux/ 2023-10-12 15:58:18 0 .aux_$folder$ 2023-10-12 15:58:17 0 .schema_$folder$ 2023-10-12 15:58:17 0 .temp_$folder$ 2023-10-23 11:56:36 13120 20231023185628734.deltacommit 2023-10-23 11:56:33786 20231023185628734.deltacommit.inflight 2023-10-23 11:56:30 0 20231023185628734.deltacommit.requested 2023-10
Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]
arunvasudevan commented on issue #9861: URL: https://github.com/apache/hudi/issues/9861#issuecomment-1783346605 @ad1happy2go Messaged you on Hudi Slack. We can connect more about this issue in slack, Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 4f723fb57ec [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895) 4f723fb57ec is described below commit 4f723fb57ec3d859893d8eee80e1d2b6ceb05d18 Author: Rex(Hui) An AuthorDate: Sat Oct 28 02:14:57 2023 +0800 [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895) CreateIndex is added in [HUDI-4165](https://github.com/apache/hudi/pull/5761/files), and spark 3.3 also include this in [SPARK-36895](https://github.com/apache/spark/pull/34148). Since `CreateIndex` uses same package `org.apache.spark.sql.catalyst.plans.logical` in HUDI and Spark3.3, but params are not same. So it could introduce class conflict issues if we use it. This commit still keeps the same package path with Spark, but changes to 1. Use the same params like Spark, so there should be no class conflict 2. Only support Index related commands from **Spark3.2**, since Spark2 doesn't have `org.apache.spark.sql.catalyst.analysis.FieldName` but `CreateIndex` requires 3. Resolve columns for CreateIndex during Analyze stage --- .../spark/sql/HoodieCatalystPlansUtils.scala | 20 ++- .../hudi/spark/sql/parser/HoodieSqlCommon.g4 | 68 -- .../spark/sql/hudi/analysis/HoodieAnalysis.scala | 37 -- .../spark/sql/hudi/command/IndexCommands.scala | 26 +--- .../sql/parser/HoodieSqlCommonAstBuilder.scala | 142 + .../sql/hudi/command/index/TestIndexSyntax.scala | 97 +++--- .../hudi/command/index/TestSecondaryIndex.scala| 108 .../spark/sql/HoodieSpark2CatalystPlanUtils.scala | 13 ++ .../spark/sql/HoodieSpark30CatalystPlanUtils.scala | 13 +- .../spark/sql/HoodieSpark31CatalystPlanUtils.scala | 13 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark32CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_2ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_2ExtendedSqlParser.scala | 6 +- .../spark/sql/catalyst/plans/logical/Index.scala | 46 ++- .../hudi/analysis/HoodieSpark32PlusAnalysis.scala | 38 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark33CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_3ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_3ExtendedSqlParser.scala | 6 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark34CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_4ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_4ExtendedSqlParser.scala | 6 +- 27 files changed, 909 insertions(+), 378 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala index 9cfe23f86cc..64ee645ba0f 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala @@ -22,7 +22,6 @@ import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression} import org.apache.spark.sql.catalyst.plans.JoinType import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} -import 
org.apache.spark.sql.execution.datasources.HadoopFsRelation import org.apache.spark.sql.internal.SQLConf trait HoodieCatalystPlansUtils { @@ -79,6 +78,25 @@ trait HoodieCatalystPlansUtils { */ def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)] + /** + * Decomposes [[MatchCreateIndex]] into its arguments with accommodation. + */ + def unapplyCreateIndex(plan: LogicalPlan): Option[(LogicalPlan, String, String, Boolean, Seq[(Seq[String], Map[String, String])], Map[String, String])] + + /** + * Decomposes [[MatchDropIndex]] into its arguments with accommodation. + */ + def unapplyDropIndex(plan: LogicalPlan): Option[(LogicalPlan, String, Boolean)] + + /** + * Decomposes [[MatchShowIndexes]] into its arguments with accommodation. + */ + def unapplyShowIndexes(plan: LogicalPlan): Option[(LogicalPlan, Seq[Attribute])] + + /** + * Decomposes [[MatchRefreshIndex]] into its arguments with accommodation. + */ + def unapplyRefreshIndex(plan: LogicalPlan): Option[(LogicalPlan, String)] /**
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
yihua merged PR #9895: URL: https://github.com/apache/hudi/pull/9895 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6977) Upgrade hadoop version from 2.10.1 to 2.10.2
[ https://issues.apache.org/jira/browse/HUDI-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6977. --- Resolution: Fixed > Upgrade hadoop version from 2.10.1 to 2.10.2 > > > Key: HUDI-6977 > URL: https://issues.apache.org/jira/browse/HUDI-6977 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > https://github.com/apache/hudi/pull/9768#discussion_r1370266395 > This release contains important security fixes for log4shell > [https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-common/release/2.10.2/RELEASENOTES.2.10.2.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783299568 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783299911 ## CI report: * c24e21a247718190cf8796e30fd837aa1eb465a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20512) * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783289891 ## CI report: * c24e21a247718190cf8796e30fd837aa1eb465a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20512) * 01519ff2eb856bbdd487162a21df52d8a7c216cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783289502 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 43f9d0bbc914d6bd09bffe1285de8a51960cc979 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20517) * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1783224181 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20528) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
jonvex commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1374684223 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -408,19 +420,25 @@ public Pair, JavaRDD> syncOnce() throws IOException .build(); String instantTime = metaClient.createNewInstantTime(); -Pair>> srcRecordsWithCkpt = readFromSource(instantTime); +InputBatch inputBatch = readFromSource(instantTime); + +if (inputBatch != null) { + final JavaRDD recordsFromSource; + if (rowBulkInsert) { Review Comment: I think it's too much for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1782963803 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20528) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1782950737 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6997) Create a new WriteConcurrencyMode type for non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6997: - Labels: pull-request-available (was: ) > Create a new WriteConcurrencyMode type for non-blocking concurrency control > --- > > Key: HUDI-6997 > URL: https://issues.apache.org/jira/browse/HUDI-6997 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > > Currently, non-blocking concurrency control is enabled if a job config > satisfies 'OCC + MOR + simple bucket index'. > It's a little confusing and might lead to unexpected behavior in some > cases. > It's better to create a new `WriteConcurrencyMode` type for non-blocking > concurrency control. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]
Armelabdelkbir commented on issue #9918: URL: https://github.com/apache/hudi/issues/9918#issuecomment-1782829269 @ad1happy2go My metadata table is disabled in version 0.11.0: "hoodie.metadata.enable" -> "false". Currently I can't upgrade until the client migration is complete. In the meantime, if I run into empty parquet file problems, do I just need to delete them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782771021 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20527) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
ksmou commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374450011 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: > We already have a `hoodie.clustering.max.parallelism` to control how many clustering jobs to submit, now this param looks to control the parallelism when reading per group, and it only works when row writer is disabled. This configuration name still confuses me. > > Is there any other way to avoid a new configuration? What abt we directly use `clusteringGroup.getNumOutputFileGroups` Yes, it limits the read parallelism for a single clustering group. `clusteringGroup.getNumOutputFileGroups` is 2 by default, which is too small to read a 1 GB file. Maybe we can use `hoodie.clustering.rdd.read.parallelism`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
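If the rename floated above were adopted, the declaration might look roughly like the sketch below; the key name `hoodie.clustering.rdd.read.parallelism` is only the alternative mentioned in this thread, and the documentation string is an assumption rather than text from any merged change:
```
import org.apache.hudi.common.config.ConfigProperty;

// Sketch only: same builder pattern as the diff above, with the proposed key name.
public static final ConfigProperty<Integer> CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty
    .key("hoodie.clustering.rdd.read.parallelism")
    .defaultValue(20)
    .withDocumentation("Parallelism used to read records of a single clustering group "
        + "when the row-writer path is disabled.");
```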
Re: [PR] [HUDI-6894] ReflectionUtils is not thread safe [hudi]
danny0405 commented on code in PR #9786: URL: https://github.com/apache/hudi/pull/9786#discussion_r1337998794 ## hudi-common/src/main/java/org/apache/hudi/common/util/ReflectionUtils.java: ## @@ -45,7 +45,7 @@ public class ReflectionUtils { private static final Logger LOG = LoggerFactory.getLogger(ReflectionUtils.class); - private static final Map> CLAZZ_CACHE = new HashMap<>(); + private static final Map> CLAZZ_CACHE = new ConcurrentHashMap<>(); Review Comment: Nice advice~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
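For readers not familiar with the fix being reviewed, a minimal standalone illustration of the pattern under discussion; this is not the actual `ReflectionUtils` code (which wraps failures in Hudi's own exception types), just the general thread-safe cache idiom:
```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class ClassCache {
  // ConcurrentHashMap allows concurrent reads and writes without external synchronization,
  // unlike the plain HashMap that this PR replaces.
  private static final Map<String, Class<?>> CLAZZ_CACHE = new ConcurrentHashMap<>();

  public static Class<?> getClass(String clazzName) {
    // computeIfAbsent loads the class at most once per key, even under contention.
    return CLAZZ_CACHE.computeIfAbsent(clazzName, name -> {
      try {
        return Class.forName(name);
      } catch (ClassNotFoundException e) {
        throw new IllegalStateException("Could not load class " + name, e);
      }
    });
  }
}
```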
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782691713 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20525) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
danny0405 commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374384477 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: Yeah, it looks like the read parallelism for a single `HoodieClusteringGroup`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
boneanxs commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374351423 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: We already have a `hoodie.clustering.max.parallelism` to control how many clustering jobs to submit, now this param looks to control the parallelism when reading per group, and it only works when row writer is disabled. This configuration name still confuses me. Is there any other way to avoid a new configuration? What abt we directly use `clusteringGroup.getNumOutputFileGroups` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] hudi deltastreamer jsonkafka source schema registry fail [hudi]
ad1happy2go commented on issue #9132: URL: https://github.com/apache/hudi/issues/9132#issuecomment-1782637536 @nttq1sub Were you able to resolve this issue? Feel free to close or let us know if you still have issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Upsert operations end up with duplicate data. Range Pruning not working properly with column statistics [hudi]
xicm commented on issue #9870: URL: https://github.com/apache/hudi/issues/9870#issuecomment-1782592138 Hi @ssandona, this issue is caused by the complex key generator: if PARTITION_FIELD is not a single field, the key generator will be the ComplexKeyGenerator, and there is a bug when dealing with complex keys. `hoodie.table.keygenerator.class` in your hoodie.properties may be SimpleKeyGenerator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]Index Bootstrap deleted some snapshot data that has been batch-inserted into Hudi ? [hudi]
imrewang closed issue #9513: [SUPPORT]Index Bootstrap deleted some snapshot data that has been batch-inserted into Hudi ? URL: https://github.com/apache/hudi/issues/9513 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782531655 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20525) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782471159 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20527) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782470903 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782459555 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
[ https://issues.apache.org/jira/browse/HUDI-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6998: - Labels: pull-request-available (was: ) > Fix drop table failure when load table as spark v2 table whose path is delete > - > > Key: HUDI-6998 > URL: https://issues.apache.org/jira/browse/HUDI-6998 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Wechar >Priority: Major > Labels: pull-request-available > > When schema evolution is enabled, hudi will load table as Spark datasource v2 > table, which need get table schema in Analyze stage. > We will first try to load schema from hoodie meta client since > [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw > exception if table directory does not exists, in which case we are unable to > drop this hudi table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
wecharyu opened a new pull request, #9932: URL: https://github.com/apache/hudi/pull/9932 ### Change Logs We will first try to load schema from hoodie meta client since [HUDI-6219](https://issues.apache.org/jira/browse/HUDI-6219), it will throw exception if table directory does not exists, in which case we are unable to drop this hudi table. ```bash 20887 [ScalaTest-run-running-TestDropTable] WARN org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable [] - Failed to load table schema from meta client. org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path file:/private/var/folders/6j/vkbrdtgd0x34sqjzgv7s8ckhgy/T/spark-8a41e162-dfab-4843-afcd-bb672628747d/h0/.hoodie at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:57) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:149) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:734) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:91) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:825) ~[classes/:?] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.metaClient$lzycompute(HoodieCatalogTable.scala:85) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.metaClient(HoodieCatalogTable.scala:83) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.loadTableSchemaByMetaClient(HoodieCatalogTable.scala:322) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.tableSchema$lzycompute(HoodieCatalogTable.scala:138) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.tableSchema(HoodieCatalogTable.scala:137) ~[classes/:3.2.3] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.tableSchema$lzycompute(HoodieInternalV2Table.scala:57) ~[classes/:?] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.tableSchema(HoodieInternalV2Table.scala:57) ~[classes/:?] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.schema(HoodieInternalV2Table.scala:61) ~[classes/:?] at org.apache.spark.sql.catalyst.analysis.ResolvedTable$.create(v2ResolutionPlans.scala:156) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.$anonfun$lookupTableOrView$1(Analyzer.scala:1232) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] ... ``` To prevent this issue, we produce catch and log exception when load table schema by meta client. ### Impact Fix drop table error. ### Risk level (write none, low medium or high below) None. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
[ https://issues.apache.org/jira/browse/HUDI-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wechar updated HUDI-6998: - Description: When schema evolution is enabled, hudi will load table as Spark datasource v2 table, which need get table schema in Analyze stage. We will first try to load schema from hoodie meta client since [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw exception if table directory does not exists, in which case we are unable to drop this hudi table. > Fix drop table failure when load table as spark v2 table whose path is delete > - > > Key: HUDI-6998 > URL: https://issues.apache.org/jira/browse/HUDI-6998 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Wechar >Priority: Major > > When schema evolution is enabled, hudi will load table as Spark datasource v2 > table, which need get table schema in Analyze stage. > We will first try to load schema from hoodie meta client since > [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw > exception if table directory does not exists, in which case we are unable to > drop this hudi table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
Wechar created HUDI-6998: Summary: Fix drop table failure when load table as spark v2 table whose path is delete Key: HUDI-6998 URL: https://issues.apache.org/jira/browse/HUDI-6998 Project: Apache Hudi Issue Type: Bug Reporter: Wechar Assignee: Wechar -- This message was sent by Atlassian Jira (v8.20.10#820010)