Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783725463 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 03c759528edb8ce1e0997b5b98921fa6343f1f22 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20535) * 69bed1eb6e45fda4b2380ae16f78004968570781 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783725481 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) * d59f64bdeb8cc5582a4fa6383dee98bf8b72a082 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783724053 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783724077 ## CI report: * cef842213bef83d99e50a8babc2224c9904b22eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20536) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6495][RFC-66] Non-blocking Concurrency Control [hudi]
peanut-chenzhong commented on PR #7907: URL: https://github.com/apache/hudi/pull/7907#issuecomment-1783720492 For this feature, how do we handle failed writing commits? Will they be rolled back by other writing tasks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179496 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static
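The field-selection rule in the quoted `mergeCompleteOrPartialRecords` diff is easier to follow in isolation. Below is a minimal, self-contained sketch of that rule only — plain Java collections stand in for Spark's `InternalRow`/`StructField` types and all names are illustrative, not the actual Hudi implementation: for every field of the merged schema, take the value from the newer (partial) record when it carries that field, otherwise keep the older record's value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: field-by-field partial merge with plain collections.
public class PartialMergeSketch {

  static List<Object> mergeValues(Map<Integer, String> mergedIdToFieldName,
                                  List<Object> oldValues,
                                  List<Object> newPartialValues,
                                  Map<String, Integer> newPartialNameToId) {
    List<Object> merged = new ArrayList<>(mergedIdToFieldName.size());
    for (int fieldId = 0; fieldId < mergedIdToFieldName.size(); fieldId++) {
      String fieldName = mergedIdToFieldName.get(fieldId);
      Integer ordInPartialUpdate = newPartialNameToId.get(fieldName);
      if (ordInPartialUpdate != null) {
        merged.add(newPartialValues.get(ordInPartialUpdate)); // pick from new (partial) record
      } else {
        merged.add(oldValues.get(fieldId));                   // pick from old record
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<Integer, String> mergedSchema = new LinkedHashMap<>();
    mergedSchema.put(0, "id");
    mergedSchema.put(1, "name");
    mergedSchema.put(2, "price");

    Map<String, Integer> partialFields = new LinkedHashMap<>();
    partialFields.put("id", 0);
    partialFields.put("price", 1);

    // Older record has all fields; the partial update carries only id and price.
    System.out.println(mergeValues(
        mergedSchema,
        List.of(1, "widget", 9.99),
        List.of(1, 12.49),
        partialFields)); // [1, widget, 12.49]
  }
}
```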
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783714779 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) * 00c91e2e52ef18c5880de00c450ad059090efc7d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179372 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179246 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodiePositionBasedFileGroupRecordBuffer.java: ## @@ -84,6 +79,10 @@ public void processDataBlock(HoodieDataBlock dataBlock, Option keySpecO isFullKey = keySpecOpt.get().isFullKey(); } +if (dataBlock.containsPartialUpdates()) { + enablePartialMerging = true; +} Review Comment: The buffer consumes multiple log blocks in the file group. When there are partial updates in one log block, we switch the merging mode and do not switch back going forward to guarantee correct merging results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
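For context, the one-way switch described in this reply can be summarized in a few lines. The sketch below is a self-contained illustration only, with a stand-in `LogBlock` interface instead of Hudi's `HoodieDataBlock` and a simplified buffer class.

```java
// Minimal, self-contained illustration of the one-way merging-mode switch.
interface LogBlock {
  boolean containsPartialUpdates();
}

class RecordBufferSketch {
  private boolean enablePartialMerging = false;

  void processDataBlock(LogBlock block) {
    if (block.containsPartialUpdates()) {
      // Once any block in the file group carries partial updates, stay in
      // partial-merging mode for every later block; never switch back.
      enablePartialMerging = true;
    }
    // ... merge the block's records using the current mode ...
  }

  boolean isPartialMergingEnabled() {
    return enablePartialMerging;
  }
}
```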
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375179085 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +48,9 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + default Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +throw new HoodieException("Partial merging logic is not implemented."); Review Comment: Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178992 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java: ## @@ -440,7 +443,9 @@ private static void validateRow(InternalRow data, StructType schema) { // corresponding schema has to be provided as well so that it could be properly // serialized (in case it would need to be) boolean isValid = data == null || data instanceof UnsafeRow -|| schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data)); +|| schema != null && (data instanceof HoodieInternalRow +|| data instanceof GenericInternalRow Review Comment: The line is too long. I've revised the indentation to make it more clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1783713220 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) * 00c91e2e52ef18c5880de00c450ad059090efc7d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178539 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java: ## @@ -76,6 +77,45 @@ public Option> merge(HoodieRecord older, Schema oldSc } } + @Override + public Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +ValidationUtils.checkArgument(older.getRecordType() == HoodieRecordType.SPARK); +ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecordType.SPARK); + +if (newer instanceof HoodieSparkRecord) { + HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer; + if (newSparkRecord.isDeleted()) { +// Delete record +return Option.empty(); + } +} else { + if (newer.getData() == null) { +// Delete record +return Option.empty(); + } +} + +if (older instanceof HoodieSparkRecord) { + HoodieSparkRecord oldSparkRecord = (HoodieSparkRecord) older; + if (oldSparkRecord.isDeleted()) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} else { + if (older.getData() == null) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} +if (older.getOrderingValue(oldSchema, props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) { + return Option.of(SparkRecordMergingUtils.mergeCompleteOrPartialRecords( + (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, oldSchema, readerSchema, props)); Review Comment: Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
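The argument-order fix above is about which record acts as the base and which supplies the overriding (possibly partial) fields. The following self-contained sketch is an assumption-laden simplification — a `Rec` record with a plain field map stands in for Hudi records — showing the intent: the record with the higher ordering value wins, so when the older record wins, the two arguments are swapped before merging.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the record with the higher ordering value supplies the
// winning fields; when the older record wins, the arguments are swapped.
final class PartialMergeDirectionSketch {

  record Rec(Map<String, Object> fields, long orderingValue) {}

  static Rec apply(Rec base, Rec winner) {
    Map<String, Object> out = new HashMap<>(base.fields());
    out.putAll(winner.fields()); // winner's (possibly partial) fields override the base
    return new Rec(out, winner.orderingValue());
  }

  static Rec partialMerge(Rec older, Rec newer) {
    return older.orderingValue() > newer.orderingValue()
        ? apply(newer, older)   // older has the higher ordering value and wins
        : apply(older, newer);  // usual case: newer wins
  }
}
```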
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178419 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param older Older {@link HoodieSparkRecord}. + * @param oldSchema Old schema. + * @param newer Newer {@link HoodieSparkRecord}. + * @param newSchema New schema. + * @param props Configuration in {@link TypedProperties}. + * @return The merged record. + */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + TypedProperties props) { +Pair, Pair> mappingSchemaPair = +getCachedMergedSchema(oldSchema, newSchema); +boolean isNewerPartial = isPartial(newSchema, mappingSchemaPair.getRight().getRight()); Review Comment: A new `partialMerge` API is created so that partial merging logic is not called if not necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
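To see why a separate `partialMerge` entry point avoids unnecessary work, here is a hedged sketch of how a caller might dispatch between the two APIs. The types are deliberately reduced to `Object` and the class names are hypothetical; the flag corresponds to the `enablePartialMerging` switch discussed elsewhere in this thread.

```java
// Illustrative dispatch only; real Hudi types are replaced with Object here.
interface MergerSketch {
  Object merge(Object older, Object newer);

  Object partialMerge(Object older, Object newer, Object readerSchema);
}

class MergeDispatchSketch {
  private final MergerSketch merger;
  private boolean enablePartialMerging = false; // flipped once a partial-update block is seen

  MergeDispatchSketch(MergerSketch merger) {
    this.merger = merger;
  }

  Object mergeRecords(Object older, Object newer, Object readerSchema) {
    // Complete records stay on the cheaper merge() path; partialMerge() is only
    // invoked once partial updates have actually been observed.
    return enablePartialMerging
        ? merger.partialMerge(older, newer, readerSchema)
        : merger.merge(older, newer);
  }

  void onPartialUpdatesSeen() {
    enablePartialMerging = true;
  }
}
```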
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178329 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param older Older {@link HoodieSparkRecord}. + * @param oldSchema Old schema. + * @param newer Newer {@link HoodieSparkRecord}. + * @param newSchema New schema. + * @param props Configuration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + TypedProperties props) { +Pair, Pair> mappingSchemaPair = +getCachedMergedSchema(oldSchema, newSchema); +boolean isNewerPartial = isPartial(newSchema, mappingSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mappingSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +if (newPartialNameToIdMapping.containsKey(structField.name())) { + // pick from new + int ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mappingSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mappingSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public static Map getCachedFieldIdToFieldMapping(Schema avroSchema) { +return FIELD_ID_TO_FIELD_MAPPING_CACHE.computeIfAbsent(avroSchema, schema -> { + StructType structType = HoodieInternalRowUtil
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375178224 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala: ## @@ -51,6 +61,28 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader: PartitionedFile => requiredSchema: Schema, conf: Configuration): ClosableIterator[InternalRow] = { val fileInfo = sparkAdapter.getSparkPartitionedFileUtils.createPartitionedFile(partitionValues, filePath, start, length) -new CloseableInternalRowIterator(baseFileReader.apply(fileInfo)) +if (filePath.toString.contains(HoodieLogFile.DELTA_EXTENSION)) { Review Comment: `FsUtils.isLogFile` is fixed by considering `inlinefs://` scheme. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
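The actual fix is not shown in this comment. As a rough illustration of the idea — recognizing log files by their file name rather than by the raw path string, so wrapped schemes such as `inlinefs://` are still handled — the sketch below uses a deliberately simplified check that is not Hudi's real `FSUtils` pattern.

```java
// Simplified, illustrative check only; Hudi's actual log-file detection uses a
// stricter file-name pattern and dedicated handling for the inlinefs:// scheme.
final class LogFileCheckSketch {

  static boolean looksLikeLogFile(String filePath) {
    // Look at the last path segment instead of the whole URI string.
    String name = filePath.substring(filePath.lastIndexOf('/') + 1);
    return name.contains(".log.");
  }

  public static void main(String[] args) {
    System.out.println(looksLikeLogFile("hdfs://warehouse/tbl/.f1-0_20231027.log.1"));   // true
    System.out.println(looksLikeLogFile("hdfs://warehouse/tbl/base_20231027.parquet"));  // false
  }
}
```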
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
CTTY commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1783709069 Hi @boneanxs, I'm working on Spark 3.5 support and noticed 2 new failures after cherry-picking your commit. I tried porting these changes for Spark 3.5 but it didn't help. I'm still looking into the test failures but would appreciate it if you could help take a look as well! My PR: #9717 - Failure 1 ``` 2023-10-28T02:40:26.0783904Z - Test Create/Show/Drop Secondary Index *** FAILED *** 2023-10-28T02:40:26.0785667Z org.apache.spark.sql.AnalysisException: CreateIndex is not supported in this table default.h0. 2023-10-28T02:40:26.0787735Z at org.apache.spark.sql.errors.QueryCompilationErrors$.tableIndexNotSupportedError(QueryCompilationErrors.scala:3259) ``` - Failure 2 ``` T03:03:39.6074356Z - Test Create/Drop/Show/Refresh Index *** FAILED *** 2023-10-28T03:03:39.6077467Z java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.CreateIndex cannot be cast to org.apache.spark.sql.hudi.command.CreateIndexCommand 2023-10-28T03:03:39.6079990Z at org.apache.spark.sql.hudi.command.index.TestIndexSyntax.$anonfun$new$3(TestIndexSyntax.scala:67) 2023-10-28T03:03:39.6081642Z at scala.collection.immutable.List.foreach(List.scala:431) 2023-10-28T03:03:39.6143392Z at org.apache.spark.sql.hudi.command.index.TestIndexSyntax.$anonfun$new$2(TestIndexSyntax.scala:34) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783701937 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) * cef842213bef83d99e50a8babc2224c9904b22eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20536) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6999) Add row writer support to Deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6999: - Labels: pull-request-available (was: ) > Add row writer support to Deltastreamer > --- > > Key: HUDI-6999 > URL: https://issues.apache.org/jira/browse/HUDI-6999 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer > Reporter: sivabalan narayanan > Priority: Major > Labels: pull-request-available > > We have not yet leveraged row writer support in Deltastreamer. We can benefit > from perf improvement if we can integrate it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6999] Adding row writer support to HoodieStreamer [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783700200 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) * cef842213bef83d99e50a8babc2224c9904b22eb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375161773 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public st
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375161288 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/merge/SparkRecordMergingUtils.java: ## @@ -0,0 +1,189 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.merge; + +import org.apache.hudi.AvroConversionUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.model.HoodieSparkRecord; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; +import org.apache.spark.sql.HoodieInternalRowUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Util class to merge records that may contain partial updates. + * This can be plugged into any Spark {@link HoodieRecordMerger} implementation. + */ +public class SparkRecordMergingUtils { + private static final Map> FIELD_ID_TO_FIELD_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map> FIELD_NAME_TO_ID_MAPPING_CACHE = new ConcurrentHashMap<>(); + private static final Map, Schema>, + Pair, Pair>> MERGED_SCHEMA_CACHE = new ConcurrentHashMap<>(); + + /** + * Merges records which can contain partial updates. + * + * @param olderOlder {@link HoodieSparkRecord}. + * @param oldSchemaOld schema. + * @param newerNewer {@link HoodieSparkRecord}. + * @param newSchemaNew schema. + * @param readerSchema Reader schema containing all the fields to read. + * @param propsConfiguration in {@link TypedProperties}. + * @return The merged record. 
+ */ + public static Pair mergeCompleteOrPartialRecords(HoodieSparkRecord older, + Schema oldSchema, + HoodieSparkRecord newer, + Schema newSchema, + Schema readerSchema, + TypedProperties props) { +Pair, Pair> mergedSchemaPair = +getCachedMergedSchema(oldSchema, newSchema, readerSchema); +boolean isNewerPartial = isPartial(newSchema, mergedSchemaPair.getRight().getRight()); +if (isNewerPartial) { + InternalRow oldRow = older.getData(); + InternalRow newPartialRow = newer.getData(); + + Map mergedIdToFieldMapping = mergedSchemaPair.getLeft(); + Map newPartialNameToIdMapping = getCachedFieldNameToIdMapping(newSchema); + List values = new ArrayList<>(mergedIdToFieldMapping.size()); + for (int fieldId = 0; fieldId < mergedIdToFieldMapping.size(); fieldId++) { +StructField structField = mergedIdToFieldMapping.get(fieldId); +Integer ordInPartialUpdate = newPartialNameToIdMapping.get(structField.name()); +if (ordInPartialUpdate != null) { + // pick from new + values.add(newPartialRow.get(ordInPartialUpdate, structField.dataType())); +} else { + // pick from old + values.add(oldRow.get(fieldId, structField.dataType())); +} + } + InternalRow mergedRow = new GenericInternalRow(values.toArray()); + + HoodieSparkRecord mergedSparkRecord = new HoodieSparkRecord( + mergedRow, mergedSchemaPair.getRight().getLeft()); + return Pair.of(mergedSparkRecord, mergedSchemaPair.getRight().getRight()); +} else { + return Pair.of(newer, newSchema); +} + } + + /** + * @param avroSchema Avro schema. + * @return The field ID to {@link StructField} instance mapping. + */ + public st
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375160946 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodiePositionBasedFileGroupRecordBuffer.java: ## @@ -84,6 +79,10 @@ public void processDataBlock(HoodieDataBlock dataBlock, Option keySpecO isFullKey = keySpecOpt.get().isFullKey(); } +if (dataBlock.containsPartialUpdates()) { + enablePartialMerging = true; +} Review Comment: Is there any chance this flag switches back to false? Can the buffer consume multiple log blocks? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6999) Add row writer support to Deltastreamer
sivabalan narayanan created HUDI-6999: - Summary: Add row writer support to Deltastreamer Key: HUDI-6999 URL: https://issues.apache.org/jira/browse/HUDI-6999 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: sivabalan narayanan We have not yet leveraged row writer support in Deltastreamer. We can benefit from perf improvement if we can integrate it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375160672 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +48,9 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + default Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +throw new HoodieException("Partial merging logic is not implemented."); Review Comment: Throw `UnsupportedOperationException` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
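A sketch of the suggestion: use the standard JDK exception for an unimplemented default method. The interface below is a simplified stand-in, not the real `HoodieRecordMerger` signature.

```java
// Illustrative interface only; the real HoodieRecordMerger has richer signatures.
public interface RecordMergerSketch {

  Object merge(Object older, Object newer);

  default Object partialMerge(Object older, Object newer, Object readerSchema) {
    // Signal "not implemented" with the standard JDK exception type.
    throw new UnsupportedOperationException(
        "Partial merging logic is not implemented by " + getClass().getName());
  }
}
```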
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375159973 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java: ## @@ -440,7 +443,9 @@ private static void validateRow(InternalRow data, StructType schema) { // corresponding schema has to be provided as well so that it could be properly // serialized (in case it would need to be) boolean isValid = data == null || data instanceof UnsafeRow -|| schema != null && (data instanceof HoodieInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data)); +|| schema != null && (data instanceof HoodieInternalRow +|| data instanceof GenericInternalRow Review Comment: The original indentation is more clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1375159611 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java: ## @@ -76,6 +77,45 @@ public Option> merge(HoodieRecord older, Schema oldSc } } + @Override + public Option> partialMerge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, Schema readerSchema, TypedProperties props) throws IOException { +ValidationUtils.checkArgument(older.getRecordType() == HoodieRecordType.SPARK); +ValidationUtils.checkArgument(newer.getRecordType() == HoodieRecordType.SPARK); + +if (newer instanceof HoodieSparkRecord) { + HoodieSparkRecord newSparkRecord = (HoodieSparkRecord) newer; + if (newSparkRecord.isDeleted()) { +// Delete record +return Option.empty(); + } +} else { + if (newer.getData() == null) { +// Delete record +return Option.empty(); + } +} + +if (older instanceof HoodieSparkRecord) { + HoodieSparkRecord oldSparkRecord = (HoodieSparkRecord) older; + if (oldSparkRecord.isDeleted()) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} else { + if (older.getData() == null) { +// use natural order for delete record +return Option.of(Pair.of(newer, newSchema)); + } +} +if (older.getOrderingValue(oldSchema, props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) { + return Option.of(SparkRecordMergingUtils.mergeCompleteOrPartialRecords( + (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, oldSchema, readerSchema, props)); Review Comment: We can rename the method back to `mergePartialRecords`now ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783687707 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1783686379 [6997.patch.zip](https://github.com/apache/hudi/files/13194360/6997.patch.zip) Thanks for the contribution; I have reviewed it and created a patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783686118 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 03c759528edb8ce1e0997b5b98921fa6343f1f22 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20535) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783686071 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20534) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783686047 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 6ada390928851770f175657a992b08269bcacb28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1783684492 ## CI report: * 78480b553a25033b4f642518c92402e6acd181b4 UNKNOWN * 5b39c6dc67dbe4273bfcc704f644c925e76f8fa2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20523) * 03c759528edb8ce1e0997b5b98921fa6343f1f22 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375155933 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase + @EnumFieldDescription("Multiple writer can perform write ops on a MOR table with simple bucket index, " + + "and defer conflict resolution to compaction phase.") + NON_BLOCKING_CONCURRENCY_CONTROL; + + public boolean supportsConcurrencyControl() { +return this == OPTIMISTIC_CONCURRENCY_CONTROL || this == NON_BLOCKING_CONCURRENCY_CONTROL; Review Comment: `supportsConcurrencyControl()` -> `supportsMultiWriter()` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
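For readers following the review, a minimal sketch of what the suggested rename could look like, assuming only the three enum values visible in the diff above (SINGLE_WRITER, OPTIMISTIC_CONCURRENCY_CONTROL, NON_BLOCKING_CONCURRENCY_CONTROL); this is an illustration of the review suggestion, not the merged code:
```
// Hypothetical sketch of the rename floated in the comment above.
// With exactly three modes, "supports multi-writer" reduces to "not single writer".
public boolean supportsMultiWriter() {
  return this != SINGLE_WRITER;
}
```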
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375154861 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteCopyOnWrite.java: ## @@ -520,27 +523,39 @@ public void testWriteExactlyOnce() throws Exception { // case1: txn2's time range is involved in txn1 // |--- txn1 ---| // | - txn2 - | - @Test - public void testWriteMultiWriterInvolved() throws Exception { -conf.setString(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()); + @ParameterizedTest + @ValueSource(strings = {"OPTIMISTIC_CONCURRENCY_CONTROL", "NON_BLOCKING_CONCURRENCY_CONTROL"}) Review Comment: You can use enumSource instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
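A hedged sketch of the `@EnumSource` alternative suggested here; the test body and the `conf` field are assumed to stay as in the diff, only the parameter source changes:
```
// Sketch only: same test, parameterized over the enum directly instead of raw strings.
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.EnumSource;

@ParameterizedTest
@EnumSource(value = WriteConcurrencyMode.class,
    names = {"OPTIMISTIC_CONCURRENCY_CONTROL", "NON_BLOCKING_CONCURRENCY_CONTROL"})
public void testWriteMultiWriterInvolved(WriteConcurrencyMode mode) throws Exception {
  // The enum constant arrives typed, so no name-based lookup is needed.
  conf.setString(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), mode.name());
  // ... rest of the test unchanged
}
```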
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375154698 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase + @EnumFieldDescription("Multiple writer can perform write ops on a MOR table with simple bucket index, " Review Comment: "Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor." -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375153787 ## hudi-common/src/main/java/org/apache/hudi/common/model/WriteConcurrencyMode.java: ## @@ -35,9 +35,15 @@ public enum WriteConcurrencyMode { @EnumFieldDescription("Multiple writers can operate on the table with lazy conflict resolution " + "using locks. This means that only one writer succeeds if multiple writers write to the " + "same file group.") - OPTIMISTIC_CONCURRENCY_CONTROL; + OPTIMISTIC_CONCURRENCY_CONTROL, - public boolean supportsOptimisticConcurrencyControl() { -return this == OPTIMISTIC_CONCURRENCY_CONTROL; + // Multiple writer can perform write ops on a MOR table with simple bucket index, and defer conflict + // resolution to compaction phase Review Comment: No need to comment 2 times. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783676321 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20534) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
danny0405 commented on code in PR #9933: URL: https://github.com/apache/hudi/pull/9933#discussion_r1375153204 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2638,19 +2638,23 @@ public Integer getWritesFileIdEncoding() { } public boolean needResolveWriteConflict(WriteOperationType operationType) { -if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { - // NB-CC don't need to resolve write conflict except bulk insert operation - return WriteOperationType.BULK_INSERT == operationType || !isNonBlockingConcurrencyControl(); -} else { - // SINGLE_WRITER case don't need to resolve write conflict - return false; +WriteConcurrencyMode mode = getWriteConcurrencyMode(); +boolean needResolveConflict; +switch (mode) { + case SINGLE_WRITER: +needResolveConflict = false; Review Comment: Return directly ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
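For context, a sketch of the "return directly" shape being suggested; the method name and enum values come from the diff above, while the exact per-branch behavior (e.g. bulk insert under NB-CC) is inferred from this thread and should be treated as an assumption:
```
public boolean needResolveWriteConflict(WriteOperationType operationType) {
  switch (getWriteConcurrencyMode()) {
    case SINGLE_WRITER:
      return false;  // single writer: nothing to resolve
    case OPTIMISTIC_CONCURRENCY_CONTROL:
      return true;   // OCC: resolve conflicts at commit time
    case NON_BLOCKING_CONCURRENCY_CONTROL:
      return WriteOperationType.BULK_INSERT == operationType;  // NB-CC defers everything else
    default:
      throw new IllegalStateException("Unexpected write concurrency mode");
  }
}
```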
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783674701 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) * 7fed09478a1604de3d3dd3ad1a95f02b29951b46 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
nsivabalan commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1375149713 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -584,10 +604,16 @@ private Pair>> fetchFromSourc (SchemaProvider) new DelegatingSchemaProvider(props, hoodieSparkContext.jsc(), dataAndCheckpoint.getSchemaProvider(), new SimpleSchemaProvider(hoodieSparkContext.jsc(), targetSchema, props))) .orElse(dataAndCheckpoint.getSchemaProvider()); +if (rowBulkInsert) { + return new InputBatch(transformed, checkpointStr, schemaProvider); Review Comment: we handle empty commits. wrt key gen, it will be handled in BaseDatasetBulkInsertCommitActionExecutor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
nsivabalan commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1375149768 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -541,6 +558,9 @@ private Pair>> fetchFromSourc checkpointStr = dataAndCheckpoint.getCheckpointForNextBatch(); boolean reconcileSchema = props.getBoolean(DataSourceWriteOptions.RECONCILE_SCHEMA().key()); if (this.userProvidedSchemaProvider != null && this.userProvidedSchemaProvider.getTargetSchema() != null) { +if (rowBulkInsert) { + return new InputBatch(transformed, checkpointStr, this.userProvidedSchemaProvider); Review Comment: Yes. The main purpose of using the row writer is performance, but if we were to do this, it might incur a perf hit. So, in V0, I am punting on error table support for row writer use cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783644465 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783642064 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) * ef1785580edd0f4514576ec261aa0d9a17f71aa4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783639371 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20532) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1783620053 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 27291b347c470e5b25358b976b430398f4aa3e5f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20521) * 6ada390928851770f175657a992b08269bcacb28 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition. [hudi]
bvaradar commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1783579700 @majian1998 : The conflict resolution happens within the context of LockManager transaction which should serialize the checking. From what I understand, the conflict resolution strategy needs to handle replace commits. We need to incorporate the checking similar to DeletePartitionUtils.checkForPendingTableServiceActions as part of conflict resolution. Currently, DeletePartitionUtils.checkForPendingTableServiceActions is being called outside of transaction scope causing this. Let me know if this makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] "OutOfMemoryError: Requested array size exceeds VM limit" on data ingestion to MOR table [hudi]
mzheng-plaid commented on issue #9934: URL: https://github.com/apache/hudi/issues/9934#issuecomment-1783573941 Is there a way to configure the pretty printer `JsonUtils.getObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(this); ` to be more compact? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
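On the pretty-printer question, plain Jackson already emits compact JSON when the default writer is used instead of `writerWithDefaultPrettyPrinter()`; whether Hudi exposes a switch for this in `JsonUtils` is a separate question, so the snippet below is only a standalone illustration of the Jackson behavior, with a made-up payload:
```
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CompactJsonDemo {
  public static void main(String[] args) throws JsonProcessingException {
    ObjectMapper mapper = new ObjectMapper();
    // Hypothetical payload standing in for commit metadata.
    Object payload = java.util.Map.of("partitionToWriteStats", java.util.List.of(1, 2, 3));

    // Default writer: single-line, compact output.
    System.out.println(mapper.writeValueAsString(payload));

    // Pretty printer: indented, multi-line output (the form referenced in the question).
    System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(payload));
  }
}
```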
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783420799 ## CI report: * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] "OutOfMemoryError: Requested array size exceeds VM limit" on data ingestion to MOR table [hudi]
mzheng-plaid opened a new issue, #9934: URL: https://github.com/apache/hudi/issues/9934 **Describe the problem you faced** We have a MOR table that is ingested to using a Spark Structured Streaming pipeline. We are seeing: ``` py4j.protocol.Py4JJavaError: An error occurred while calling o355.save. : java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.lang.StringCoding.encode(StringCoding.java:350) at java.lang.String.getBytes(String.java:941) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:292) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:243) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:126) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:701) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:345) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1$$Lambda$4086/588517446.apply(Unknown Source) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2384/2044625832.apply(Unknown Source) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2383/299085843.apply(Unknown Source) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2373/595359931.apply(Unknown Source) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:626) at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1362/580263940.apply(Unknown Source) ``` It seems like this is happening on commit (ie. 
it's writing the data successfully) and each time it retries it has to rollback (and each rollback is getting more and more expensive). **To Reproduce** Unclear. **Expected behavior** We are not sure how to recover from this bad state. Is this loading in the `deltacommits` from the timeline and trying to create an array that's too large? Or is this stack trace indicating it's a problem with the current batch (we've tried turning down the batch size with no change) EMR 6.10.1 * Hudi version : 0.12.2-amzn-0 * Spark version : 3.3.1 * Hive version : 3.1.3 * Hadoop version : 3.3.3 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : Spark on Docker **Additional context** The `.hoodie` path is below: ``` PRE .aux/ 2023-10-12 15:58:18 0 .aux_$folder$ 2023-10-12 15:58:17 0 .schema_$folder$ 2023-10-12 15:58:17 0 .temp_$folder$ 2023-10-23 11:56:36 13120 20231023185628734.deltacommit 2023-10-23 11:56:33786 20231023185628734.deltacommit.inflight 2023-10-23 11:56:30 0 20231023185628734.deltacommit.requested 2023-10
Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]
arunvasudevan commented on issue #9861: URL: https://github.com/apache/hudi/issues/9861#issuecomment-1783346605 @ad1happy2go Messaged you on Hudi Slack. We can connect more about this issue in slack, Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 4f723fb57ec [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895) 4f723fb57ec is described below commit 4f723fb57ec3d859893d8eee80e1d2b6ceb05d18 Author: Rex(Hui) An AuthorDate: Sat Oct 28 02:14:57 2023 +0800 [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 (#9895) CreateIndex is added in [HUDI-4165](https://github.com/apache/hudi/pull/5761/files), and spark 3.3 also include this in [SPARK-36895](https://github.com/apache/spark/pull/34148). Since `CreateIndex` uses same package `org.apache.spark.sql.catalyst.plans.logical` in HUDI and Spark3.3, but params are not same. So it could introduce class conflict issues if we use it. This commit still keeps the same package path with Spark, but changes to 1. Use the same params like Spark, so there should be no class conflict 2. Only support Index related commands from **Spark3.2**, since Spark2 doesn't have `org.apache.spark.sql.catalyst.analysis.FieldName` but `CreateIndex` requires 3. Resolve columns for CreateIndex during Analyze stage --- .../spark/sql/HoodieCatalystPlansUtils.scala | 20 ++- .../hudi/spark/sql/parser/HoodieSqlCommon.g4 | 68 -- .../spark/sql/hudi/analysis/HoodieAnalysis.scala | 37 -- .../spark/sql/hudi/command/IndexCommands.scala | 26 +--- .../sql/parser/HoodieSqlCommonAstBuilder.scala | 142 + .../sql/hudi/command/index/TestIndexSyntax.scala | 97 +++--- .../hudi/command/index/TestSecondaryIndex.scala| 108 .../spark/sql/HoodieSpark2CatalystPlanUtils.scala | 13 ++ .../spark/sql/HoodieSpark30CatalystPlanUtils.scala | 13 +- .../spark/sql/HoodieSpark31CatalystPlanUtils.scala | 13 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark32CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_2ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_2ExtendedSqlParser.scala | 6 +- .../spark/sql/catalyst/plans/logical/Index.scala | 46 ++- .../hudi/analysis/HoodieSpark32PlusAnalysis.scala | 38 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark33CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_3ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_3ExtendedSqlParser.scala | 6 +- .../src/main/antlr4/imports/SqlBase.g4 | 32 + .../apache/hudi/spark/sql/parser/HoodieSqlBase.g4 | 7 + .../spark/sql/HoodieSpark34CatalystPlanUtils.scala | 38 +- .../HoodieSpark3_4ExtendedSqlAstBuilder.scala | 139 .../parser/HoodieSpark3_4ExtendedSqlParser.scala | 6 +- 27 files changed, 909 insertions(+), 378 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala index 9cfe23f86cc..64ee645ba0f 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala @@ -22,7 +22,6 @@ import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression} import org.apache.spark.sql.catalyst.plans.JoinType import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} -import 
org.apache.spark.sql.execution.datasources.HadoopFsRelation import org.apache.spark.sql.internal.SQLConf trait HoodieCatalystPlansUtils { @@ -79,6 +78,25 @@ trait HoodieCatalystPlansUtils { */ def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)] + /** + * Decomposes [[MatchCreateIndex]] into its arguments with accommodation. + */ + def unapplyCreateIndex(plan: LogicalPlan): Option[(LogicalPlan, String, String, Boolean, Seq[(Seq[String], Map[String, String])], Map[String, String])] + + /** + * Decomposes [[MatchDropIndex]] into its arguments with accommodation. + */ + def unapplyDropIndex(plan: LogicalPlan): Option[(LogicalPlan, String, Boolean)] + + /** + * Decomposes [[MatchShowIndexes]] into its arguments with accommodation. + */ + def unapplyShowIndexes(plan: LogicalPlan): Option[(LogicalPlan, Seq[Attribute])] + + /** + * Decomposes [[MatchRefreshIndex]] into its arguments with accommodation. + */ + def unapplyRefreshIndex(plan: LogicalPlan): Option[(LogicalPlan, String)] /**
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
yihua merged PR #9895: URL: https://github.com/apache/hudi/pull/9895 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6977) Upgrade hadoop version from 2.10.1 to 2.10.2
[ https://issues.apache.org/jira/browse/HUDI-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6977. --- Resolution: Fixed > Upgrade hadoop version from 2.10.1 to 2.10.2 > > > Key: HUDI-6977 > URL: https://issues.apache.org/jira/browse/HUDI-6977 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > https://github.com/apache/hudi/pull/9768#discussion_r1370266395 > This release contains important security fixes for log4shell > [https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-common/release/2.10.2/RELEASENOTES.2.10.2.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783299568 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20529) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783299911 ## CI report: * c24e21a247718190cf8796e30fd837aa1eb465a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20512) * 01519ff2eb856bbdd487162a21df52d8a7c216cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
hudi-bot commented on PR #9913: URL: https://github.com/apache/hudi/pull/9913#issuecomment-1783289891 ## CI report: * c24e21a247718190cf8796e30fd837aa1eb465a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20512) * 01519ff2eb856bbdd487162a21df52d8a7c216cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1783289502 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 43f9d0bbc914d6bd09bffe1285de8a51960cc979 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20517) * 56ed4a1d63a9cfb770eabe7be5d2506e41f37f90 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1783224181 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20528) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] first draft [hudi]
jonvex commented on code in PR #9913: URL: https://github.com/apache/hudi/pull/9913#discussion_r1374684223 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java: ## @@ -408,19 +420,25 @@ public Pair, JavaRDD> syncOnce() throws IOException .build(); String instantTime = metaClient.createNewInstantTime(); -Pair>> srcRecordsWithCkpt = readFromSource(instantTime); +InputBatch inputBatch = readFromSource(instantTime); + +if (inputBatch != null) { + final JavaRDD recordsFromSource; + if (rowBulkInsert) { Review Comment: I think it's too much for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1782963803 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20528) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6997] Introduce a new WriteConcurrencyMode type for non-blocking concurrency control [hudi]
hudi-bot commented on PR #9933: URL: https://github.com/apache/hudi/pull/9933#issuecomment-1782950737 ## CI report: * 6b387c7ea618d5b9047ea8261a4d5fe90c5098cd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6997) Create a new WriteConcurrencyMode type for non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6997: - Labels: pull-request-available (was: ) > Create a new WriteConcurrencyMode type for non-blocking concurrency control > --- > > Key: HUDI-6997 > URL: https://issues.apache.org/jira/browse/HUDI-6997 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > > Currently, non-blocking concurrency control is enabled if a job config > satisfies 'OCC + MOR + simple bucket index'. > It's a little confusing and might lead to unexpected behavior in some > cases. > It's better to create a new `WriteConcurrencyMode` type for non-blocking > concurrency control. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]
Armelabdelkbir commented on issue #9918: URL: https://github.com/apache/hudi/issues/9918#issuecomment-1782829269 @ad1happy2go My metadata table is disabled in version 0.11.0: "hoodie.metadata.enable" -> "false". Currently I can't upgrade until the client migration is complete. In the meantime, if I run into empty parquet file problems, do I just need to delete them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782771021 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20527) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
ksmou commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374450011 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: > We already have a `hoodie.clustering.max.parallelism` to control how many clustering jobs to submit, now this param looks to control the parallelism when reading per group, and it only works when row writer is disabled. This configuration name still confuses me. > > Is there any other way to avoid a new configuration? What abt we directly use `clusteringGroup.getNumOutputFileGroups` Yes, it limits the read parallelism for a single clustering group. `clusteringGroup.getNumOutputFileGroups` is 2 by default, which is too small to read a 1 GB file. Maybe we can use `hoodie.clustering.rdd.read.parallelism`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
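If the rename floated above were adopted, the declaration might look roughly like the sketch below; the key name `hoodie.clustering.rdd.read.parallelism` is only the alternative mentioned in this thread, and the documentation string is an assumption rather than text from any merged change:
```
import org.apache.hudi.common.config.ConfigProperty;

// Sketch only: same builder pattern as the diff above, with the proposed key name.
public static final ConfigProperty<Integer> CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty
    .key("hoodie.clustering.rdd.read.parallelism")
    .defaultValue(20)
    .withDocumentation("Parallelism used to read records of a single clustering group "
        + "when the row-writer path is disabled.");
```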
Re: [PR] [HUDI-6894] ReflectionUtils is not thread safe [hudi]
danny0405 commented on code in PR #9786: URL: https://github.com/apache/hudi/pull/9786#discussion_r1337998794 ## hudi-common/src/main/java/org/apache/hudi/common/util/ReflectionUtils.java: ## @@ -45,7 +45,7 @@ public class ReflectionUtils { private static final Logger LOG = LoggerFactory.getLogger(ReflectionUtils.class); - private static final Map> CLAZZ_CACHE = new HashMap<>(); + private static final Map> CLAZZ_CACHE = new ConcurrentHashMap<>(); Review Comment: Nice advice~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
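For readers not familiar with the fix being reviewed, a minimal standalone illustration of the pattern under discussion; this is not the actual `ReflectionUtils` code (which wraps failures in Hudi's own exception types), just the general thread-safe cache idiom:
```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class ClassCache {
  // ConcurrentHashMap allows concurrent reads and writes without external synchronization,
  // unlike the plain HashMap that this PR replaces.
  private static final Map<String, Class<?>> CLAZZ_CACHE = new ConcurrentHashMap<>();

  public static Class<?> getClass(String clazzName) {
    // computeIfAbsent loads the class at most once per key, even under contention.
    return CLAZZ_CACHE.computeIfAbsent(clazzName, name -> {
      try {
        return Class.forName(name);
      } catch (ClassNotFoundException e) {
        throw new IllegalStateException("Could not load class " + name, e);
      }
    });
  }
}
```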
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782691713 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20525) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
danny0405 commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374384477 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: Yeah, it looks like the read parallelism for a single `HoodieClusteringGroup`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6990] Configurable clustering task parallelism [hudi]
boneanxs commented on code in PR #9925: URL: https://github.com/apache/hudi/pull/9925#discussion_r1374351423 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java: ## @@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig { + "value will let the clustering job run faster, while it will give additional pressure to the " + "execution engines to manage more concurrent running jobs."); + public static final ConfigProperty CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty + .key("hoodie.clustering.read.records.parallelism") + .defaultValue(20) Review Comment: We already have a `hoodie.clustering.max.parallelism` to control how many clustering jobs to submit, now this param looks to control the parallelism when reading per group, and it only works when row writer is disabled. This configuration name still confuses me. Is there any other way to avoid a new configuration? What abt we directly use `clusteringGroup.getNumOutputFileGroups` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] hudi deltastreamer jsonkafka source schema registry fail [hudi]
ad1happy2go commented on issue #9132: URL: https://github.com/apache/hudi/issues/9132#issuecomment-1782637536 @nttq1sub Were you able to resolve this issue? Feel free to close or let us know if you still have issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Upsert operations end up with duplicate data. Range Pruning not working properly with column statistics [hudi]
xicm commented on issue #9870: URL: https://github.com/apache/hudi/issues/9870#issuecomment-1782592138 Hi @ssandona, this issue is caused by the complex key generator: if PARTITION_FIELD is not a single field, the key generator will be the ComplexKeyGenerator, and there is a bug when dealing with complex keys. `hoodie.table.keygenerator.class` in your hoodie.properties may be SimpleKeyGenerator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]Index Bootstrap deleted some snapshot data that has been batch-inserted into Hudi ? [hudi]
imrewang closed issue #9513: [SUPPORT]Index Bootstrap deleted some snapshot data that has been batch-inserted into Hudi ? URL: https://github.com/apache/hudi/issues/9513 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782531655 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20525) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782471159 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20527) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1782470903 ## CI report: * f462cf7b36016948969adb5083057a9ecbd6f794 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
hudi-bot commented on PR #9932: URL: https://github.com/apache/hudi/pull/9932#issuecomment-1782459555 ## CI report: * ecb0c5b114b55f86e02473ffd88202b56a3b7c36 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
[ https://issues.apache.org/jira/browse/HUDI-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6998: - Labels: pull-request-available (was: ) > Fix drop table failure when load table as spark v2 table whose path is delete > - > > Key: HUDI-6998 > URL: https://issues.apache.org/jira/browse/HUDI-6998 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Wechar >Priority: Major > Labels: pull-request-available > > When schema evolution is enabled, hudi will load table as Spark datasource v2 > table, which need get table schema in Analyze stage. > We will first try to load schema from hoodie meta client since > [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw > exception if table directory does not exists, in which case we are unable to > drop this hudi table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6998] Fix drop table failure when load table as spark v2 table whose path is delete [hudi]
wecharyu opened a new pull request, #9932: URL: https://github.com/apache/hudi/pull/9932 ### Change Logs We will first try to load schema from hoodie meta client since [HUDI-6219](https://issues.apache.org/jira/browse/HUDI-6219), it will throw exception if table directory does not exists, in which case we are unable to drop this hudi table. ```bash 20887 [ScalaTest-run-running-TestDropTable] WARN org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable [] - Failed to load table schema from meta client. org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path file:/private/var/folders/6j/vkbrdtgd0x34sqjzgv7s8ckhgy/T/spark-8a41e162-dfab-4843-afcd-bb672628747d/h0/.hoodie at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:57) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:149) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:734) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:91) ~[classes/:?] at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:825) ~[classes/:?] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.metaClient$lzycompute(HoodieCatalogTable.scala:85) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.metaClient(HoodieCatalogTable.scala:83) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.loadTableSchemaByMetaClient(HoodieCatalogTable.scala:322) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.tableSchema$lzycompute(HoodieCatalogTable.scala:138) ~[classes/:3.2.3] at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.tableSchema(HoodieCatalogTable.scala:137) ~[classes/:3.2.3] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.tableSchema$lzycompute(HoodieInternalV2Table.scala:57) ~[classes/:?] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.tableSchema(HoodieInternalV2Table.scala:57) ~[classes/:?] at org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table.schema(HoodieInternalV2Table.scala:61) ~[classes/:?] at org.apache.spark.sql.catalyst.analysis.ResolvedTable$.create(v2ResolutionPlans.scala:156) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.$anonfun$lookupTableOrView$1(Analyzer.scala:1232) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] ... ``` To prevent this issue, we produce catch and log exception when load table schema by meta client. ### Impact Fix drop table error. ### Risk level (write none, low medium or high below) None. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
[ https://issues.apache.org/jira/browse/HUDI-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wechar updated HUDI-6998: - Description: When schema evolution is enabled, hudi will load table as Spark datasource v2 table, which need get table schema in Analyze stage. We will first try to load schema from hoodie meta client since [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw exception if table directory does not exists, in which case we are unable to drop this hudi table. > Fix drop table failure when load table as spark v2 table whose path is delete > - > > Key: HUDI-6998 > URL: https://issues.apache.org/jira/browse/HUDI-6998 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Wechar >Priority: Major > > When schema evolution is enabled, hudi will load table as Spark datasource v2 > table, which need get table schema in Analyze stage. > We will first try to load schema from hoodie meta client since > [HUDI-6219|https://issues.apache.org/jira/browse/HUDI-6219], it will throw > exception if table directory does not exists, in which case we are unable to > drop this hudi table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6998) Fix drop table failure when load table as spark v2 table whose path is delete
Wechar created HUDI-6998: Summary: Fix drop table failure when load table as spark v2 table whose path is delete Key: HUDI-6998 URL: https://issues.apache.org/jira/browse/HUDI-6998 Project: Apache Hudi Issue Type: Bug Reporter: Wechar Assignee: Wechar -- This message was sent by Atlassian Jira (v8.20.10#820010)