[GitHub] [hudi] KnightChess closed pull request #9158: [MINOR] Unpersist only relevant metadata table RDDs
KnightChess closed pull request #9158: [MINOR] Unpersist only relevant metadata table RDDs URL: https://github.com/apache/hudi/pull/9158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] KnightChess opened a new pull request, #9158: [MINOR] Unpersist only relevant metadata table RDDs
KnightChess opened a new pull request, #9158: URL: https://github.com/apache/hudi/pull/9158 ### Change Logs #7914 only removes the base table's cached RDDs; if the metadata table (MDT) is enabled, the MDT's cached RDDs need to be unpersisted too. ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
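For context, a minimal sketch of the idea behind the change, not the actual Hudi patch: track the RDDs cached for a given table — the base table's and, when the metadata table is enabled, the metadata table's — and unpersist only those once the commit finishes, leaving caches that belong to other tables in the same SparkContext untouched. The `TableRddCache` helper below is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Hypothetical helper, not part of Hudi: one instance per table (base table or metadata table).
public class TableRddCache {

  private final List<JavaRDD<?>> cachedRdds = new ArrayList<>();

  // Called wherever the write client persists an RDD that belongs to this table.
  public <T> JavaRDD<T> track(JavaRDD<T> rdd) {
    cachedRdds.add(rdd);
    return rdd;
  }

  // Called after the commit completes: unpersist only the RDDs tracked for this table.
  public void unpersistAll() {
    for (JavaRDD<?> rdd : cachedRdds) {
      rdd.unpersist();
    }
    cachedRdds.clear();
  }
}
```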
[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit
hudi-bot commented on PR #9157: URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628279986 ## CI report: * d3a74b072d6c1d55f5412946aa4ba75a8c54076a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18434) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1628105901 ## CI report: * debb29aa63e665eed1137973fbe94eee395dc286 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18374) * e9fb2f74b5333e6f8f60de11ea6adce497252648 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18436) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8054: [HUDI-5854]Support flink table column comment
danny0405 commented on code in PR #8054: URL: https://github.com/apache/hudi/pull/8054#discussion_r1257693948 ## hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkDataSourceTableUtils.java: ## @@ -34,13 +35,21 @@ import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY; public class SparkDataSourceTableUtils { + + public static Map getSparkTableProperties(List partitionNames, String sparkVersion, +int schemaLengthThreshold, MessageType schema) { +return getSparkTableProperties(partitionNames, sparkVersion, schemaLengthThreshold, schema, Collections.emptyMap()); + } + /** * Get Spark Sql related table properties. This is used for spark datasource table. - * @param schema The schema to write to the table. + * + * @param schema The schema to write to the table. * @return A new parameters added the spark's table properties. */ public static Map getSparkTableProperties(List partitionNames, String sparkVersion, -int schemaLengthThreshold, MessageType schema) { +int schemaLengthThreshold, MessageType schema, +Map commentMap) { Review Comment: Where does the new method be used? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8054: [HUDI-5854]Support flink table column comment
danny0405 commented on code in PR #8054: URL: https://github.com/apache/hudi/pull/8054#discussion_r1257693688 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HiveSchemaUtils.java: ## @@ -176,45 +177,6 @@ private static DataType toFlinkPrimitiveType(PrimitiveTypeInfo hiveType) { } } - /** - * Create Hive field schemas from Flink table schema including the hoodie metadata fields. - */ - public static List toHiveFieldSchema(TableSchema schema, boolean withOperationField) { -List columns = new ArrayList<>(); Review Comment: Can we modify directly on these methods instead of removing and adding new one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1628092201 ## CI report: * debb29aa63e665eed1137973fbe94eee395dc286 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18374) * e9fb2f74b5333e6f8f60de11ea6adce497252648 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9133: [HUDI-6474] Added support for reading tables evolved using comprehensive schema e…
danny0405 commented on code in PR #9133: URL: https://github.com/apache/hudi/pull/9133#discussion_r1257689584 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/CastMap.java: ## @@ -165,21 +192,132 @@ void add(int pos, LogicalType fromType, LogicalType toType) { } break; } + case ARRAY: { +if (from == ARRAY) { + LogicalType fromElementType = fromType.getChildren().get(0); + LogicalType toElementType = toType.getChildren().get(0); + return array -> doArrayConversion((ArrayData) array, fromElementType, toElementType); +} +break; + } + case MAP: { +if (from == MAP) { + return map -> doMapConversion((MapData) map, fromType, toType); +} +break; + } + case ROW: { +if (from == ROW) { + // Assumption: InternalSchemaManager should produce a cast that is of the same size + return row -> doRowConversion((RowData) row, fromType, toType); +} +break; + } default: } -return null; +throw new IllegalArgumentException(String.format("Unsupported conversion for %s => %s", fromType, toType)); } - private void add(int pos, Cast cast) { -castMap.put(pos, cast); + /** + * Helper function to perform convert an arrayData from one LogicalType to another. + * + * @param arrayNon-null array data to be converted; however array-elements are allowed to be null + * @param fromType The input LogicalType of the row data to be converted from + * @param toType The output LogicalType of the row data to be converted to + * @return Converted array that has the structure/specifications of that defined by the output LogicalType + */ + private static ArrayData doArrayConversion(@Nonnull ArrayData array, LogicalType fromType, LogicalType toType) { +// using Object type here as primitives are not allowed to be null +Object[] objects = new Object[array.size()]; +for (int i = 0; i < array.size(); i++) { + Object fromObject = ArrayData.createElementGetter(fromType).getElementOrNull(array, i); + // need to handle nulls to prevent NullPointerException in #getConversion() + Object toObject = fromObject != null ? getConversion(fromType, toType).apply(fromObject) : null; + objects[i] = toObject; +} +return new GenericArrayData(objects); + } + + /** + * Helper function to perform convert a MapData from one LogicalType to another. + * + * @param map Non-null map data to be converted; however, values are allowed to be null + * @param fromType The input LogicalType of the row data to be converted from + * @param toType The output LogicalType of the row data to be converted to + * @return Converted map that has the structure/specifications of that defined by the output LogicalType + */ + private static MapData doMapConversion(@Nonnull MapData map, LogicalType fromType, LogicalType toType) { +// no schema evolution is allowed on the keyType, hence, we only need to care about the valueType +LogicalType fromValueType = fromType.getChildren().get(1); +LogicalType toValueType = toType.getChildren().get(1); +LogicalType keyType = fromType.getChildren().get(0); + +final Map result = new HashMap<>(); +for (int i = 0; i < map.size(); i++) { + Object keyObject = ArrayData.createElementGetter(keyType).getElementOrNull(map.keyArray(), i); + Object fromObject = ArrayData.createElementGetter(fromValueType).getElementOrNull(map.valueArray(), i); + // need to handle nulls to prevent NullPointerException in #getConversion() + Object toObject = fromObject != null ? 
getConversion(fromValueType, toValueType).apply(fromObject) : null; + result.put(keyObject, toObject); +} +return new GenericMapData(result); + } + + /** + * Helper function to perform convert a RowData from one LogicalType to another. + * + * @param row Non-null row data to be converted; however, fields might contain nulls + * @param fromType The input LogicalType of the row data to be converted from + * @param toType The output LogicalType of the row data to be converted to + * @return Converted row that has the structure/specifications of that defined by the output LogicalType + */ + private static RowData doRowConversion(@Nonnull RowData row, LogicalType fromType, LogicalType toType) { +// note: InternalSchema.merge guarantees that the schema to be read fromType is orientated in the same order as toType +// hence, we can match types by position as it is guaranteed that it is referencing the same field +List fromChildren = fromType.getChildren(); +List toChildren = toType.getChildren(); +ValidationUtils.checkArgument(fromChildren.size() == toChildren.size(), +"fromType [" + fromType + "] size: != toType [" + toType + "] size"); + +
[GitHub] [hudi] danny0405 commented on a diff in pull request #9133: [HUDI-6474] Added support for reading tables evolved using comprehensive schema e…
danny0405 commented on code in PR #9133: URL: https://github.com/apache/hudi/pull/9133#discussion_r1255288699 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/CastMap.java: ## @@ -165,21 +192,132 @@ void add(int pos, LogicalType fromType, LogicalType toType) { } break; } + case ARRAY: { +if (from == ARRAY) { + LogicalType fromElementType = fromType.getChildren().get(0); + LogicalType toElementType = toType.getChildren().get(0); + return array -> doArrayConversion((ArrayData) array, fromElementType, toElementType); +} +break; + } + case MAP: { +if (from == MAP) { + return map -> doMapConversion((MapData) map, fromType, toType); +} +break; + } + case ROW: { +if (from == ROW) { + // Assumption: InternalSchemaManager should produce a cast that is of the same size + return row -> doRowConversion((RowData) row, fromType, toType); +} +break; + } default: } -return null; +throw new IllegalArgumentException(String.format("Unsupported conversion for %s => %s", fromType, toType)); Review Comment: Do not throw `RuntimeException` in nested calling code path, it is very obscure for the invoker to get the perception of exceptions. Either throws a checked exception or return null as of before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
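As a rough illustration of the review suggestion (names are made up and this is not the actual CastMap code): report "no conversion available" through the return value so the top-level invoker decides how to fail, instead of throwing an unchecked exception deep inside the nested call path.

```java
import java.util.function.Function;

class ConversionResolverSketch {

  // Returns null when no conversion is known, as the pre-patch code did.
  static Function<Object, Object> getConversionOrNull(String fromType, String toType) {
    if (fromType.equals(toType)) {
      return Function.identity();
    }
    return null; // unsupported conversion: let the invoker decide whether this is fatal
  }

  // The invoker handles the unsupported case explicitly, where a meaningful error can be raised.
  static Object convert(Object value, String fromType, String toType) {
    Function<Object, Object> cast = getConversionOrNull(fromType, toType);
    if (cast == null) {
      throw new IllegalArgumentException("Unsupported conversion " + fromType + " => " + toType);
    }
    return cast.apply(value);
  }
}
```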
[GitHub] [hudi] danny0405 commented on pull request #9118: [HUDI-2141] Support flink write metrics
danny0405 commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1628084953 Thanks for the contribution @stream2000. Can we add a table listing the metrics we support here, illustrating the usage of each metric, so that it is clear whether each one deserves to be introduced? Another metric I got from feedback is the compaction progress metric. Do we have a plan to support that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
danny0405 commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628072273 > because we can make sure that the partition records are sorted. In the near future, non-sorted partition code may be introduced, so I think we can keep it for extensibility for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9126: [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy
danny0405 commented on code in PR #9126: URL: https://github.com/apache/hudi/pull/9126#discussion_r1257684369 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedAndBoundedIOCompactionStrategy.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table.action.compact.strategy; + +import org.apache.hudi.avro.model.HoodieCompactionOperation; +import org.apache.hudi.avro.model.HoodieCompactionPlan; +import org.apache.hudi.config.HoodieWriteConfig; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class DayBasedAndBoundedIOCompactionStrategy extends DayBasedCompactionStrategy { + + protected static Comparator logSizeComparator = (op1, op2) -> { +Long logSize1 = op1.getMetrics().get(TOTAL_LOG_FILE_SIZE).longValue(); Review Comment: private static final Comparator LOG_SIZE_COMPARATOR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
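A sketch of the constant the review asks for; the element type and metric key are inferred from the surrounding diff, so treat the stand-in interface and key below as assumptions rather than the real Hudi classes.

```java
import java.util.Comparator;
import java.util.Map;

class LogSizeComparatorSketch {

  static final String TOTAL_LOG_FILE_SIZE = "TOTAL_LOG_FILE_SIZE"; // placeholder for the real metric key

  // Stand-in for org.apache.hudi.avro.model.HoodieCompactionOperation, reduced to what the sketch needs.
  interface CompactionOperation {
    Map<String, Double> getMetrics();
  }

  // The suggested shape: an upper-cased private static final constant instead of a mutable protected static field.
  private static final Comparator<CompactionOperation> LOG_SIZE_COMPARATOR =
      Comparator.comparingLong(op -> op.getMetrics().get(TOTAL_LOG_FILE_SIZE).longValue());
}
```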
[GitHub] [hudi] danny0405 commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource
danny0405 commented on issue #9143: URL: https://github.com/apache/hudi/issues/9143#issuecomment-1628055348 cc @ad1happy2go seems this issue has been fixed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] stream2000 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
stream2000 commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628043527 > [6515.patch.zip](https://github.com/apache/hudi/files/11997469/6515.patch.zip) Nice catch, does this fix make sense to you? Yes, this fix is OK and requires fewer modifications. I made more changes because we can make sure that the partition records are sorted. I'm OK with both approaches, what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
danny0405 commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1628023795 [6515.patch.zip](https://github.com/apache/hudi/files/11997469/6515.patch.zip) Nice catch, does this fix make sense to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6316] Adding corrupted and rollback log blocks metrics (#8881)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new a0ca3eadba5 [HUDI-6316] Adding corrupted and rollback log blocks metrics (#8881) a0ca3eadba5 is described below commit a0ca3eadba5a041aaaf97d040adffc017cfdd285 Author: Sivabalan Narayanan AuthorDate: Sun Jul 9 23:10:43 2023 -0400 [HUDI-6316] Adding corrupted and rollback log blocks metrics (#8881) Adding log block metrics to track corrupted lock blocks and rollback blocks. Users need to enable `hoodie.metrics.compaction.log.blocks.on` to enable the metrics. --- .../org/apache/hudi/config/HoodieWriteConfig.java | 7 ++ .../hudi/config/metrics/HoodieMetricsConfig.java | 11 +++ .../org/apache/hudi/metrics/HoodieMetrics.java | 79 ++ .../org/apache/hudi/metrics/TestHoodieMetrics.java | 41 ++- .../hudi/common/model/HoodieCommitMetadata.java| 20 ++ 5 files changed, 113 insertions(+), 45 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java index 93105491180..1390408d901 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java @@ -2088,6 +2088,13 @@ public class HoodieWriteConfig extends HoodieConfig { return getBoolean(HoodieMetricsConfig.TURN_METRICS_ON); } + /** + * metrics properties. + */ + public boolean isCompactionLogBlockMetricsOn() { +return getBoolean(HoodieMetricsConfig.TURN_METRICS_COMPACTION_LOG_BLOCKS_ON); + } + public boolean isExecutorMetricsEnabled() { return Boolean.parseBoolean( getStringOrDefault(HoodieMetricsConfig.EXECUTOR_METRICS_ENABLE, "false")); diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java index 9fe9b33a546..e1d0afeb6fa 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java @@ -106,6 +106,12 @@ public class HoodieMetricsConfig extends HoodieConfig { .sinceVersion("0.14.0") .withDocumentation("Comma separated list of config file paths for metric exporter configs"); + public static final ConfigProperty TURN_METRICS_COMPACTION_LOG_BLOCKS_ON = ConfigProperty + .key(METRIC_PREFIX + "compaction.log.blocks.on") + .defaultValue(false) + .sinceVersion("0.14.0") + .withDocumentation("Turn on/off metrics reporting for log blocks with compaction commit. 
off by default."); + /** * @deprecated Use {@link #TURN_METRICS_ON} and its methods instead */ @@ -171,6 +177,11 @@ public class HoodieMetricsConfig extends HoodieConfig { return this; } +public Builder compactionLogBlocksEnable(boolean compactionLogBlockMetricsEnable) { + hoodieMetricsConfig.setValue(TURN_METRICS_COMPACTION_LOG_BLOCKS_ON, String.valueOf(compactionLogBlockMetricsEnable)); + return this; +} + public Builder withReporterType(String reporterType) { hoodieMetricsConfig.setValue(METRICS_REPORTER_TYPE_VALUE, reporterType); return this; diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java index dac680a5c40..792d0cd0844 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java @@ -37,6 +37,23 @@ public class HoodieMetrics { private static final Logger LOG = LoggerFactory.getLogger(HoodieMetrics.class); + public static final String TOTAL_PARTITIONS_WRITTEN_STR = "totalPartitionsWritten"; + public static final String TOTAL_FILES_INSERT_STR = "totalFilesInsert"; + public static final String TOTAL_FILES_UPDATE_STR = "totalFilesUpdate"; + public static final String TOTAL_RECORDS_WRITTEN_STR = "totalRecordsWritten"; + public static final String TOTAL_UPDATE_RECORDS_WRITTEN_STR = "totalUpdateRecordsWritten"; + public static final String TOTAL_INSERT_RECORDS_WRITTEN_STR = "totalInsertRecordsWritten"; + public static final String TOTAL_BYTES_WRITTEN_STR = "totalBytesWritten"; + public static final String TOTAL_SCAN_TIME_STR = "totalScanTime"; + public static final String TOTAL_CREATE_TIME_STR = "totalCreateTime"; + public static final String
[GitHub] [hudi] codope merged pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics
codope merged PR #8881: URL: https://github.com/apache/hudi/pull/8881 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit
hudi-bot commented on PR #9157: URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628014896 ## CI report: * d3a74b072d6c1d55f5412946aa4ba75a8c54076a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18434) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #9138: [SUPPORT]Read Hive then write Hudi,but org.apache.hadoop.hive.ql.metadata.AuthorizationException: No privilege 'Update' found for outputs
danny0405 commented on issue #9138: URL: https://github.com/apache/hudi/issues/9138#issuecomment-1628004963 Seems like a privilege issue; did you try to grant the permission with a cmd like: `grant xxx_action on database xxx_database to user xxx_user;` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9157: [HUDI-6246] Adding tests to validate restore of compaction commit
hudi-bot commented on PR #9157: URL: https://github.com/apache/hudi/pull/9157#issuecomment-1628004052 ## CI report: * d3a74b072d6c1d55f5412946aa4ba75a8c54076a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1628003363 ## CI report: * 1fc455198964e104536b807792dde341c960de88 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18433) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #9154: [SUPPORT] org.apache.hudi.exception.HoodieIOException: Unable to create
danny0405 commented on issue #9154: URL: https://github.com/apache/hudi/issues/9154#issuecomment-1628001354 Did you configure the spillable map path explicitly? Wondering if it is related to the disk space or something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
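For reference, a minimal sketch of setting the spillable map path explicitly so spill files land on a volume with enough free space; the key name is taken from recent Hudi releases and should be verified against the version in use.

```java
import java.util.Properties;

public class SpillableMapPathExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Redirect spill files away from the default /tmp location to a disk with more space.
    props.setProperty("hoodie.memory.spillable.map.path", "/data/hudi-spill");
    System.out.println(props);
  }
}
```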
[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics
hudi-bot commented on PR #8881: URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627996896 ## CI report: * aa126962099394261801fe5fbf679bd192033f80 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18431) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257660429 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter { if (mergedParams.contains(PRECOMBINE_FIELD.key())) { mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key())) } +if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key()) + && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) { + // enable merge allow duplicates when operation type is insert + mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true") Review Comment: Yeah, kind of obscure from the first sight, why not just put the default value `hoodie.merge.allow.duplicate.on.inserts` as true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9153: [HUDI-6512] Check whether exists write error before commit compaction
danny0405 commented on code in PR #9153: URL: https://github.com/apache/hudi/pull/9153#discussion_r1257656038 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java: ## @@ -291,6 +287,13 @@ private void completeClustering(HoodieReplaceCommitMetadata metadata, LOG.info("Clustering successfully on commit " + clusteringCommitTime); } + private void handleWriteErrors(List writeStats, TableServiceType tableServiceType) { +if (writeStats.stream().mapToLong(HoodieWriteStat::getTotalWriteErrors).sum() > 0) { + throw new HoodieClusteringException(tableServiceType + " failed to write to files:" + + writeStats.stream().filter(s -> s.getTotalWriteErrors() > 0L).map(HoodieWriteStat::getFileId).collect(Collectors.joining(","))); Review Comment: wondering whether we need a config option to control the behavior? If all the exceptions are resolvable, it's okay we throw exceptions directly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan opened a new pull request, #9157: [HUDI-6246] Adding tests to validate restore of compaction commit
nsivabalan opened a new pull request, #9157: URL: https://github.com/apache/hudi/pull/9157 ### Change Logs Adding tests to validate restore of compaction commit ### Impact Adding tests to validate restore of compaction commit ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627893231 ## CI report: * beb523cca98b4b62964c485e464d5eb1ce6e25a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18432) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1627870285 ## CI report: * 6250cd0f2bfe2ba9c3b3053940f2a75be78c2f98 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17041) * 1fc455198964e104536b807792dde341c960de88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18433) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1627867274 ## CI report: * 6250cd0f2bfe2ba9c3b3053940f2a75be78c2f98 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17041) * 1fc455198964e104536b807792dde341c960de88 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
nsivabalan commented on code in PR #8697: URL: https://github.com/apache/hudi/pull/8697#discussion_r1257563336 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -142,24 +144,27 @@ trait ProvidesHoodieConfig extends Logging { // we'd prefer that value over auto-deduced operation. Otherwise, we deduce target operation type val operationOverride = combinedOpts.get(DataSourceWriteOptions.OPERATION.key) val operation = operationOverride.getOrElse { - (enableBulkInsert, isOverwritePartition, isOverwriteTable, dropDuplicate, isNonStrictMode, isPartitionedTable) match { -case (true, _, _, _, false, _) => + (enableBulkInsert, isOverwritePartition, isOverwriteTable, dropDuplicate, isNonStrictMode, isPartitionedTable, + autoGenerateRecordKeys) match { +case (true, _, _, _, false, _, _) => throw new IllegalArgumentException(s"Table with primaryKey can not use bulk insert in ${insertMode.value()} mode.") -case (true, true, _, _, _, true) => +case (true, true, _, _, _, true, _) => throw new IllegalArgumentException(s"Insert Overwrite Partition can not use bulk insert.") -case (true, _, _, true, _, _) => +case (true, _, _, true, _, _, _) => throw new IllegalArgumentException(s"Bulk insert cannot support drop duplication." + s" Please disable $INSERT_DROP_DUPS and try again.") // if enableBulkInsert is true, use bulk insert for the insert overwrite non-partitioned table. -case (true, false, true, _, _, false) => BULK_INSERT_OPERATION_OPT_VAL +case (true, false, true, _, _, false, _) => BULK_INSERT_OPERATION_OPT_VAL // insert overwrite table -case (false, false, true, _, _, _) => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL +case (false, false, true, _, _, _, _) => INSERT_OVERWRITE_TABLE_OPERATION_OPT_VAL // insert overwrite partition -case (_, true, false, _, _, true) => INSERT_OVERWRITE_OPERATION_OPT_VAL +case (_, true, false, _, _, true, _) => INSERT_OVERWRITE_OPERATION_OPT_VAL // disable dropDuplicate, and provide preCombineKey, use the upsert operation for strict and upsert mode. -case (false, false, false, false, false, _) if hasPrecombineColumn => UPSERT_OPERATION_OPT_VAL +case (false, false, false, false, false, _, _) if hasPrecombineColumn => UPSERT_OPERATION_OPT_VAL // if table is pk table and has enableBulkInsert use bulk insert for non-strict mode. -case (true, _, _, _, true, _) => BULK_INSERT_OPERATION_OPT_VAL +case (true, _, _, _, true, _, _) => BULK_INSERT_OPERATION_OPT_VAL +// if auto record key generation is enabled, use bulk_insert +case (_, _, _, _, _, true, true) => BULK_INSERT_OPERATION_OPT_VAL Review Comment: fixed it. ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -402,6 +389,40 @@ object HoodieSparkSqlWriter { } } + def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : Map[String, String]): WriteOperationType = { +var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION)) +// TODO clean up Review Comment: nope. I just copied the comments over. 
not really sure ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -402,6 +389,40 @@ object HoodieSparkSqlWriter { } } + def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : Map[String, String]): WriteOperationType = { +var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION)) +// TODO clean up +// It does not make sense to allow upsert() operation if INSERT_DROP_DUPS is true +// Auto-correct the operation to "insert" if OPERATION is set to "upsert" wrongly +// or not set (in which case it will be set as "upsert" by parametersWithWriteDefaults()) . +if (hoodieConfig.getBoolean(INSERT_DROP_DUPS) && + operation == WriteOperationType.UPSERT) { + + log.warn(s"$UPSERT_OPERATION_OPT_VAL is not applicable " + +s"when $INSERT_DROP_DUPS is set to be true, " + +s"overriding the $OPERATION to be $INSERT_OPERATION_OPT_VAL") + + operation = WriteOperationType.INSERT +} + +// if no record key, no preCombine and no explicit partition path is set, we should treat it as append only workload Review Comment: We are being conservative here. The major motivation is to target users coming from a parquet table. They would be writing to the parquet table as df.write.format("parquet").save(basePath). When they want to use Hudi, the table name is mandatory, but everything else is optional. So, all they need to do is,
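As a rough illustration of the scenario described in the (truncated) comment above — a user moving from plain parquet writes to Hudi with only a table name supplied — here is a hedged sketch; the table name and base path are hypothetical and every other option is left to the new defaults.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class AppendOnlyWriteExample {

  // Mirrors df.write.format("parquet").save(basePath), but targets Hudi with just a table name.
  public static void writeHudi(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "my_table") // hypothetical table name; the only required option
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```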
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627855511 ## CI report: * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325) * beb523cca98b4b62964c485e464d5eb1ce6e25a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18432) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics
hudi-bot commented on PR #8881: URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627855419 ## CI report: * 7b9d3ca6b52733dc531f54d6bf21ca449cc3254f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17592) * aa126962099394261801fe5fbf679bd192033f80 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18431) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627853487 ## CI report: * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325) * beb523cca98b4b62964c485e464d5eb1ce6e25a2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics
hudi-bot commented on PR #8881: URL: https://github.com/apache/hudi/pull/8881#issuecomment-162785 ## CI report: * 7b9d3ca6b52733dc531f54d6bf21ca449cc3254f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17592) * aa126962099394261801fe5fbf679bd192033f80 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #8881: [HUDI-6316] Adding corrupted and rollback log blocks metrics
nsivabalan commented on PR #8881: URL: https://github.com/apache/hudi/pull/8881#issuecomment-1627851978 @codope : done. addressed the feedback -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1627844217 hey @danny0405 @codope : Updated the patch. rebased w/ latest master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257549916 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -187,29 +218,38 @@ trait ProvidesHoodieConfig extends Logging { val insertModeOpt = combinedOpts.get(SQL_INSERT_MODE.key) val insertModeSet = insertModeOpt.nonEmpty +val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key()) +val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty +val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue()) +val insertDupPolicy = combinedOpts.getOrElse(INSERT_DUP_POLICY.key(), INSERT_DUP_POLICY.defaultValue()) val insertMode = InsertMode.of(insertModeOpt.getOrElse(SQL_INSERT_MODE.defaultValue())) val isNonStrictMode = insertMode == InsertMode.NON_STRICT val isPartitionedTable = hoodieCatalogTable.partitionFields.nonEmpty val combineBeforeInsert = hoodieCatalogTable.preCombineKey.nonEmpty && hoodieCatalogTable.primaryKeys.nonEmpty -// NOTE: Target operation could be overridden by the user, therefore if it has been provided as an input -// we'd prefer that value over auto-deduced operation. Otherwise, we deduce target operation type +// try to use sql write operation instead of legacy insert mode. If only insert mode is explicitly specified, we will uze +// o +val useLegacyInsertModeFlow = insertModeSet && !sqlWriteOperationSet val operation = combinedOpts.getOrElse(OPERATION.key, + if (useLegacyInsertModeFlow) { +// NOTE: Target operation could be overridden by the user, therefore if it has been provided as an input +// we'd prefer that value over auto-deduced operation. Otherwise, we deduce target operation type deduceWriteOperationForInsertInfo(isPartitionedTable, isOverwritePartition, isOverwriteTable, insertModeSet, dropDuplicate, -enableBulkInsert, isInsertInto, isNonStrictMode, combineBeforeInsert)) +enableBulkInsert, isInsertInto, isNonStrictMode, combineBeforeInsert) + } else { +deduceSqlWriteOperation(isOverwritePartition, isOverwriteTable, sqlWriteOperation) + } +) -val payloadClassName = if (operation == UPSERT_OPERATION_OPT_VAL && - tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) { - // Validate duplicate key for COW, for MOR it will do the merge with the DefaultHoodieRecordPayload - // on reading. - // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when SparkRecordMerger is default - classOf[ValidateDuplicateKeyPayload].getCanonicalName -} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == COW_TABLE_TYPE_OPT_VAL && - insertMode == InsertMode.STRICT){ - // Validate duplicate key for inserts to COW table when using strict insert mode. - classOf[ValidateDuplicateKeyPayload].getCanonicalName +val payloadClassName = if (useLegacyInsertModeFlow) { + deducePayloadClassNameLegacy(operation, tableType, insertMode) } else { - classOf[OverwriteWithLatestAvroPayload].getCanonicalName + // should we also consider old way of doing things. Review Comment: this is already taken care in deducePayloadClassNameLegacy, none of the downstream methods do anything differently. Its only used to deduce the payload class -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257548654 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter { if (mergedParams.contains(PRECOMBINE_FIELD.key())) { mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key())) } +if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key()) + && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) { + // enable merge allow duplicates when operation type is insert + mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true") Review Comment: This is not de-dup; this is actually achieving what you are claiming, Danny. I.e., if you ingest (RK1, val1) in commit1 and (RK1, val2) in commit2 with the insert operation type, the snapshot will return both values only when you set MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE to true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257546507 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging { } } + private def deducePayloadClassNameLegacy(operation: String, tableType: String, insertMode: InsertMode): String = { +if (operation == UPSERT_OPERATION_OPT_VAL && + tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) { + // Validate duplicate key for COW, for MOR it will do the merge with the DefaultHoodieRecordPayload + // on reading. + // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when SparkRecordMerger is default + classOf[ValidateDuplicateKeyPayload].getCanonicalName +} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == COW_TABLE_TYPE_OPT_VAL && + insertMode == InsertMode.STRICT){ + // Validate duplicate key for inserts to COW table when using strict insert mode. + classOf[ValidateDuplicateKeyPayload].getCanonicalName +} else { + classOf[OverwriteWithLatestAvroPayload].getCanonicalName +} Review Comment: yes, thats why I also did not switch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257546160 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -187,29 +218,38 @@ trait ProvidesHoodieConfig extends Logging { val insertModeOpt = combinedOpts.get(SQL_INSERT_MODE.key) val insertModeSet = insertModeOpt.nonEmpty +val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key()) +val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty +val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue()) +val insertDupPolicy = combinedOpts.getOrElse(INSERT_DUP_POLICY.key(), INSERT_DUP_POLICY.defaultValue()) val insertMode = InsertMode.of(insertModeOpt.getOrElse(SQL_INSERT_MODE.defaultValue())) val isNonStrictMode = insertMode == InsertMode.NON_STRICT val isPartitionedTable = hoodieCatalogTable.partitionFields.nonEmpty val combineBeforeInsert = hoodieCatalogTable.preCombineKey.nonEmpty && hoodieCatalogTable.primaryKeys.nonEmpty -// NOTE: Target operation could be overridden by the user, therefore if it has been provided as an input -// we'd prefer that value over auto-deduced operation. Otherwise, we deduce target operation type +// try to use sql write operation instead of legacy insert mode. If only insert mode is explicitly specified, we will uze +// o +val useLegacyInsertModeFlow = insertModeSet && !sqlWriteOperationSet Review Comment: Nope, we should call it out in our docs that with Spark SQL INSERT_INTO, users are expected to set this new config. If DataSourceWriteOptions.OPERATION is set, it may not be honored. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1257545983 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ## @@ -514,6 +520,29 @@ object DataSourceWriteOptions { val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = HoodieCommonConfig.RECONCILE_SCHEMA + val SQL_WRITE_OPERATION: ConfigProperty[String] = ConfigProperty +.key("hoodie.sql.write.operation") +.defaultValue("insert") Review Comment: this is mainly used w/ INSERT_INTO. So, delete does not make sense. ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ## @@ -514,6 +520,29 @@ object DataSourceWriteOptions { val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = HoodieCommonConfig.RECONCILE_SCHEMA + val SQL_WRITE_OPERATION: ConfigProperty[String] = ConfigProperty +.key("hoodie.sql.write.operation") +.defaultValue("insert") +.withDocumentation("Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, " + + "insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not " + + "do small file managmeent. If you prefer hudi to automatically managee small files, then you can go with \"insert\". There is no precombine " + + "(if there are duplicates within the same batch being ingested, same dups will be ingested) with bulk_insert and insert and there is no index " + + "look up as well. If you may use INSERT_INTO for mutable dataset, then you may have to set this config value to \"upsert\". With upsert, you will " + + "get both precombine and updates to existing records on storage is also honored. If not, you may see duplicates. ") + + val NONE_INSERT_DUP_POLICY = "none" + val DROP_INSERT_DUP_POLICY = "drop" + val FAIL_INSERT_DUP_POLICY = "fail" + + val INSERT_DUP_POLICY: ConfigProperty[String] = ConfigProperty +.key("hoodie.datasource.insert.dup.policy") +.defaultValue(NONE_INSERT_DUP_POLICY) Review Comment: sure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
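To make the intended usage of the two new configs concrete, a sketch of driving them from Spark SQL; the key names and allowed values come from the diff above, while the session handling and the table itself are hypothetical.

```java
import org.apache.spark.sql.SparkSession;

public class SqlWriteOperationExample {

  public static void run(SparkSession spark) {
    // Pick the operation INSERT INTO should map to: bulk_insert, insert or upsert.
    spark.sql("set hoodie.sql.write.operation = upsert");
    // Pick the duplicate-handling policy for inserts: none, drop or fail.
    spark.sql("set hoodie.datasource.insert.dup.policy = drop");
    spark.sql("insert into hudi_tbl values (1, 'a1', 10.0, 1000)"); // hypothetical table and schema
  }
}
```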
[hudi] branch master updated: [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index (#9055)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 6d3311c34c8 [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index (#9055) 6d3311c34c8 is described below commit 6d3311c34c8548339aee8c3708fe50855a04afb8 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Mon Jul 10 01:24:44 2023 +0800 [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index (#9055) - Support delete_partition, insert_overwrite and insert_overwrite_table with record-level index. The metadata records should be updated accordingly. all records in the deleted partition(s) should be deleted from RLI (for delete_partition operation) newly inserted records should be present in RLI old records in the affected partitions should be removed from RLI old records that happen to have the same record key as the new inserts won't be removed from RLI; their entries will be updated - Co-authored-by: sivabalan --- .../metadata/HoodieBackedTableMetadataWriter.java | 65 ++ .../hudi/functional/TestRecordLevelIndex.scala | 28 +- 2 files changed, 80 insertions(+), 13 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java index 97d1ce5e8b2..df73145a1bb 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java @@ -42,7 +42,9 @@ import org.apache.hudi.common.model.HoodiePartitionMetadata; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.model.HoodieRecordDelegate; import org.apache.hudi.common.model.HoodieRecordLocation; +import org.apache.hudi.common.model.HoodieReplaceCommitMetadata; import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.model.WriteOperationType; import org.apache.hudi.common.table.HoodieTableMetaClient; import org.apache.hudi.common.table.log.HoodieLogFormat; import org.apache.hudi.common.table.log.block.HoodieDeleteBlock; @@ -477,7 +479,7 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta + partitions.size() + " partitions"); // Collect record keys from the files in parallel -HoodieData records = readRecordKeysFromBaseFiles(engineContext, partitionBaseFilePairs); +HoodieData records = readRecordKeysFromBaseFiles(engineContext, partitionBaseFilePairs, false); records.persist("MEMORY_AND_DISK_SER"); final long recordCount = records.count(); @@ -495,7 +497,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta * Read the record keys from base files in partitions and return records. 
*/ private HoodieData readRecordKeysFromBaseFiles(HoodieEngineContext engineContext, - List> partitionBaseFilePairs) { + List> partitionBaseFilePairs, + boolean forDelete) { if (partitionBaseFilePairs.isEmpty()) { return engineContext.emptyHoodieData(); } @@ -524,8 +527,9 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta @Override public HoodieRecord next() { - return HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), partition, fileId, - instantTime); + return forDelete + ? HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next()) + : HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), partition, fileId, instantTime); } }; }); @@ -872,9 +876,10 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta // Updates for record index are created by parsing the WriteStatus which is a hudi-client object. Hence, we cannot yet move this code // to the HoodieTableMetadataUtil class in hudi-common. - if (writeStatus != null && !writeStatus.isEmpty()) { -partitionToRecordMap.put(MetadataPartitionType.RECORD_INDEX, getRecordIndexUpdates(writeStatus)); - } + HoodieData updatesFromWriteStatuses = getRecordIndexUpdates(writeStatus); + HoodieData additionalUpdates = getRecordIndexAdditionalUpdates(updatesFromWriteStatuses, commitMetadata); + partitionToRecordMap.put(MetadataPartitionType.RECORD_INDEX,
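As a usage-level illustration of the commit above, the following hedged Spark snippet shows how a writer might enable the record-level index and issue one of the newly supported operations; the option keys and the `insert_overwrite` value are assumptions drawn from general Hudi configuration rather than copied from this commit, and `df`, the table name and the path are placeholders.

```scala
// Hedged sketch: enable the metadata table with the record-level index (RLI)
// and run an insert_overwrite, one of the operations the commit above extends
// RLI support to. Assumes `df` is an existing DataFrame.
df.write.format("hudi")
  .option("hoodie.table.name", "rli_demo")                        // hypothetical table
  .option("hoodie.datasource.write.recordkey.field", "event_id")  // hypothetical key field
  .option("hoodie.datasource.write.partitionpath.field", "part")  // hypothetical partition field
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.record.index.enable", "true")
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .mode("append")
  .save("/tmp/rli_demo")
```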
[GitHub] [hudi] nsivabalan merged pull request #9055: [HUDI-6443] Support delete_partition, insert_overwrite/table with record-level index
nsivabalan merged PR #9055: URL: https://github.com/apache/hudi/pull/9055 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] huyuanfeng2018 commented on issue #8265: [SUPPORT] Flink Table planner not loading problem
huyuanfeng2018 commented on issue #8265: URL: https://github.com/apache/hudi/issues/8265#issuecomment-1627766770 > > flink-table-planner has been removed from the classpath since flink1.15 > > Then how the classes in flink-table-planner is loaded then, can we do similiar loadings in Hudi side? This logic is executed when the Flink graph is built, not in the JobManager (except in application mode). The classpath running the Hudi-related logic does not load flink-table-planner by default, because since Flink 1.15 flink-table-planner-loader dynamically loads flink-table-planner at runtime (the planner is not necessarily bound to Blink). Hudi, however, binds to Blink's sort implementation in its code, so I think Hudi may need to try to copy the SortCodeGenerator-related logic (I am not sure) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
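To make the classloading point above concrete, here is a small, hedged check (not Hudi code) for whether the planner class in question is visible on the current classpath; the fully qualified class name is an assumption based on the layout of Flink's table-planner module.

```scala
// Hedged illustration: with flink-table-planner-loader (the default since Flink
// 1.15), planner internals live in a separate classloader, so a direct
// reference from user or connector code can fail with ClassNotFoundException.
val plannerClass = "org.apache.flink.table.planner.codegen.sort.SortCodeGenerator"
val visible =
  try { Class.forName(plannerClass); true }
  catch { case _: ClassNotFoundException => false }
println(s"$plannerClass visible on this classpath: $visible")
```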
[GitHub] [hudi] jlloh commented on issue #8191: [BUG] Unable to execute HTTP request | connection timeout issues
jlloh commented on issue #8191: URL: https://github.com/apache/hudi/issues/8191#issuecomment-1627759070 Seeing something similar for Flink 1.16 with Hudi 0.13.1, COW insert with metadata enabled. Problem seems to occur ~4 hours after the job has been running. The job is an inline clustering job. After disabling metadata, the job is able to proceed. Configurations: ``` "table.table": "COPY_ON_WRITE" "write.operation": "insert" "write.insert.cluster": "true" "hoodie.datasource.write.hive_style_partitioning": "true" "metadata.enabled": "true" "hoodie.datasource.write.hive_style_partitioning": "true" "hoodie.parquet.max.file.size": "104857600" "hoodie.parquet.small.file.limit": "20971520" "clustering.plan.strategy.small.file.limit": "100" ``` Files: ~211 parquet files per partition across 4 hourly partitions when the issue started happening and the job failed to continue. The bucket assigner task is the one that hits this error. I have tried both hourly and daily partitions but both jobs seem to eventually fail and not able to recover with metadata enabled. Full stacktrace: ``` org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve files in partition s3a:///folder_name/local_year=2023/local_month=07/local_day=08 from metadata at org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:152) at org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:69) at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$16(AbstractTableFileSystemView.java:432) at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:423) at org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestBaseFilesBeforeOrOn(AbstractTableFileSystemView.java:660) at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:104) at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:145) at org.apache.hudi.sink.partitioner.profile.WriteProfile.smallFilesProfile(WriteProfile.java:208) at org.apache.hudi.sink.partitioner.profile.WriteProfile.getSmallFiles(WriteProfile.java:191) at org.apache.hudi.sink.partitioner.BucketAssigner.getSmallFileAssign(BucketAssigner.java:179) at org.apache.hudi.sink.partitioner.BucketAssigner.addInsert(BucketAssigner.java:137) at org.apache.hudi.sink.partitioner.BucketAssignFunction.getNewRecordLocation(BucketAssignFunction.java:215) at org.apache.hudi.sink.partitioner.BucketAssignFunction.processRecord(BucketAssignFunction.java:200) at org.apache.hudi.sink.partitioner.BucketAssignFunction.processElement(BucketAssignFunction.java:162) at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935) at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:374) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223) at
[jira] [Updated] (HUDI-6515) Bucket index row writer write record to wrong handle
[ https://issues.apache.org/jira/browse/HUDI-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6515: - Labels: pull-request-available (was: ) > Bucket index row writer write record to wrong handle > > > Key: HUDI-6515 > URL: https://issues.apache.org/jira/browse/HUDI-6515 > Project: Apache Hudi > Issue Type: Bug >Reporter: Qijun Fu >Assignee: Qijun Fu >Priority: Critical > Labels: pull-request-available > > Bucket index row writer write record to wrong handle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
hudi-bot commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627738046 ## CI report: * 4416219014959fe02290bf41695abb70c31c1715 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18429) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-5759) Hudi do not support add column on mor table with log
[ https://issues.apache.org/jira/browse/HUDI-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qijun Fu closed HUDI-5759. -- Fix Version/s: 0.13.0 Resolution: Fixed > Hudi do not support add column on mor table with log > > > Key: HUDI-5759 > URL: https://issues.apache.org/jira/browse/HUDI-5759 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: Qijun Fu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0, 0.13.0 > > > We test the following sqls in the latest master branch > ```sql > create table h0 ( > id int, > name string, > price double, > ts long > ) using hudi > options ( > primaryKey ='id', > type = 'mor', > preCombineField = 'ts' > ) > partitioned by(ts) > location '/tmp/h0'; > insert into h0 select 1, 'a1', 10, 1000; > update h0 set price = 20 where id = 1; > alter table h0 add column new_col1 int; > update h0 set price = 22 where id = 1; > select * from h0; > ``` > And found that we can't read the table after add column and update. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6336) Support TimelineBased Checkpoint Metadata for flink
[ https://issues.apache.org/jira/browse/HUDI-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qijun Fu reassigned HUDI-6336: -- Assignee: Qijun Fu > Support TimelineBased Checkpoint Metadata for flink > --- > > Key: HUDI-6336 > URL: https://issues.apache.org/jira/browse/HUDI-6336 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Qijun Fu >Assignee: Qijun Fu >Priority: Major > > To reduce oss list qps in flink write, we can use timeline server to serve > the ckp metadata request -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6337) Incremental Clean skip fetch commit metadata for append mode
[ https://issues.apache.org/jira/browse/HUDI-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qijun Fu closed HUDI-6337. -- Resolution: Won't Fix > Incremental Clean skip fetch commit metadata for append mode > - > > Key: HUDI-6337 > URL: https://issues.apache.org/jira/browse/HUDI-6337 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning >Reporter: Qijun Fu >Priority: Major > Labels: pull-request-available > > Incremental Clean skip fetch commit metadata for append mode. > > In append mode we don't need to clean the data file at all because there is > always only one version for one file group. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6515) Bucket index row writer write record to wrong handle
[ https://issues.apache.org/jira/browse/HUDI-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qijun Fu reassigned HUDI-6515: -- Assignee: Qijun Fu > Bucket index row writer write record to wrong handle > > > Key: HUDI-6515 > URL: https://issues.apache.org/jira/browse/HUDI-6515 > Project: Apache Hudi > Issue Type: Bug >Reporter: Qijun Fu >Assignee: Qijun Fu >Priority: Critical > > Bucket index row writer write record to wrong handle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle
hudi-bot commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627697564 ## CI report: * 4416219014959fe02290bf41695abb70c31c1715 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18429) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle
hudi-bot commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1627695267 ## CI report: * 4416219014959fe02290bf41695abb70c31c1715 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] stream2000 commented on a diff in pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle
stream2000 commented on code in PR #9156: URL: https://github.com/apache/hudi/pull/9156#discussion_r1257466200 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala: ## @@ -1247,6 +1247,27 @@ class TestInsertTable extends HoodieSparkSqlTestBase { Seq(3, "a3,3", 30.0, 3000, "2021-01-07"), Seq(3, "a3", 30.0, 3000, "2021-01-07") ) + +// there are two files in partition(dt = '2021-01-05') +checkAnswer(s"select count(distinct _hoodie_file_name) from $tableName where dt = '2021-01-05'")( + Seq(2) +) + +// would generate 6 other files in partition(dt = '2021-01-05') +spark.sql( + s""" + | insert into $tableName values + | (4, 'a1,1', 10, 1000, "2021-01-05"), + | (5, 'a1,1', 10, 1000, "2021-01-05"), + | (6, 'a1,1', 10, 1000, "2021-01-05"), + | (7, 'a1,1', 10, 1000, "2021-01-05"), + | (8, 'a1,1', 10, 1000, "2021-01-05"), + | (9, 'a3,3', 30, 3000, "2021-01-05") + """.stripMargin) + +checkAnswer(s"select count(distinct _hoodie_file_name) from $tableName where dt = '2021-01-05'")( + Seq(8) Review Comment: The result will be 3 without this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] stream2000 commented on a diff in pull request #9156: [Hudi-6515] Fix bucket index row writer write record to wrong handle
stream2000 commented on code in PR #9156: URL: https://github.com/apache/hudi/pull/9156#discussion_r1257466066 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java: ## @@ -91,27 +90,20 @@ private UTF8String extractRecordKey(InternalRow row) { } } - protected HoodieRowCreateHandle getBucketRowCreateHandle(String fileId, int bucketId) throws Exception { -if (!handles.containsKey(fileId)) { // if there is no handle corresponding to the fileId Review Comment: The fileId passed here is actually the partitionPath, so `handles.containsKey(fileId)` is always true for records in the same partition, and thus no new handle will be created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
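A toy, hedged sketch of the keying problem described in the comment above (illustrative names only, not the actual Hudi classes): when the handle cache is keyed by the partition path, every bucket in that partition silently reuses the first handle, whereas keying by fileId gives each bucket its own handle.

```scala
// Illustrative only: two records land in the same partition but in different buckets.
import scala.collection.mutable

val byPartition = mutable.Map.empty[String, String] // buggy: keyed by partition path
val byFileId    = mutable.Map.empty[String, String] // fixed: keyed by fileId

def handle(cache: mutable.Map[String, String], key: String, fileId: String): String =
  cache.getOrElseUpdate(key, s"handle-for-$fileId")

// Buggy lookup: the second bucket silently reuses the first bucket's handle.
println(handle(byPartition, "dt=2021-01-05", "bucket-00000")) // handle-for-bucket-00000
println(handle(byPartition, "dt=2021-01-05", "bucket-00001")) // handle-for-bucket-00000 (wrong)

// Fixed lookup: each bucket gets its own handle.
println(handle(byFileId, "bucket-00000", "bucket-00000"))     // handle-for-bucket-00000
println(handle(byFileId, "bucket-00001", "bucket-00001"))     // handle-for-bucket-00001
```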
[GitHub] [hudi] stream2000 opened a new pull request, #9156: [Hudi-6515] Fix bucket index row writer
stream2000 opened a new pull request, #9156: URL: https://github.com/apache/hudi/pull/9156 ### Change Logs Fix the bucket index row writer writing records to the wrong handle, and remove some useless code since the partition records are sorted. ### Impact Fixes the bucket index row writer writing records to the wrong handle ### Risk level (write none, low medium or high below) medium ### Documentation Update NONE ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6515) Bucket index row writer write record to wrong handle
Qijun Fu created HUDI-6515: -- Summary: Bucket index row writer write record to wrong handle Key: HUDI-6515 URL: https://issues.apache.org/jira/browse/HUDI-6515 Project: Apache Hudi Issue Type: Bug Reporter: Qijun Fu Bucket index row writer write record to wrong handle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] parisni commented on issue #9026: [SUPPORT] Duplicated partitions rows in MDT when reading w/ datasource
parisni commented on issue #9026: URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627681838 @yihua > If the metadata table is queried through Spark datasource directly after MDT compaction (i.e., no additional log file in the latest file slice), there is no duplicate. Did you add new partition during that step ? It turns out the duplication occurs when new partitions are added after compaction. see below: when no new partitions, no duplication. When new partitions, then it gets tons of duplicates. ```python sc.setLogLevel("ERROR") tableName = 'test_corrupted_mdt' basePath = "/tmp/{tableName}".format(tableName=tableName) hudi_options = { "hoodie.table.name": tableName, "hoodie.datasource.write.recordkey.field": "event_id", "hoodie.datasource.write.partitionpath.field": "part", "hoodie.datasource.write.table.name": tableName, "hoodie.datasource.write.operation": "upsert", "hoodie.datasource.write.precombine.field": "ts", "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator", "hoodie.datasource.write.hive_style_partitioning": "true", "hoodie.datasource.hive_sync.enable": "false", "hoodie.metadata.enable": "true", } mode="overwrite" for i in range(1,22): df =spark.sql("select '1' as event_id, '2' as ts, '"+str(i)+"' as part") # <-- W/ adding new partitions # df =spark.sql("select '1' as event_id, '2' as ts, '2' as part") <-- W/O adding new partitions (df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath)) mode="append" ct = spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count() print("NB:"+str(ct) + " for iteration:" + str(i)) NB:2 for iteration:1 NB:3 for iteration:2 NB:4 for iteration:3 NB:5 for iteration:4 NB:6 for iteration:5 NB:7 for iteration:6 NB:8 for iteration:7 NB:9 for iteration:8 NB:10 for iteration:9 NB:21 for iteration:10 <--- MDT COMPACTION NB:32 for iteration:11 NB:43 for iteration:12 NB:54 for iteration:13 NB:65 for iteration:14 NB:76 for iteration:15 NB:87 for iteration:16 NB:98 for iteration:17 NB:109 for iteration:18 NB:120 for iteration:19 NB:41 for iteration:20 <--- MDT COMPACTION NB:62 for iteration:21 NB:2 for iteration:1 NB:2 for iteration:2 NB:2 for iteration:3 NB:2 for iteration:4 NB:2 for iteration:5 NB:2 for iteration:6 NB:2 for iteration:7 NB:2 for iteration:8 NB:2 for iteration:9 NB:2 for iteration:10 <--- MDT COMPACTION NB:2 for iteration:11 NB:2 for iteration:12 NB:2 for iteration:13 NB:2 for iteration:14 NB:2 for iteration:15 NB:2 for iteration:16 NB:2 for iteration:17 NB:2 for iteration:18 NB:2 for iteration:19 NB:2 for iteration:20 <--- MDT COMPACTION NB:2 for iteration:21 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] garyli1019 commented on a diff in pull request #8679: [RFC-69] Hudi 1.X
garyli1019 commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1257449903 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We have been adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Flink, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational
[jira] [Closed] (HUDI-5861) table can not read date after overwrite table with bucket index
[ https://issues.apache.org/jira/browse/HUDI-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KnightChess closed HUDI-5861. - Resolution: Won't Fix > table can not read date after overwrite table with bucket index > --- > > Key: HUDI-5861 > URL: https://issues.apache.org/jira/browse/HUDI-5861 > Project: Apache Hudi > Issue Type: Bug > Components: core >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > Labels: pull-request-available > > For a table with bucket index, if you `insert overwrite` the same partition twice, > you cannot get any data with a SQL query. > Reason: insert overwrite with bucket index will tagLocation, and the new file > version will use the same fgId, so the latest file slice will be filtered out by the > `replacecommit` with the same fileGroupId. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5091) MergeInto syntax merge_condition does not support Non-Equal
[ https://issues.apache.org/jira/browse/HUDI-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KnightChess closed HUDI-5091. - Fix Version/s: 0.14.0 Resolution: Won't Fix > MergeInto syntax merge_condition does not support Non-Equal > --- > > Key: HUDI-5091 > URL: https://issues.apache.org/jira/browse/HUDI-5091 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > Fix For: 0.14.0 > > > Merge into sql merge condition support Non-equal condition > https://github.com/apache/hudi/issues/6400 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-5091) MergeInto syntax merge_condition does not support Non-Equal
[ https://issues.apache.org/jira/browse/HUDI-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KnightChess resolved HUDI-5091. --- > MergeInto syntax merge_condition does not support Non-Equal > --- > > Key: HUDI-5091 > URL: https://issues.apache.org/jira/browse/HUDI-5091 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: KnightChess >Assignee: KnightChess >Priority: Major > > Merge into sql merge condition support Non-equal condition > https://github.com/apache/hudi/issues/6400 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6514) Remove hoodie.auto.commit
zouxxyy created HUDI-6514: - Summary: Remove hoodie.auto.commit Key: HUDI-6514 URL: https://issues.apache.org/jira/browse/HUDI-6514 Project: Apache Hudi Issue Type: Improvement Reporter: zouxxyy The usage of `hoodie.auto.commit` is very confusing; the meaning of this configuration is not clear at all -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua commented on issue #9026: [SUPPORT] Duplicated partitions rows in MDT when reading w/ datasource
yihua commented on issue #9026: URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627630057 Based on my investigation, it looks like the HFile does not contain duplicate entries. If the metadata table is queried through the Spark datasource directly after MDT compaction (i.e., no additional log file in the latest file slice), there is no duplicate. The duplicates seem to be related to the Spark datasource read on the MDT only, and do not affect the MDT read through the `HoodieBackedTableMetadata` APIs. The performance is a separate issue we need to look into. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
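For anyone reproducing the comparison above, a short hedged check on the Spark datasource read path; the base path reuses the reproduction posted earlier in this thread, and the API-based read through `HoodieBackedTableMetadata` is not shown because its exact call sequence is not given here.

```scala
// Hedged sketch (spark-shell): read the metadata table through the Spark
// datasource and compare raw vs. de-duplicated row counts to spot the
// duplicates discussed above.
val mdt = spark.read.format("hudi").load("/tmp/test_corrupted_mdt/.hoodie/metadata")
println(s"rows=${mdt.count()}, distinct rows=${mdt.dropDuplicates().count()}")
```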