[GitHub] [hudi] linfey90 commented on a diff in pull request #8865: [HUDI-6306] dynamic catalog parameter
linfey90 commented on code in PR #8865: URL: https://github.com/apache/hudi/pull/8865#discussion_r1213956512 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/HadoopConfigurations.java: ## @@ -63,6 +63,7 @@ public static org.apache.hadoop.conf.Configuration getHiveConf(Configuration con if (explicitDir != null) { hadoopConf.addResource(new Path(explicitDir, "hive-site.xml")); } +conf.toMap().forEach(hadoopConf::set); return hadoopConf; Review Comment: Besides the hive-site.xml configuration, the storage configuration also needs to be passed down dynamically to the underlying layer, because the two sets of configuration are not separated. The goal is for the upper-layer application to be able to switch the catalog and the underlying storage dynamically. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
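The effect of the added line `conf.toMap().forEach(hadoopConf::set)` can be illustrated with plain maps. This is only a sketch: `ConfigOverlay` and `overlay` are hypothetical names, the `HashMap` stands in for the Hadoop `Configuration`, and it assumes that a programmatic `set` takes precedence over a value loaded from `hive-site.xml`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not part of Hudi: models "file defaults first,
// then every dynamic option written on top", as the added line does.
final class ConfigOverlay {

  private ConfigOverlay() {
  }

  /**
   * Returns a new map holding fileDefaults with every entry of
   * dynamicOptions written over it (dynamic options win on key clashes).
   */
  static Map<String, String> overlay(Map<String, String> fileDefaults,
                                     Map<String, String> dynamicOptions) {
    Map<String, String> merged = new HashMap<>(fileDefaults);
    // Same shape as conf.toMap().forEach(hadoopConf::set) in the diff.
    dynamicOptions.forEach(merged::put);
    return merged;
  }
}
```

Because every key is copied, options unrelated to Hive or storage would also end up in the merged configuration; whether that is acceptable is a design question for the PR, not something this sketch settles.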
[jira] [Closed] (HUDI-6308) add num_commits_after_last_request to flink
[ https://issues.apache.org/jira/browse/HUDI-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] eric closed HUDI-6308. -- Resolution: Not A Problem > add num_commits_after_last_request to flink > --- > > Key: HUDI-6308 > URL: https://issues.apache.org/jira/browse/HUDI-6308 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
hudi-bot commented on PR #8867: URL: https://github.com/apache/hudi/pull/8867#issuecomment-1573191026 ## CI report: * 7e24575b30fc34d8174a74a98431c6e1f42bef7c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17569) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] eric9204 commented on a diff in pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
eric9204 commented on code in PR #8871: URL: https://github.com/apache/hudi/pull/8871#discussion_r1213950388 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ## @@ -642,6 +642,7 @@ private FlinkOptions() { public static final String TIME_ELAPSED = "time_elapsed"; public static final String NUM_AND_TIME = "num_and_time"; public static final String NUM_OR_TIME = "num_or_time"; + public static final String NUM_COMMITS_AFTER_LAST_REQUEST = "num_commits_after_last_request"; @AdvancedConfig public static final ConfigOption COMPACTION_TRIGGER_STRATEGY = ConfigOptions Review Comment: Got it.
[GitHub] [hudi] eric9204 closed pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
eric9204 closed pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink URL: https://github.com/apache/hudi/pull/8871
[GitHub] [hudi] danny0405 commented on a diff in pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
danny0405 commented on code in PR #8871: URL: https://github.com/apache/hudi/pull/8871#discussion_r1213944708 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ## @@ -642,6 +642,7 @@ private FlinkOptions() { public static final String TIME_ELAPSED = "time_elapsed"; public static final String NUM_AND_TIME = "num_and_time"; public static final String NUM_OR_TIME = "num_or_time"; + public static final String NUM_COMMITS_AFTER_LAST_REQUEST = "num_commits_after_last_request"; @AdvancedConfig public static final ConfigOption COMPACTION_TRIGGER_STRATEGY = ConfigOptions Review Comment: The variable `NUM_COMMITS_AFTER_LAST_REQUEST` is not used by any other code.
[GitHub] [hudi] eric9204 commented on a diff in pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
eric9204 commented on code in PR #8871: URL: https://github.com/apache/hudi/pull/8871#discussion_r1213939610 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ## @@ -642,6 +642,7 @@ private FlinkOptions() { public static final String TIME_ELAPSED = "time_elapsed"; public static final String NUM_AND_TIME = "num_and_time"; public static final String NUM_OR_TIME = "num_or_time"; + public static final String NUM_COMMITS_AFTER_LAST_REQUEST = "num_commits_after_last_request"; @AdvancedConfig public static final ConfigOption COMPACTION_TRIGGER_STRATEGY = ConfigOptions Review Comment: @danny0405 Thank you for your reply. With the parameter `'compaction.trigger.strategy'='num_commits_after_last_request'` set, the job in my test generates a compaction plan after every fixed number of `deltacommits`. I don't understand the remark "The variable may not be used anywhere". Is there a problem with this compaction trigger strategy in some scenarios?
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8759: Add metrics counters for compaction requested/completed events.
SteNicholas commented on code in PR #8759: URL: https://github.com/apache/hudi/pull/8759#discussion_r1213933746 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java: ## @@ -129,6 +152,10 @@ public void testCompactionEmpty() { String compactionInstantTime = HoodieActiveTimeline.createNewInstantTime(); Option plan = table.scheduleCompaction(context, compactionInstantTime, Option.empty()); assertFalse(plan.isPresent(), "If there is nothing to compact, result will be empty"); + + // Verify compaction.requested, compaction.completed metrics counts. + assertEquals(0, getCompactionMetricCount("counter", "compaction.requested")); Review Comment: @amrishlal, you could use the constant `REQUESTED_COMPACTION_EXTENSION` in `HoodieTimeline`.
[GitHub] [hudi] SteNicholas commented on pull request #8759: Add metrics counters for compaction requested/completed events.
SteNicholas commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1573162218 @amrishlal, could you create a JIRA ticket or issue for introducing this metric? You should also update the title of this pull request; otherwise it will fail validation.
[jira] [Updated] (HUDI-6241) HIVE_SYNC_TABLE_STRATEGY in HiveSyncConfigHolder Documentation fix
[ https://issues.apache.org/jira/browse/HUDI-6241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6241: - Fix Version/s: 0.14.0 > HIVE_SYNC_TABLE_STRATEGY in HiveSyncConfigHolder Documentation fix > -- > > Key: HUDI-6241 > URL: https://issues.apache.org/jira/browse/HUDI-6241 > Project: Apache Hudi > Issue Type: Improvement > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > The documentation reads: Hive table synchronization strategy. Available option: ONLY_RO, > ONLY_RT, ALL. > > ONLY_RO and ONLY_RT need to be changed to RO and RT.
[GitHub] [hudi] stream2000 commented on a diff in pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
stream2000 commented on code in PR #8745: URL: https://github.com/apache/hudi/pull/8745#discussion_r1213873676 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java: ## @@ -298,6 +298,22 @@ protected void syncHoodieTable(String tableName, boolean useRealtimeInputFormat, LOG.info("Sync complete for " + tableName); } + private boolean needToSyncAllPartitions(Option lastCommitTimeSynced) { +if (!lastCommitTimeSynced.isPresent()) { + return true; +} +if (config.getBoolean(META_SYNC_USE_STATE_TRANSIENT_TIME)) { + // If we use state transient time to sync partitions and the last commit time synced is before latest archive time + // We need to fall back to list all partitions instead of load the whole archive timeline + Option latestArchiveTime = syncClient.getLastArchiveTime(); Review Comment: Thanks for your advice. I will take a look at that PR and see whether we can avoid introducing the config `META_SYNC_USE_STATE_TRANSIENT_TIME`.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
nsivabalan commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1213863164 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java: ## @@ -351,6 +347,7 @@ public void run() { .filterCompletedInstants().filter(i -> i.getTimestamp().equals(instantTime)).firstInstant(); instant = currentInstant.orElse(instant); // so that timeline is not reloaded very frequently +// TODO: this does not handle the case that the commit has indeed failed. Maybe use HB detection here. Review Comment: can we file follow up tickets for all these please? ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java: ## @@ -338,14 +338,20 @@ protected void initMetadataTable(Option instantTime) { * @param inFlightInstantTimestamp - The in-flight action responsible for the metadata table initialization */ private void initializeMetadataTable(Option inFlightInstantTimestamp) { -if (config.isMetadataTableEnabled()) { - HoodieTableMetadataWriter writer = SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config, - context, Option.empty(), inFlightInstantTimestamp); - try { -writer.close(); - } catch (Exception e) { -throw new HoodieException("Failed to instantiate Metadata table ", e); +if (!config.isMetadataTableEnabled()) { + LOG.error("11"); Review Comment: lets fix these unintended changes ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkMetadataTableRecordIndex.java: ## @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.index; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.data.HoodieData; +import org.apache.hudi.common.data.HoodiePairData; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.model.HoodieAvroRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordGlobalLocation; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.common.util.collection.ImmutablePair; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.data.HoodieJavaPairRDD; +import org.apache.hudi.data.HoodieJavaRDD; +import org.apache.hudi.exception.HoodieIndexException; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.hudi.metadata.HoodieTableMetadata; +import org.apache.hudi.metadata.HoodieTableMetadataUtil; +import org.apache.hudi.metadata.MetadataPartitionType; +import org.apache.hudi.table.HoodieTable; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.function.PairFlatMapFunction; +import 
org.apache.spark.sql.execution.PartitionIdPassthrough; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import scala.Tuple2; + +import java.util.ArrayList; +import java.util.Iterator; +import java.util.List; +import java.util.Map; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN; + +/** + * Hoodie Index implementation backed by the record index present in the Metadata Table. + */ +public class SparkMetadataTableRecordIndex extends HoodieIndex { + + private static final Logger LOG = LoggerFactory.getLogger(SparkMetadataTableRecordIndex.class); + // The index to fallback upon when record index is not initialized yet. + // This should be a global index like record index so that the behavior of tagging across partitions is not changed. + private static final
[GitHub] [hudi] danny0405 commented on a diff in pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
danny0405 commented on code in PR #8745: URL: https://github.com/apache/hudi/pull/8745#discussion_r1213871041 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java: ## @@ -298,6 +298,22 @@ protected void syncHoodieTable(String tableName, boolean useRealtimeInputFormat, LOG.info("Sync complete for " + tableName); } + private boolean needToSyncAllPartitions(Option lastCommitTimeSynced) { +if (!lastCommitTimeSynced.isPresent()) { + return true; +} +if (config.getBoolean(META_SYNC_USE_STATE_TRANSIENT_TIME)) { + // If we use state transient time to sync partitions and the last commit time synced is before latest archive time + // We need to fall back to list all partitions instead of load the whole archive timeline + Option latestArchiveTime = syncClient.getLastArchiveTime(); Review Comment: I believe we can get rid of the config option `META_SYNC_USE_STATE_TRANSIENT_TIME` if we keep both the start time (instant time) and the max completion time (transition time) in the HMS. We can use the instant time to check the max version id, and the completion time to track the real sync progress. Then we can fix the 'hollow' instants that are missed, as I did in https://github.com/apache/hudi/pull/8611
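The fallback described in the code comment of the quoted diff can be sketched in isolation. The diff is cut off after `getLastArchiveTime()`, so the comparison below is an assumption about how the method continues, not the PR's actual code. `PartitionSyncDecision` is a hypothetical class, `java.util.Optional` stands in for Hudi's `Option`, and instant times are assumed to be equal-length digit strings so that lexicographic order matches chronological order.

```java
import java.util.Optional;

// Hypothetical sketch, not the PR's implementation.
final class PartitionSyncDecision {

  private PartitionSyncDecision() {
  }

  /**
   * Decides whether to list all partitions instead of reading only the
   * commits written since the last sync.
   */
  static boolean needToSyncAllPartitions(Optional<String> lastCommitTimeSynced,
                                         boolean useStateTransitionTime,
                                         Optional<String> latestArchiveTime) {
    if (!lastCommitTimeSynced.isPresent()) {
      // Nothing has been synced yet, so every partition must be listed.
      return true;
    }
    if (useStateTransitionTime && latestArchiveTime.isPresent()) {
      // Assumed rule: if the last synced commit is older than the latest
      // archived instant, the active timeline no longer covers the gap.
      return lastCommitTimeSynced.get().compareTo(latestArchiveTime.get()) < 0;
    }
    return false;
  }
}
```

For example, under these assumptions a last synced time of `20230501000000` with a latest archive time of `20230601000000` would trigger a full partition listing.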
[GitHub] [hudi] hudi-bot commented on pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
hudi-bot commented on PR #8871: URL: https://github.com/apache/hudi/pull/8871#issuecomment-1573073017 ## CI report: * 88a2f49c4c02ce1ade13549ac56d6bf396411289 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17570)
[GitHub] [hudi] hudi-bot commented on pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
hudi-bot commented on PR #8867: URL: https://github.com/apache/hudi/pull/8867#issuecomment-1573072987 ## CI report: * 6a8fa73c9e31a90f6249772b5b840acf42ae1df5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17560) * 7e24575b30fc34d8174a74a98431c6e1f42bef7c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17569)
[GitHub] [hudi] hudi-bot commented on pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
hudi-bot commented on PR #8871: URL: https://github.com/apache/hudi/pull/8871#issuecomment-1573067586 ## CI report: * 88a2f49c4c02ce1ade13549ac56d6bf396411289 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
hudi-bot commented on PR #8867: URL: https://github.com/apache/hudi/pull/8867#issuecomment-1573067541 ## CI report: * 6a8fa73c9e31a90f6249772b5b840acf42ae1df5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17560) * 7e24575b30fc34d8174a74a98431c6e1f42bef7c UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #8871: [HUDI-6308]add num_commits_after_last_request to flink
danny0405 commented on code in PR #8871: URL: https://github.com/apache/hudi/pull/8871#discussion_r1213865107 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ## @@ -642,6 +642,7 @@ private FlinkOptions() { public static final String TIME_ELAPSED = "time_elapsed"; public static final String NUM_AND_TIME = "num_and_time"; public static final String NUM_OR_TIME = "num_or_time"; + public static final String NUM_COMMITS_AFTER_LAST_REQUEST = "num_commits_after_last_request"; @AdvancedConfig public static final ConfigOption COMPACTION_TRIGGER_STRATEGY = ConfigOptions Review Comment: The variable may not be used anywhere; maybe we should just fix the doc.
[GitHub] [hudi] danny0405 commented on issue #8855: [SUPPORT][FLINK SQL] Can not insert join result into hudi table
danny0405 commented on issue #8855: URL: https://github.com/apache/hudi/issues/8855#issuecomment-1573064043 Interesting, can you share the Flink checkpoint configuration parameters with us?
[jira] [Commented] (HUDI-6293) Make HoodieFlinkCompactor's parallelism of compact_task more reasonable.
[ https://issues.apache.org/jira/browse/HUDI-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728544#comment-17728544 ] Danny Chen commented on HUDI-6293: -- Another fix for clustering: b36e7c459904860b0be086c144ba0b175961e805 > Make HoodieFlinkCompactor's parallelism of compact_task more reasonable. > - > > Key: HUDI-6293 > URL: https://issues.apache.org/jira/browse/HUDI-6293 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: image-2023-05-31-16-41-02-798.png > > > !image-2023-05-31-16-41-02-798.png!
[GitHub] [hudi] danny0405 merged pull request #8866: [HUDI-6293] Make HoodieClusteringJob's parallelism of clustering_task…
danny0405 merged PR #8866: URL: https://github.com/apache/hudi/pull/8866
[hudi] branch master updated: [HUDI-6293] Make HoodieClusteringJob's parallelism of clustering_task more reasonable (#8866)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new b36e7c45990 [HUDI-6293] Make HoodieClusteringJob's parallelism of clustering_task more reasonable (#8866)
b36e7c45990 is described below

commit b36e7c459904860b0be086c144ba0b175961e805
Author: voonhous
AuthorDate: Fri Jun 2 10:52:04 2023 +0800

    [HUDI-6293] Make HoodieClusteringJob's parallelism of clustering_task more reasonable (#8866)
---
 .../org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
index 633f06b0e4f..223f85defca 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
@@ -310,9 +310,12 @@ public class HoodieFlinkClusteringJob {
       HoodieInstant instant = HoodieTimeline.getReplaceCommitRequestedInstant(clusteringInstant.getTimestamp());
 
+      int inputGroupSize = clusteringPlan.getInputGroups().size();
+
       // get clusteringParallelism.
       int clusteringParallelism = conf.getInteger(FlinkOptions.CLUSTERING_TASKS) == -1
-          ? clusteringPlan.getInputGroups().size() : conf.getInteger(FlinkOptions.CLUSTERING_TASKS);
+          ? inputGroupSize
+          : Math.min(conf.getInteger(FlinkOptions.CLUSTERING_TASKS), inputGroupSize);
 
       // Mark instant as clustering inflight
       table.getActiveTimeline().transitionReplaceRequestedToInflight(instant, Option.empty());
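The rule this commit introduces can be shown on its own. In the sketch below, `ClusteringParallelism` and `resolve` are hypothetical names that do not exist in Hudi; `configuredTasks` stands in for `conf.getInteger(FlinkOptions.CLUSTERING_TASKS)` and `inputGroupSize` for `clusteringPlan.getInputGroups().size()`.

```java
// Hypothetical helper, not part of Hudi: mirrors the ternary in the diff above.
final class ClusteringParallelism {

  private ClusteringParallelism() {
  }

  /**
   * Returns the parallelism for the clustering tasks.
   * A configured value of -1 means "derive it from the plan"; any other
   * value is capped at the number of input groups, since extra tasks
   * would have no group to work on.
   */
  static int resolve(int configuredTasks, int inputGroupSize) {
    return configuredTasks == -1
        ? inputGroupSize
        : Math.min(configuredTasks, inputGroupSize);
  }
}
```

Before the change, a configured value larger than the number of input groups was used as is; with the `Math.min` cap, the configured value acts as an upper bound rather than a fixed setting.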
[jira] [Updated] (HUDI-6258) support olap engine query mor table in table name without ro/rt suffix
[ https://issues.apache.org/jira/browse/HUDI-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6258: - Description: When a MOR table is queried with an OLAP engine such as StarRocks, Doris, or Presto, data can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the table name without the suffix, so this needs to be fixed to support queries under the above conditions (was: When a MOR table is queried with an OLAP engine such as StarRocks, Doris, or Presto, data can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the table name without the suffix, so this needs to be fixed to support queries under the above condition) > support olap engine query mor table in table name without ro/rt suffix > -- > > Key: HUDI-6258 > URL: https://issues.apache.org/jira/browse/HUDI-6258 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > When a MOR table is queried with an OLAP engine such as StarRocks, Doris, or Presto, data > can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the > table name without the suffix, so this needs to be fixed to support queries under the above conditions
[jira] [Updated] (HUDI-6258) support olap engine query mor table in table name without ro/rt suffix
[ https://issues.apache.org/jira/browse/HUDI-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-6258: - Description: When a MOR table is queried with an OLAP engine such as StarRocks, Doris, or Presto, data can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the table name without the suffix, so this needs to be fixed to support queries under the above condition (was: When a MOR table is queried with an OLAP engine such as StarRocks or Doris, data can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the table name without the suffix, so this needs to be fixed to support queries under the above condition) > support olap engine query mor table in table name without ro/rt suffix > -- > > Key: HUDI-6258 > URL: https://issues.apache.org/jira/browse/HUDI-6258 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > When a MOR table is queried with an OLAP engine such as StarRocks, Doris, or Presto, data > can only be read from the rt/ro tables. This is because Hudi does not sync metadata to the > table name without the suffix, so this needs to be fixed to support queries under the above condition
[GitHub] [hudi] danny0405 commented on a diff in pull request #8830: [MINOR] auto generate init client id
danny0405 commented on code in PR #8830: URL: https://github.com/apache/hudi/pull/8830#discussion_r1213860936 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/configuration/TestOptionsInference.java: ## @@ -69,6 +70,12 @@ void testSetupClientId() throws Exception { } } + @Test + void testAutoGenerateClient() { + Configuration conf = getConf(); + OptionsInference.setupClientId(conf); + assertNotNull(conf.getString(FlinkOptions.WRITE_CLIENT_ID), "auto generate client failed!"); + } Review Comment: > all writer will shared the ckp_meta How could that happen, given that the last client has already sent the heartbeat?
[GitHub] [hudi] danny0405 commented on pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
danny0405 commented on PR #8867: URL: https://github.com/apache/hudi/pull/8867#issuecomment-1573057917 Hi @satishkotha, can you help double-check this change? I need some background on why, in https://github.com/apache/hudi/pull/2129, only timestamp(6) is synced as timestamp in Hive.
[GitHub] [hudi] garyli1019 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
garyli1019 commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1213858505 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We have been adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Flink, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational
[GitHub] [hudi] danny0405 commented on a diff in pull request #8865: [HUDI-6306] dynamic catalog parameter
danny0405 commented on code in PR #8865: URL: https://github.com/apache/hudi/pull/8865#discussion_r1213858024 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/HadoopConfigurations.java: ## @@ -63,6 +63,7 @@ public static org.apache.hadoop.conf.Configuration getHiveConf(Configuration con if (explicitDir != null) { hadoopConf.addResource(new Path(explicitDir, "hive-site.xml")); } +conf.toMap().forEach(hadoopConf::set); return hadoopConf; Review Comment: It seems you want to pass the Hive config options through the Flink SQL options. This is not recommended; we prefer to configure the Hive properties through the `hive-site.xml` on the classpath.
[GitHub] [hudi] c-f-cooper commented on a diff in pull request #8830: [MINOR] auto generate init client id
c-f-cooper commented on code in PR #8830: URL: https://github.com/apache/hudi/pull/8830#discussion_r1213854825 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/configuration/TestOptionsInference.java: ## @@ -69,6 +70,12 @@ void testSetupClientId() throws Exception { } } + @Test + void testAutoGenerateClient() { + Configuration conf = getConf(); + OptionsInference.setupClientId(conf); + assertNotNull(conf.getString(FlinkOptions.WRITE_CLIENT_ID), "auto generate client failed!"); + } Review Comment: > It should be a bug, the client still sends a heartbeat anyway for the INIT_CLIENT_ID: > > https://github.com/apache/hudi/blob/00d50e91abe24aba31daa2fe2806de5414f03c77/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClientIds.java#L179 Maybe, but if the INIT_CLIENT_ID is empty, all writers will share the ckp_meta, which creates a risk of concurrent modification.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
nsivabalan commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1213814148 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -111,18 +111,27 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta public static final String METADATA_COMPACTION_TIME_SUFFIX = "001"; + // Virtual keys support for metadata table. This Field is + // from the metadata payload schema. + private static final String RECORD_KEY_FIELD_NAME = HoodieMetadataPayload.KEY_FIELD_NAME; + + // Average size of a record saved within the record index. + // Record index has a fixed size schema. This has been calculated based on experiments with default settings + // for block size (4MB), compression (GZ) and disabling the hudi metadata fields. Review Comment: default hfile block size on write in OSS is 1MB. Do we need to fix that? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -370,11 +336,10 @@ private boolean isBootstrapNeeded(Option partitionInfoList = filesPartitionAvailable ? listAllPartitionsFromMDT(initializationTime) : listAllPartitionsFromFilesystem(initializationTime); +Map> partitionToFilesMap = partitionInfoList.stream() +.map(p -> { + String partitionName = HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath()); + return Pair.of(partitionName, p.getFileNameToSizeMap()); +}) +.collect(Collectors.toMap(Pair::getKey, Pair::getValue)); + +for (MetadataPartitionType partitionType : partitionsToInit) { + // Find the commit timestamp to use for this partition. Each initialization should use its own unique commit time. 
+ String commitTimeForPartition = generateUniqueCommitInstantTime(initializationTime); + + LOG.info("Initializing MDT partition " + partitionType + " at instant " + commitTimeForPartition); + + Pair> fileGroupCountAndRecordsPair; + switch (partitionType) { +case FILES: + fileGroupCountAndRecordsPair = initializeFilesPartition(initializationTime, partitionInfoList); + break; +case BLOOM_FILTERS: + fileGroupCountAndRecordsPair = initializeBloomFiltersPartition(initializationTime, partitionToFilesMap); + break; +case COLUMN_STATS: + fileGroupCountAndRecordsPair = initializeColumnStatsPartition(partitionToFilesMap); + break; +case RECORD_INDEX: + fileGroupCountAndRecordsPair = initializeRecordIndexPartition(); + break; +default: + throw new HoodieMetadataException("Unsupported MDT partition type: " + partitionType); + } + + // Generate the file groups + final int fileGroupCount = fileGroupCountAndRecordsPair.getKey(); + ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for MDT partition " + partitionType + " should be > 0"); + initializeFileGroups(dataMetaClient, partitionType, commitTimeForPartition, fileGroupCount); + + // Perform the commit using bulkCommit + HoodieData records = fileGroupCountAndRecordsPair.getValue(); + bulkCommit(commitTimeForPartition, partitionType, records, fileGroupCount); + metadataMetaClient.reloadActiveTimeline(); + dataMetaClient.getTableConfig().setMetadataPartitionState(dataMetaClient, partitionType, true); + initMetadataReader(); } -initializeEnabledFileGroups(dataMetaClient, createInstantTime, enabledPartitionTypes); -initialCommit(createInstantTime, enabledPartitionTypes); -updateInitializedPartitionsInTableConfig(enabledPartitionTypes); + return true; } - private String getInitialCommitInstantTime(HoodieTableMetaClient dataMetaClient) { -// If there is no commit on the dataset yet, use the SOLO_COMMIT_TIMESTAMP as the instant time for initial commit -// Otherwise, we use the timestamp of the latest completed action. 
-String createInstantTime = dataMetaClient.getActiveTimeline().filterCompletedInstants() - .getReverseOrderedInstants().findFirst().map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP); -LOG.info("Creating a new metadata table in " + metadataWriteConfig.getBasePath() + " at instant " + createInstantTime); -return createInstantTime; + /** + * Returns a unique timestamp to use for initializing a MDT partition. + * + * Since commits are immutable, we should use unique timestamps to initialize each partition. For this, we will add a suffix to the given initializationTime + * until we find a unique timestamp. + * + * @param initializationTime Timestamp from dataset to use for initialization + * @return a unique timestamp for MDT + */ + private String generateUniqueCommitInstantTime(String initializationTime) { +// Add
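The Javadoc quoted above describes appending a suffix to `initializationTime` until an unused timestamp is found, but the quoted diff is cut off before the method body. The following is only a minimal sketch of that idea, not the code from the PR: the three-digit suffix format, the `existingInstants` set standing in for the metadata table timeline, and the class name are all assumptions made for illustration.

```java
import java.util.Set;

public class UniqueInstantTimeSketch {

    /**
     * Returns the first timestamp of the form initializationTime + NNN that is
     * not already present in existingInstants.
     *
     * Assumption: a zero-padded three-digit suffix; the real suffix format is
     * not visible in the quoted diff.
     */
    static String generateUniqueCommitInstantTime(String initializationTime,
                                                  Set<String> existingInstants) {
        for (int offset = 0; ; offset++) {
            String candidate = initializationTime + String.format("%03d", offset);
            if (!existingInstants.contains(candidate)) {
                // Each MDT partition initialization gets its own instant time.
                return candidate;
            }
        }
    }
}
```

Because every initialized partition consumes one suffix, initializing several partitions from the same `initializationTime` yields distinct, ordered instants.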
[GitHub] [hudi] danny0405 commented on a diff in pull request #8863: [HUDI-6305] s3a parameters cannot be filtered
danny0405 commented on code in PR #8863: URL: https://github.com/apache/hudi/pull/8863#discussion_r1213851720 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/HadoopConfigurations.java: ## @@ -49,8 +48,7 @@ public static org.apache.hadoop.conf.Configuration getParquetConf( */ public static org.apache.hadoop.conf.Configuration getHadoopConf(Configuration conf) { org.apache.hadoop.conf.Configuration hadoopConf = FlinkClientUtil.getHadoopConf(); -Map options = FlinkOptions.getPropertiesWithPrefix(conf.toMap(), HADOOP_PREFIX); -options.forEach(hadoopConf::set); +conf.toMap().forEach(hadoopConf::set); Review Comment: This could be a breaking change.
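To make the "breaking change" concern concrete, here is a small sketch contrasting the two behaviours in the diff above using plain maps. It assumes that the prefix is `hadoop.` and that the prefix is stripped before the key is set on the Hadoop configuration; `withPrefix` and `copyAll` are hypothetical stand-ins, not Hudi APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixFilterSketch {

    // Assumed value of HADOOP_PREFIX; not shown in the quoted diff.
    static final String HADOOP_PREFIX = "hadoop.";

    /** Old behaviour (as assumed here): keep only prefixed keys and strip the prefix. */
    static Map<String, String> withPrefix(Map<String, String> options, String prefix) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                out.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }

    /** New behaviour in the diff: every option is forwarded with its key unchanged. */
    static Map<String, String> copyAll(Map<String, String> options) {
        return new HashMap<>(options);
    }
}
```

Under these assumptions, an option such as `hadoop.fs.s3a.endpoint` reaches Hadoop as `fs.s3a.endpoint` in the old path but as `hadoop.fs.s3a.endpoint` in the new one, and unrelated table options such as `path` are forwarded as well.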
[jira] [Updated] (HUDI-6308) add num_commits_after_last_request to flink
[ https://issues.apache.org/jira/browse/HUDI-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6308: - Labels: pull-request-available (was: ) > add num_commits_after_last_request to flink > --- > > Key: HUDI-6308 > URL: https://issues.apache.org/jira/browse/HUDI-6308 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] eric9204 opened a new pull request, #8871: [HUDI-6308]add num_commits_after_last_request to flink
eric9204 opened a new pull request, #8871: URL: https://github.com/apache/hudi/pull/8871 ### Change Logs None ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Comment Edited] (HUDI-6258) support olap engine query mor table in table name without ro/rt suffix
[ https://issues.apache.org/jira/browse/HUDI-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728532#comment-17728532 ] xy edited comment on HUDI-6258 at 6/2/23 2:19 AM: -- Fixed via master branch: 3f9c45fdfa9b03e8092db07188b76c5931475733 was (Author: xuzifu): master commit: 3f9c45fdfa9b03e8092db07188b76c5931475733 > support olap engine query mor table in table name without ro/rt suffix > -- > > Key: HUDI-6258 > URL: https://issues.apache.org/jira/browse/HUDI-6258 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > When we query a MOR table with an OLAP engine such as StarRocks or Doris, we can get > data only from the rt/ro tables. This is because Hudi did not sync the metadata to > the plain table name, so we need to fix it to support such queries.
[jira] [Resolved] (HUDI-6258) support olap engine query mor table in table name without ro/rt suffix
[ https://issues.apache.org/jira/browse/HUDI-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy resolved HUDI-6258. -- > support olap engine query mor table in table name without ro/rt suffix > -- > > Key: HUDI-6258 > URL: https://issues.apache.org/jira/browse/HUDI-6258 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > When we query a MOR table with an OLAP engine such as StarRocks or Doris, we can get > data only from the rt/ro tables. This is because Hudi did not sync the metadata to > the plain table name, so we need to fix it to support such queries.
[jira] [Commented] (HUDI-6258) support olap engine query mor table in table name without ro/rt suffix
[ https://issues.apache.org/jira/browse/HUDI-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728532#comment-17728532 ] xy commented on HUDI-6258: -- master commit: 3f9c45fdfa9b03e8092db07188b76c5931475733 > support olap engine query mor table in table name without ro/rt suffix > -- > > Key: HUDI-6258 > URL: https://issues.apache.org/jira/browse/HUDI-6258 > Project: Apache Hudi > Issue Type: Bug > Components: hive >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > When we query a MOR table with an OLAP engine such as StarRocks or Doris, we can get > data only from the rt/ro tables. This is because Hudi did not sync the metadata to > the plain table name, so we need to fix it to support such queries.
[GitHub] [hudi] zhangyue19921010 commented on pull request #6868: [Hudi-4882] Multiple ordering fields and null value update for partial update to handle out-of-order events
zhangyue19921010 commented on PR #6868: URL: https://github.com/apache/hudi/pull/6868#issuecomment-1573024411 @fengjian428 would you mind rebasing onto master? Thanks
[GitHub] [hudi] stream2000 commented on a diff in pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
stream2000 commented on code in PR #8745: URL: https://github.com/apache/hudi/pull/8745#discussion_r1213840107 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java: ## @@ -298,6 +298,22 @@ protected void syncHoodieTable(String tableName, boolean useRealtimeInputFormat, LOG.info("Sync complete for " + tableName); } + private boolean needToSyncAllPartitions(Option lastCommitTimeSynced) { +if (!lastCommitTimeSynced.isPresent()) { + return true; +} +if (config.getBoolean(META_SYNC_USE_STATE_TRANSIENT_TIME)) { + // If we use state transient time to sync partitions and the last commit time synced is before latest archive time + // We need to fall back to list all partitions instead of load the whole archive timeline + Option latestArchiveTime = syncClient.getLastArchiveTime(); Review Comment: Maybe we need to scan some archive logs here, at least those created after `lastCommitTimeSynced`. Otherwise, we need to sync all partitions every time we do the archive.
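The fallback rule described in the diff comments above can be sketched as follows. This is a hedged illustration only: the quoted diff is truncated, so the comparison against the latest archive time and the `false` returned on the other branch are assumptions, and `java.util.Optional` stands in for Hudi's `Option`. Instant times are assumed to be equal-length strings that sort chronologically.

```java
import java.util.Optional;

public class SyncDecisionSketch {

    /**
     * Decides whether a full partition listing is needed.
     * Hypothetical helper; parameter names are chosen for illustration.
     */
    static boolean needToSyncAllPartitions(Optional<String> lastCommitTimeSynced,
                                           boolean useStateTransientTime,
                                           Optional<String> latestArchiveTime) {
        if (!lastCommitTimeSynced.isPresent()) {
            // Nothing was synced before, so everything must be listed.
            return true;
        }
        if (useStateTransientTime) {
            // If the last synced commit is older than the latest archive time,
            // the active timeline no longer covers the gap: fall back to a full listing.
            return latestArchiveTime.isPresent()
                && lastCommitTimeSynced.get().compareTo(latestArchiveTime.get()) < 0;
        }
        // Assumption: the remaining branch is not visible in the quoted diff.
        return false;
    }
}
```

The review comment's concern shows up directly in this sketch: whenever archiving advances past the last synced commit, the function returns `true`, which is why scanning only the archive logs created after `lastCommitTimeSynced` is suggested as an alternative.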
[jira] [Created] (HUDI-6308) add num_commits_after_last_request to flink
eric created HUDI-6308: -- Summary: add num_commits_after_last_request to flink Key: HUDI-6308 URL: https://issues.apache.org/jira/browse/HUDI-6308 Project: Apache Hudi Issue Type: Improvement Components: flink Reporter: eric Fix For: 0.14.0
[GitHub] [hudi] Riddle4045 commented on issue #8870: [SUPPORT] Trino returns 0 rows when reading Hudi tables written by Flink 1.16
Riddle4045 commented on issue #8870: URL: https://github.com/apache/hudi/issues/8870#issuecomment-1573003923 > The compaction is executed async by default every 5 delta_commit on the table, did you have any chance to see the Parquet files already? @danny0405 No, there were 6 commits in total and no compaction. Is there a setting to toggle it? Maybe it's turned off by default in Flink? I can also share the `.hoodie` folder if it helps you understand what's going on.
[GitHub] [hudi] danny0405 commented on a diff in pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
danny0405 commented on code in PR #8867: URL: https://github.com/apache/hudi/pull/8867#discussion_r1213826815 ## hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java: ## @@ -641,6 +643,26 @@ public void testSyncWithSchema(String syncMode, String enablePushDown) throws Ex "The last commit that was synced should be updated in the TBLPROPERTIES"); } + @ParameterizedTest + @MethodSource("syncModeAndEnablePushDown") + public void testSyncTimestamp(String syncMode, String enablePushDown) throws Exception { +hiveSyncProps.setProperty(HIVE_SYNC_MODE.key(), syncMode); Review Comment: Can we add a test case similar to `testSchemaConvertTimestampMicros`? There is no need to add Avro schema files.
[GitHub] [hudi] danny0405 commented on issue #8870: [SUPPORT] Trino returns 0 rows when reading Hudi tables written by Flink 1.16
danny0405 commented on issue #8870: URL: https://github.com/apache/hudi/issues/8870#issuecomment-1572999177 Compaction is executed asynchronously by default, every 5 delta commits on the table. Have you had a chance to check whether any Parquet files exist yet?
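For readers following this thread, the sketch below shows how the compaction trigger might be tightened in the options map passed to the Flink writer so that base Parquet files appear sooner. The option keys (`compaction.async.enabled`, `compaction.trigger.strategy`, `compaction.delta_commits`) are given as I understand the Hudi Flink options and should be checked against the Hudi version in use; the helper and class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CompactionOptionsSketch {

    /**
     * Returns a copy of the given writer options with compaction configured to
     * run after every delta commit. Keys are assumptions to verify against the
     * documentation of the Hudi release being used.
     */
    static Map<String, String> withEagerCompaction(Map<String, String> base) {
        Map<String, String> options = new HashMap<>(base);
        options.put("compaction.async.enabled", "true");           // keep async compaction on
        options.put("compaction.trigger.strategy", "num_commits"); // trigger by commit count
        options.put("compaction.delta_commits", "1");              // default discussed above is 5
        return options;
    }
}
```

Compacting after every delta commit trades write throughput for read freshness on the read-optimized view, so a value this low is mainly useful while debugging.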
[jira] [Closed] (HUDI-6256) fix the data table archiving and MDT cleaning config conflict
[ https://issues.apache.org/jira/browse/HUDI-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6256. Fix Version/s: 0.14.0 Resolution: Fixed Fixed via master branch: 32adbe4dfb2a0976cb312c2fa14eb49f5a29a151 > fix the data table archiving and MDT cleaning config conflict > - > > Key: HUDI-6256 > URL: https://issues.apache.org/jira/browse/HUDI-6256 > Project: Apache Hudi > Issue Type: Bug >Reporter: yonghua jian >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > fix the data table archiving and MDT cleaning config conflict
[GitHub] [hudi] danny0405 merged pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
danny0405 merged PR #8792: URL: https://github.com/apache/hudi/pull/8792
[hudi] branch master updated: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf… (#8792)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 32adbe4dfb2 [HUDI-6256] Fix the data table archiving and MDT cleaning config conf… (#8792) 32adbe4dfb2 is described below commit 32adbe4dfb2a0976cb312c2fa14eb49f5a29a151 Author: flashJd AuthorDate: Fri Jun 2 09:22:17 2023 +0800 [HUDI-6256] Fix the data table archiving and MDT cleaning config conf… (#8792) * Fix the data table archiving and MDT cleaning config conflict * Takes the MDT cleaning num commits as min(3, num_commits_DT), while 3 is the hardcode max cleaning num commits for MDT - Co-authored-by: Danny Chan --- .../hudi/metadata/HoodieMetadataWriteUtils.java| 2 +- .../functional/TestHoodieBackedMetadata.java | 40 ++ .../client/functional/TestHoodieMetadataBase.java | 2 +- 3 files changed, 42 insertions(+), 2 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java index 5221f6523b0..df951ff3796 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java @@ -93,7 +93,7 @@ public class HoodieMetadataWriteUtils { .withCleanerParallelism(parallelism) .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS) .withFailedWritesCleaningPolicy(failedWritesCleaningPolicy) -.retainCommits(DEFAULT_METADATA_CLEANER_COMMITS_RETAINED) +.retainCommits(Math.min(writeConfig.getCleanerCommitsRetained(), DEFAULT_METADATA_CLEANER_COMMITS_RETAINED)) .build()) // we will trigger archive manually, to ensure only regular writer invokes it .withArchivalConfig(HoodieArchivalConfig.newBuilder() diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java index 10b134887c4..b540f97d806 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java @@ -538,6 +538,46 @@ public class TestHoodieBackedMetadata extends TestHoodieMetadataBase { assertEquals("004", metadataTimeline.getCommitsTimeline().firstInstant().get().getTimestamp()); } + @ParameterizedTest + @EnumSource(HoodieTableType.class) + public void testMetadataArchivalCleanConfig(HoodieTableType tableType) throws Exception { +init(tableType, false); +writeConfig = getWriteConfigBuilder(true, true, false) +.withMetadataConfig(HoodieMetadataConfig.newBuilder() +.enable(true) +.enableMetrics(false) +.withMaxNumDeltaCommitsBeforeCompaction(1) +.build()) +.withCleanConfig(HoodieCleanConfig.newBuilder() +.retainCommits(1) +.build()) +.withArchivalConfig(HoodieArchivalConfig.newBuilder() +.archiveCommitsWith(2, 3) +.build()) +.build(); +initWriteConfigAndMetatableWriter(writeConfig, true); + +AtomicInteger commitTime = new AtomicInteger(1); +// Trigger 4 regular writes in data table. +for (int i = 1; i <= 4; i++) { + doWriteOperation(testTable, "00" + (commitTime.getAndIncrement()), INSERT); +} + +// The earliest deltacommit in the metadata table should be "001", +// and the "00" init deltacommit should be archived. 
+HoodieTableMetaClient metadataMetaClient = HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(metadataTableBasePath).build(); +HoodieActiveTimeline metadataTimeline = metadataMetaClient.reloadActiveTimeline(); +assertEquals("001", metadataTimeline.getCommitsTimeline().firstInstant().get().getTimestamp()); + +getHoodieWriteClient(writeConfig); +// Trigger data table archive, should archive "001", "002" +archiveDataTable(writeConfig, HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(basePath).build()); +// Trigger a regular write operation. metadata timeline archival should kick in and catch up with data table. +doWriteOperation(testTable, "00" + (commitTime.getAndIncrement()), INSERT); +metadataTimeline = metadataMetaClient.reloadActiveTimeline(); +assertEquals("003", metadataTimeline.getCommitsTimeline().firstInstant().get().getTimestamp()); + } +
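The core of the commit above is the one-line change to `retainCommits(...)`. A minimal sketch of that rule follows, assuming `DEFAULT_METADATA_CLEANER_COMMITS_RETAINED` is 3 as the commit message states; the class and method names are made up for illustration.

```java
public class MdtCleanerRetentionSketch {

    // Assumed from the commit message: 3 is the hard-coded maximum for the MDT.
    static final int DEFAULT_METADATA_CLEANER_COMMITS_RETAINED = 3;

    /** Commits retained by the metadata table cleaner: min(data table setting, 3). */
    static int metadataCommitsRetained(int dataTableCommitsRetained) {
        return Math.min(dataTableCommitsRetained, DEFAULT_METADATA_CLEANER_COMMITS_RETAINED);
    }
}
```

With the test's data table setting of `retainCommits(1)`, the metadata table now also retains 1 commit instead of 3, which is what allows its archival to catch up with the data table in `testMetadataArchivalCleanConfig`.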
[GitHub] [hudi] XuQianJin-Stars commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
XuQianJin-Stars commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1572974659 @hudi-bot run azure
[hudi] branch master updated (e8ca0d4121a -> 3f9c45fdfa9)
This is an automated email from the ASF dual-hosted git repository. forwardxu pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from e8ca0d4121a [HUDI-6303] Bump flink version to 1.16.2 and 1.17.1 (#8861) add 3f9c45fdfa9 [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix (#8795) No new revisions were added by this update. Summary of changes: .../src/main/java/org/apache/hudi/hive/HiveSyncTool.java| 5 + .../src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java | 6 ++ 2 files changed, 11 insertions(+)
[GitHub] [hudi] XuQianJin-Stars merged pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
XuQianJin-Stars merged PR #8795: URL: https://github.com/apache/hudi/pull/8795
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1572836228 ## CI report: * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] Riddle4045 commented on issue #8870: [SUPPORT] Trino returns 0 rows when reading Hudi tables written by Flink 1.16
Riddle4045 commented on issue #8870: URL: https://github.com/apache/hudi/issues/8870#issuecomment-1572789850 Possibly related to https://github.com/apache/hudi/issues/8038 @codope could you help me understand how to configure the table for read-optimized queries? Or is it something that the Hudi sync tool should handle out of the box? I'm not sure why I am not getting any rows back.
[GitHub] [hudi] hudi-bot commented on pull request #8869: Added logic to correctly verify partition keys for CustomAvroKeyGen
hudi-bot commented on PR #8869: URL: https://github.com/apache/hudi/pull/8869#issuecomment-1572785544 ## CI report: * 54977785e91e2ee46baddd399a0d1889a323c612 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17566) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1572772807 ## CI report: * 130523be1324218f56ce15ddc6ac3255e7cfcd9a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17550) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] Riddle4045 opened a new issue, #8870: [SUPPORT] Trino returns 0 rows when reading Hudi tables written by Flink 1.16
Riddle4045 opened a new issue, #8870: URL: https://github.com/apache/hudi/issues/8870

**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

TL;DR: Trino returns 0 records from a Hudi table even though I can see data in the object store.

I am writing Hudi tables to ABFS. Reduced code:

```java
DataStream fares = env.addSource(new TaxiFareGenerator()).map(
    event -> GenericRowData.of(
        event.getRideId(),
        event.getDriverId(),
        event.getTaxiId(),
        event.getStartTime(),
        event.getTip(),
        event.getTolls(),
        event.getTotalFare()//,
        //event.getPaymentType()
    ));

String targetTable = "TaxiFare";
String outputPath = String.join("/", basePath, "hudi4");

Map options = new HashMap<>();
options.put(FlinkOptions.PATH.key(), outputPath);
options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());

HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable)
    .column("rideId BIGINT")
    .column("driverId BIGINT")
    .column("taxiId BIGINT")
    .column("startTime BIGINT")
    .column("tip FLOAT")
    .column("tolls FLOAT")
    .column("totalFare FLOAT")
    .pk("driverId")
    .options(options);

builder.sink(fares, false);
env.execute("Hudi Table");
```

I sync these tables to HMS using Hudi-Sync-Tool.

```
2023-06-01T13:15:09,757 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync complete for **hudi5_ro**
2023-06-01T13:15:09,757 INFO [main] org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table hudi5_rt with base path abfs://flink@.dfs.core.windows.net/flink/click_events/hudi4 of type MERGE_ON_READ
2023-06-01T13:15:11,977 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync table hudi5_rt for the first time.
2023-06-01T13:15:17,712 INFO [main] org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be null
2023-06-01T13:15:17,712 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync all partitions given the last commit time synced is empty or before the start of the active timeline. Listing all partitions in abfs://flink@.dfs.core.windows.net/flink/click_events/hudi4, file system: AzureBlobFileSystem{uri=abfs://flink@.dfs.core.windows.net, user='ispatw', primaryUserGroup='ispatw'}
2023-06-01T13:15:24,755 INFO [main] org.apache.hudi.hive.HiveSyncTool - Sync complete for **hudi5_rt**
2023-06-01T13:15:24,761 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Closed a connection to metastore, current connections: 0
```

I can see data streaming into the ABFS location:

![image](https://github.com/apache/hudi/assets/3648351/66b233bf-6b14-43b0-a462-d903895ad664)

When I try to query it using Trino, my tables have no records:

![image](https://github.com/apache/hudi/assets/3648351/5d85d470-a818-450e-997a-f79ab4158475)

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.13
* trino : 410
* Storage (HDFS/S3/GCS..) : ABFS
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
nsivabalan commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1213649019 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java: ## @@ -41,35 +44,23 @@ public interface HoodieTableMetadataWriter extends Serializable, AutoCloseable { * @param engineContext * @param indexPartitionInfos - information about partitions to build such as partition type and base instant time */ - void buildMetadataPartitions(HoodieEngineContext engineContext, List indexPartitionInfos); - - /** - * Initialize file groups for the given metadata partitions when indexing is requested. - * - * @param dataMetaClient - meta client for the data table - * @param metadataPartitions - metadata partitions for which file groups needs to be initialized - * @param instantTime - instant time of the index action - * @throws IOException - */ - void initializeMetadataPartitions(HoodieTableMetaClient dataMetaClient, List metadataPartitions, String instantTime) throws IOException; + void buildMetadataPartitions(HoodieEngineContext engineContext, List indexPartitionInfos) throws IOException; /** * Drop the given metadata partitions. * - * @param metadataPartitions - * @throws IOException + * @param metadataPartitions List of MDT partitions to drop + * @throws IOException on failures */ void dropMetadataPartitions(List metadataPartitions) throws IOException; /** * Update the metadata table due to a COMMIT operation. * - * @param commitMetadata commit metadata of the operation of interest. - * @param instantTime instant time of the commit. - * @param isTableServiceAction true if caller is a table service. false otherwise. Only regular write operations can trigger metadata table services and this argument - * will assist in this. + * @param commitMetadata commit metadata of the operation of interest. + * @param instantTime instant time of the commit.
*/ - void update(HoodieCommitMetadata commitMetadata, String instantTime, boolean isTableServiceAction); + void update(HoodieCommitMetadata commitMetadata, HoodieData writeStatuses, String instantTime); Review Comment: Nope. Previously we used this flag only to trigger compaction and cleaning within update(). Now performTableService is a separate method and is invoked separately.
[GitHub] [hudi] hudi-bot commented on pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
hudi-bot commented on PR #8574: URL: https://github.com/apache/hudi/pull/8574#issuecomment-1572681066 ## CI report: * f71ca7ad4339c60719c97f3d54339b6a7bd5205f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17564)
[GitHub] [hudi] hudi-bot commented on pull request #8868: [HUDI-6278] Fixed the use of DynamoDBLockConfig class
hudi-bot commented on PR #8868: URL: https://github.com/apache/hudi/pull/8868#issuecomment-1572673768 ## CI report: * b6f86c770f7e35d7488cff0066d2d760453eb931 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17563)
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1572673644 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 0328e76358dd170d62b94fd286a9ffb728516429 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17562)
[GitHub] [hudi] parisni commented on pull request #8740: [HUDI-6231] Handle glue comments
parisni commented on PR #8740: URL: https://github.com/apache/hudi/pull/8740#issuecomment-1572624810 > @parisni Hi, do we have plan to push-forward this feature? Yes, I do. I am currently on vacation.
[jira] [Assigned] (HUDI-6253) Treat full bootstrap table as regular table
[ https://issues.apache.org/jira/browse/HUDI-6253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler reassigned HUDI-6253: - Assignee: (was: Jonathan Vexler) > Treat full bootstrap table as regular table > --- > > Key: HUDI-6253 > URL: https://issues.apache.org/jira/browse/HUDI-6253 > Project: Apache Hudi > Issue Type: Improvement > Components: bootstrap >Reporter: Jonathan Vexler >Priority: Major > > Bootstrap tables have a performance hit compared to regular tables. If you > bootstrap with full bootstrap mode, I think we should just treat the table > like a regular table. I think the easiest way to do this would be to prevent > setting bootstrap base path in the tableconfig. If that isn't possible, then > we could add another table config stating if it has metadata only bootstrap > files. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
codope commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1213269335 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java: ## @@ -99,7 +90,25 @@ public interface HoodieTableMetadataWriter extends Serializable, AutoCloseable { * Deletes the given metadata partitions. This path reuses DELETE_PARTITION operation. * * @param instantTime - instant time when replacecommit corresponding to the drop will be recorded in the metadata timeline - * @param partitions - list of {@link MetadataPartitionType} to drop + * @param partitions - list of {@link MetadataPartitionType} to drop */ void deletePartitions(String instantTime, List partitions); + + /** + * It returns write client for metadata table. + */ + BaseHoodieWriteClient getWriteClient(); Review Comment: rename to `getMetadataTableWriteClient` for clarity? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java: ## @@ -99,7 +90,25 @@ public interface HoodieTableMetadataWriter extends Serializable, AutoCloseable { * Deletes the given metadata partitions. This path reuses DELETE_PARTITION operation. * * @param instantTime - instant time when replacecommit corresponding to the drop will be recorded in the metadata timeline - * @param partitions - list of {@link MetadataPartitionType} to drop + * @param partitions - list of {@link MetadataPartitionType} to drop */ void deletePartitions(String instantTime, List partitions); + + /** + * It returns write client for metadata table. + */ + BaseHoodieWriteClient getWriteClient(); + + /** + * Returns true if the metadata table is initialized. + */ + boolean isInitialized(); Review Comment: Is it needed? Can we not get this from table config? If MDT is initialized then we should have some MDT partition as value for `hoodie.table.metadata.partitions` or `hoodie.table.metadata.partitions.inflight` right? 
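To make the `isInitialized()` question above concrete, here is a minimal sketch of deriving that state from table config rather than from a new interface method. The two property keys are the ones named in the comment; the `Properties` stand-in for the table config, the class name, and the helper names are hypothetical, and the real table config API may differ.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: treats the MDT as initialized (or
// initializing) when either partition list in the table config is non-empty.
class MetadataTableConfigCheck {
    static final String PARTITIONS = "hoodie.table.metadata.partitions";
    static final String PARTITIONS_INFLIGHT = "hoodie.table.metadata.partitions.inflight";

    static boolean hasAnyMetadataPartition(Properties tableConfig) {
        return isNonBlank(tableConfig.getProperty(PARTITIONS))
            || isNonBlank(tableConfig.getProperty(PARTITIONS_INFLIGHT));
    }

    // A missing key and a whitespace-only value both count as "no partitions".
    private static boolean isNonBlank(String value) {
        return value != null && !value.trim().isEmpty();
    }
}
```

Whether an inflight-only partition should count as "initialized" is a design choice this sketch does not settle; it simply mirrors the either/or wording of the comment.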
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java: ## @@ -41,35 +44,23 @@ public interface HoodieTableMetadataWriter extends Serializable, AutoCloseable { * @param engineContext * @param indexPartitionInfos - information about partitions to build such as partition type and base instant time */ - void buildMetadataPartitions(HoodieEngineContext engineContext, List indexPartitionInfos); - - /** - * Initialize file groups for the given metadata partitions when indexing is requested. - * - * @param dataMetaClient - meta client for the data table - * @param metadataPartitions - metadata partitions for which file groups needs to be initialized - * @param instantTime - instant time of the index action - * @throws IOException - */ - void initializeMetadataPartitions(HoodieTableMetaClient dataMetaClient, List metadataPartitions, String instantTime) throws IOException; + void buildMetadataPartitions(HoodieEngineContext engineContext, List indexPartitionInfos) throws IOException; /** * Drop the given metadata partitions. * - * @param metadataPartitions - * @throws IOException + * @param metadataPartitions List of MDT partitions to drop + * @throws IOException on failures */ void dropMetadataPartitions(List metadataPartitions) throws IOException; /** * Update the metadata table due to a COMMIT operation. * - * @param commitMetadata commit metadata of the operation of interest. - * @param instantTime instant time of the commit. - * @param isTableServiceAction true if caller is a table service. false otherwise. Only regular write operations can trigger metadata table services and this argument - * will assist in this. + * @param commitMetadata commit metadata of the operation of interest. + * @param instantTime instant time of the commit.
*/ - void update(HoodieCommitMetadata commitMetadata, String instantTime, boolean isTableServiceAction); + void update(HoodieCommitMetadata commitMetadata, HoodieData writeStatuses, String instantTime); Review Comment: Why remove `isTableServiceAction`? Wouldn't we want to distinguish the update call due to regular ingestion writer from table service writer? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -18,14 +18,19 @@ package org.apache.hudi.metadata; +import org.apache.avro.specific.SpecificRecordBase; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; Review Comment: nit: if we can avoid re-ordering imports, it would make review easier. Also, I think we put hudi imports first. ##
[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
hudi-bot commented on PR #8526: URL: https://github.com/apache/hudi/pull/8526#issuecomment-1572527934 ## CI report: * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
[GitHub] [hudi] prashantwason commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
prashantwason commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1213476749 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { */ @Override public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) { -if (enabled && metadata != null) { - // Is this rollback of an instant that has been synced to the metadata table? - String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0); - boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant)); - if (!wasSynced) { -// A compaction may have taken place on metadata table which would have included this instant being rolled back. -// Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458 -Option latestCompaction = metadata.getLatestCompactionTime(); -if (latestCompaction.isPresent()) { - wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get()); -} +// The commit which is being rolled back on the dataset +final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0); +// Find the deltacommits since the last compaction +Option> deltaCommitsInfo = + CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline()); +if (!deltaCommitsInfo.isPresent()) { + LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime)); + return; +} + +// This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction) +HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue(); +HoodieTimeline deltacommitsSinceCompaction = 
deltaCommitsInfo.get().getKey(); + +// The deltacommit that will be rolled back +HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime); + +// The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions +// are completed on the dataset. Hence, this case implies a rollback of completed commit which should actually be handled using restore. +if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) { Review Comment: CompactionUtils.getDeltaCommitsSinceLatestCompaction returns a Pair. The value in that Pair can be either a DeltaCommit instant (if no compactions happened) or a Commit action (if a compaction was found). We only want to check for the compaction here.
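The distinction described in this comment can be illustrated with a small standalone sketch. It does not use Hudi classes: `Instant` is a hypothetical stand-in for a timeline instant, the action strings are placeholders, and the timestamp comparison assumes fixed-width instant times that order lexicographically. The exact comparison the PR performs is not visible in the quoted diff, so the `<=` below is an assumption.

```java
// Hypothetical sketch, not Hudi code: decide whether a rollback targets a commit
// that is already covered by the latest compaction on the metadata table.
class RollbackGuard {
    static final String COMMIT_ACTION = "commit";            // a found compaction is reported as a commit action
    static final String DELTA_COMMIT_ACTION = "deltacommit"; // reported when no compaction has happened

    /** Stand-in for a timeline instant: an action name plus an instant time. */
    static final class Instant {
        final String action;
        final String timestamp;

        Instant(String action, String timestamp) {
            this.action = action;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns true only when the anchor is a real compaction and the commit being
     * rolled back is not newer than it. A deltacommit anchor means no compaction
     * exists, so there is nothing to guard against.
     */
    static boolean isCoveredByLatestCompaction(Instant anchor, String commitInstantTime) {
        if (!COMMIT_ACTION.equals(anchor.action)) {
            return false;
        }
        return commitInstantTime.compareTo(anchor.timestamp) <= 0;
    }
}
```

Checking the action first is what keeps a plain deltacommit anchor from being mistaken for a compaction boundary.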
[GitHub] [hudi] prashantwason commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
prashantwason commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1213473799 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -837,10 +840,75 @@ public void update(HoodieCleanMetadata cleanMetadata, String instantTime) { */ @Override public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { -processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, -metadataMetaClient.getActiveTimeline(), restoreMetadata, getRecordsGenerationParams(), instantTime, -metadata.getSyncedInstantTime()), false); -closeInternal(); +dataMetaClient.reloadActiveTimeline(); + +// Since the restore has completed on the dataset, the latest write timeline instant is the one to which the +// restore was performed. This should be always present. +final String restoreToInstantTime = dataMetaClient.getActiveTimeline().getWriteTimeline() +.getReverseOrderedInstants().findFirst().get().getTimestamp(); + +// We cannot restore to before the oldest compaction on MDT as we don't have the basefiles before that time. +Option lastCompaction = metadataMetaClient.getCommitTimeline().filterCompletedInstants().lastInstant(); Review Comment: Yes, the BaseHoodieWriteClient also has this check. It is duplicated because we allow update() methods on the MDT to be called from outside the write client path. It's safer this way, I suppose, even though it is duplicated.
[GitHub] [hudi] prashantwason commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
prashantwason commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1213470892 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java: ## @@ -669,32 +669,51 @@ public void restoreToSavepoint() { * @param savepointTime Savepoint time to rollback to */ public void restoreToSavepoint(String savepointTime) { -boolean initialMetadataTableIfNecessary = config.isMetadataTableEnabled(); -if (initialMetadataTableIfNecessary) { +boolean initializeMetadataTableIfNecessary = config.isMetadataTableEnabled(); +if (initializeMetadataTableIfNecessary) { try { -// Delete metadata table directly when users trigger savepoint rollback if mdt existed and beforeTimelineStarts +// Delete metadata table directly when users trigger savepoint rollback if mdt existed and if the savePointTime is beforeTimelineStarts +// or before the oldest compaction on MDT. +// We cannot restore to before the oldest compaction on MDT as we don't have the basefiles before that time. String metadataTableBasePathStr = HoodieTableMetadata.getMetadataTableBasePath(config.getBasePath()); HoodieTableMetaClient mdtClient = HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(metadataTableBasePathStr).build(); -// Same as HoodieTableMetadataUtil#processRollbackMetadata +Option lastCompaction = mdtClient.getCommitTimeline().filterCompletedInstants().lastInstant(); Review Comment: Done
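The rule stated in the new code comments of this diff (delete the MDT when the savepoint is before the timeline start or before the compaction on the MDT) can be sketched in isolation. This is not the PR's code: the class and method names are hypothetical, instant times are modelled as plain strings that order lexicographically, and whether the real comparison is strict is an assumption.

```java
import java.util.Optional;

// Hypothetical sketch, not Hudi code: when must the metadata table be deleted
// (and later rebuilt) instead of being restored to a savepoint?
class SavepointRestoreCheck {
    static boolean mustDeleteMetadataTable(String savepointTime,
                                           String timelineStart,
                                           Optional<String> mdtCompactionTime) {
        // The savepoint lies before the start of the MDT timeline.
        if (savepointTime.compareTo(timelineStart) < 0) {
            return true;
        }
        // The savepoint lies before the MDT compaction: the base files needed
        // to restore to an earlier time no longer exist.
        return mdtCompactionTime.isPresent()
            && savepointTime.compareTo(mdtCompactionTime.get()) < 0;
    }
}
```

Passing `Optional.empty()` models an MDT that has never been compacted, in which case only the timeline-start condition applies.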
[GitHub] [hudi] prashantwason commented on a diff in pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.
prashantwason commented on code in PR #8604: URL: https://github.com/apache/hudi/pull/8604#discussion_r1213467835 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java: ## @@ -159,6 +162,13 @@ protected void commit(String instantTime, Maphttps://github.com/apache/hudi/pull/8684 where the new partition enabling has been changed to: 1. Use bulkInsert for initial commit 2. Always use a unique timestamp on MDT
[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
prashantwason commented on PR #8526: URL: https://github.com/apache/hudi/pull/8526#issuecomment-1572480457 @danny0405 PTAL again.
[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
prashantwason commented on code in PR #8526: URL: https://github.com/apache/hudi/pull/8526#discussion_r1213452295 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java: ## @@ -152,98 +153,107 @@ private void addShutDownHook() { // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows // for max of Integer size private HoodieLogBlock readBlock() throws IOException { -int blockSize; -long blockStartPos = inputStream.getPos(); -try { - // 1 Read the total size of the block - blockSize = (int) inputStream.readLong(); -} catch (EOFException | CorruptedLogFileException e) { - // An exception reading any of the above indicates a corrupt block - // Create a corrupt block by finding the next MAGIC marker or EOF - return createCorruptBlock(blockStartPos); -} - -// We may have had a crash which could have written this block partially -// Skip blockSize in the stream and we should either find a sync marker (start of the next -// block) or EOF. If we did not find either of it, then this block is a corrupted block. -boolean isCorrupted = isBlockCorrupted(blockSize); -if (isCorrupted) { - return createCorruptBlock(blockStartPos); -} - -// 2. Read the version for this log format -HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion(); +long blockStartPos = 0; +long blockSize = 0; -// 3. Read the block type for a log block -HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion); +try { + blockStartPos = inputStream.getPos(); -// 4. Read the header for a log block, if present + // 1 Read the total size of the block + blockSize = inputStream.readLong(); + + // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the + // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we + // find the corrupted block eventually. 
+ + // 2. Read the version for this log format + HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion(); + + // 3. Read the block type for a log block + HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion); + + // 4. Read the header for a log block, if present + Map header = + nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + + // 5. Read the content length for the content + // Fallback to full-block size if no content-length + // TODO replace w/ hasContentLength + long contentLength = + nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize; + checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength); + + // 6. Read the content or skip content based on IO vs Memory trade-off by client + long contentPosition = inputStream.getPos(); + boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION; + Option content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily); + + // 7. Read footer if any + Map footer = + nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + + // 8. Read log block length, if present. This acts as a reverse pointer when traversing a + // log file in reverse + if (nextBlockVersion.hasLogBlockLength()) { +long currentPos = inputStream.getPos(); +long logBlockLength = inputStream.readLong(); +if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) { + return createCorruptBlock(blockStartPos); +} + } -Map header = -nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + // 9. Read the log block end position in the log file + long blockEndPos = inputStream.getPos(); -// 5. 
Read the content length for the content -// Fallback to full-block size if no content-length -// TODO replace w/ hasContentLength -int contentLength = -nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize; + HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc = + new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos); -// 6. Read the content or skip content based on IO vs Memory trade-off by client -long contentPosition = inputStream.getPos(); -boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
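The idea behind the diff above, deferring the corruption check until the block has been read front to back instead of seeking ahead first, can be shown on a toy format. The layout below (8-byte length, content, 8-byte trailing length) is a deliberately simplified assumption and is not the Hudi log format; it omits the magic marker, version, header, and footer, and all names are hypothetical.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical sketch, not Hudi code: a length-prefixed block whose trailing
// length acts as a reverse pointer and doubles as the deferred corruption check.
class DeferredCorruptionCheck {
    static byte[] writeBlock(byte[] content) {
        return ByteBuffer.allocate(Long.BYTES + content.length + Long.BYTES)
            .putLong(content.length)
            .put(content)
            .putLong(content.length)
            .array();
    }

    /** Reads the block sequentially; returns null if it turns out to be corrupt. */
    static byte[] readBlock(byte[] raw) {
        ByteBuffer in = ByteBuffer.wrap(raw);
        try {
            long length = in.getLong();
            // Cheap sanity check on the value just read, no seek required.
            if (length < 0 || length > in.remaining()) {
                return null;
            }
            byte[] content = new byte[(int) length];
            in.get(content);
            // Final check: the trailing length must match the leading one.
            long trailingLength = in.getLong();
            return trailingLength == length ? content : null;
        } catch (BufferUnderflowException e) {
            // Running out of bytes mid-block indicates a partially written block.
            return null;
        }
    }
}
```

A truncated or bit-flipped block is only detected after its content has been read, which is the trade-off the diff's comment accepts in exchange for avoiding the seek, read, and re-seek on every healthy block.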
[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
prashantwason commented on code in PR #8526: URL: https://github.com/apache/hudi/pull/8526#discussion_r1213451923 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java: ## @@ -152,98 +153,107 @@ private void addShutDownHook() { // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows // for max of Integer size private HoodieLogBlock readBlock() throws IOException { -int blockSize; -long blockStartPos = inputStream.getPos(); -try { - // 1 Read the total size of the block - blockSize = (int) inputStream.readLong(); -} catch (EOFException | CorruptedLogFileException e) { - // An exception reading any of the above indicates a corrupt block - // Create a corrupt block by finding the next MAGIC marker or EOF - return createCorruptBlock(blockStartPos); -} - -// We may have had a crash which could have written this block partially -// Skip blockSize in the stream and we should either find a sync marker (start of the next -// block) or EOF. If we did not find either of it, then this block is a corrupted block. -boolean isCorrupted = isBlockCorrupted(blockSize); -if (isCorrupted) { - return createCorruptBlock(blockStartPos); -} - -// 2. Read the version for this log format -HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion(); +long blockStartPos = 0; +long blockSize = 0; -// 3. Read the block type for a log block -HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion); +try { + blockStartPos = inputStream.getPos(); -// 4. Read the header for a log block, if present + // 1 Read the total size of the block + blockSize = inputStream.readLong(); + + // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the + // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we + // find the corrupted block eventually. 
+ + // 2. Read the version for this log format + HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion(); + + // 3. Read the block type for a log block + HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion); + + // 4. Read the header for a log block, if present + Map header = + nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + + // 5. Read the content length for the content + // Fallback to full-block size if no content-length + // TODO replace w/ hasContentLength + long contentLength = + nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize; + checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength); + + // 6. Read the content or skip content based on IO vs Memory trade-off by client + long contentPosition = inputStream.getPos(); + boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION; + Option content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily); + + // 7. Read footer if any + Map footer = + nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + + // 8. Read log block length, if present. This acts as a reverse pointer when traversing a + // log file in reverse + if (nextBlockVersion.hasLogBlockLength()) { +long currentPos = inputStream.getPos(); +long logBlockLength = inputStream.readLong(); +if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) { + return createCorruptBlock(blockStartPos); +} + } -Map header = -nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null; + // 9. Read the log block end position in the log file + long blockEndPos = inputStream.getPos(); -// 5. 
Read the content length for the content -// Fallback to full-block size if no content-length -// TODO replace w/ hasContentLength -int contentLength = -nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize; + HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc = + new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos); -// 6. Read the content or skip content based on IO vs Memory trade-off by client -long contentPosition = inputStream.getPos(); -boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
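Editorial sketch of the deferred corruption check discussed in the diff above: rather than seeking ahead to look for the next marker, the reader parses the block sequentially and validates it once at the end, comparing the declared block size against the trailing length field and against the position actually reached. The class name, the simplified layout `[blockSize][contentLength][content][trailer]`, and the `MAGIC_LENGTH` value are assumptions made for illustration; this is not Hudi's actual log format or reader code.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

/** Hypothetical, simplified model of a log block; not the real Hudi layout. */
public class DeferredCorruptionCheck {
  // Assumed length of the magic marker; the real value is defined by the log format.
  public static final int MAGIC_LENGTH = 6;

  /**
   * Parses one block sequentially and only validates it at the end:
   * the trailing length must equal blockSize + MAGIC_LENGTH, and the
   * position before the trailer must equal bodyStart + blockSize.
   */
  public static boolean isBlockConsistent(ByteBuffer in) {
    try {
      long blockSize = in.getLong();          // 1. declared size of the block body
      int bodyStart = in.position();
      long contentLength = in.getLong();      // 2. declared content length
      if (contentLength < 0 || contentLength > in.remaining()) {
        return false;                         // content would run past the end of the data
      }
      in.position(in.position() + (int) contentLength); // 3. skip (or read) the content
      int posBeforeTrailer = in.position();
      long trailer = in.getLong();            // 4. trailing length, used as a reverse pointer
      return blockSize == trailer - MAGIC_LENGTH
          && posBeforeTrailer == bodyStart + blockSize;
    } catch (BufferUnderflowException e) {
      return false;                           // truncated block, e.g. a partial write after a crash
    }
  }

  /** Test helper: writes a block whose declared size may deliberately disagree with its content. */
  public static ByteBuffer writeBlock(byte[] content, long declaredBlockSize) {
    ByteBuffer out = ByteBuffer.allocate(8 + 8 + content.length + 8);
    out.putLong(declaredBlockSize);
    out.putLong(content.length);
    out.put(content);
    out.putLong(declaredBlockSize + MAGIC_LENGTH);
    out.flip();
    return out;
  }
}
```

In this sketch a well-formed block with 5 content bytes has a declared size of 13 (the 8-byte content-length field plus the content). A wrong declared size or a truncated buffer is only detected once the trailer is reached, which is the trade-off the diff's comment describes.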
[GitHub] [hudi] prashantwason commented on pull request #8487: [HUDI-6093] Use the correct partitionToReplacedFileIds during commit.
prashantwason commented on PR #8487: URL: https://github.com/apache/hudi/pull/8487#issuecomment-1572447665 @nsivabalan I fixed the conflict and all tests are passing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
hudi-bot commented on PR #8792: URL: https://github.com/apache/hudi/pull/8792#issuecomment-1572436343 ## CI report: * 683dc368e714ace1c44d741d642f1fe64b7910b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17548) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17559) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Assigned] (HUDI-6253) Treat full bootstrap table as regular table
[ https://issues.apache.org/jira/browse/HUDI-6253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler reassigned HUDI-6253: - Assignee: Jonathan Vexler > Treat full bootstrap table as regular table > --- > > Key: HUDI-6253 > URL: https://issues.apache.org/jira/browse/HUDI-6253 > Project: Apache Hudi > Issue Type: Improvement > Components: bootstrap >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > > Bootstrap tables have a performance hit compared to regular tables. If you > bootstrap with full bootstrap mode, I think we should just treat the table > like a regular table. I think the easiest way to do this would be to prevent > setting bootstrap base path in the tableconfig. If that isn't possible, then > we could add another table config stating if it has metadata only bootstrap > files. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5987) Clustering on bootstrap table fails when row writer is disabled
[ https://issues.apache.org/jira/browse/HUDI-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler closed HUDI-5987. - Resolution: Fixed > Clustering on bootstrap table fails when row writer is disabled > --- > > Key: HUDI-5987 > URL: https://issues.apache.org/jira/browse/HUDI-5987 > Project: Apache Hudi > Issue Type: Bug > Components: bootstrap, table-service >Reporter: Sagar Sumit >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > As was pointed out in > [https://github.com/apache/hudi/pull/8206#pullrequestreview-1345104330,] > clustering on bootstrap table fails when row writer is disabled. The non-row > writer path does not handle bootstrap file paths. An attempt to fix this was > made in [https://github.com/apache/hudi/pull/8289] but it only succeeds for > Spark 3.2+ versions.
[GitHub] [hudi] jonvex commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
jonvex commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1213427044 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for Hudi
[GitHub] [hudi] hudi-bot commented on pull request #8867: [HUDI-6307] Sync TIMESTAMP_MILLIS to hive
hudi-bot commented on PR #8867: URL: https://github.com/apache/hudi/pull/8867#issuecomment-1572398800 ## CI report: * 6a8fa73c9e31a90f6249772b5b840acf42ae1df5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17560)
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1572398057 ## CI report: * 130523be1324218f56ce15ddc6ac3255e7cfcd9a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17550)
[GitHub] [hudi] bvaradar commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
bvaradar commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1572371115 Sure.
[GitHub] [hudi] jonvex commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
jonvex commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1572364211 @bvaradar do you think you would be able to review?
[GitHub] [hudi] gamblewin commented on issue #8855: [SUPPORT][FLINK SQL] Can not insert join result into hudi table
gamblewin commented on issue #8855: URL: https://github.com/apache/hudi/issues/8855#issuecomment-1572347014 Screenshot: https://github.com/apache/hudi/assets/39117591/2def2c5a-39bc-4bfc-9d8b-0575d3fc3119 It seems like it doesn't trigger a checkpoint.
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1572324712 ## CI report: * 130523be1324218f56ce15ddc6ac3255e7cfcd9a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8869: Added logic to correctly verify partition keys for CustomAvroKeyGen
hudi-bot commented on PR #8869: URL: https://github.com/apache/hudi/pull/8869#issuecomment-1572325298 ## CI report: * 54977785e91e2ee46baddd399a0d1889a323c612 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17566)
[GitHub] [hudi] hudi-bot commented on pull request #8869: Added logic to correctly verify partition keys for CustomAvroKeyGen
hudi-bot commented on PR #8869: URL: https://github.com/apache/hudi/pull/8869#issuecomment-1572312116 ## CI report: * 54977785e91e2ee46baddd399a0d1889a323c612 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1572310179 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 9e5504e078b93d1997cf901868234e36c69dd97e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17558)
[GitHub] [hudi] cbts-alec-johnson commented on issue #8857: [SUPPORT] Column comments not syncing to AWS Glue Catalog
cbts-alec-johnson commented on issue #8857: URL: https://github.com/apache/hudi/issues/8857#issuecomment-1572291524 > Guess this is what you needed: https://github.com/apache/hudi/pull/8740/files Yes this is what I need. Also, I think you may have tagged this gcp-support instead of aws-support?
[GitHub] [hudi] ad1happy2go opened a new pull request, #8869: Added logic to correctly verify partition keys for CustomAvroKeyGen
ad1happy2go opened a new pull request, #8869: URL: https://github.com/apache/hudi/pull/8869 ### Change Logs Added logic to correctly verify partition keys for CustomAvroKeyGenerator. This fixes GitHub issue https://github.com/apache/hudi/issues/8372 ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] bkosuru commented on issue #8800: GCP: Hudi stopped working in Dataproc Serverless 1.1.4
bkosuru commented on issue #8800: URL: https://github.com/apache/hudi/issues/8800#issuecomment-1572227348 It started working with Hudi 0.13.1. Did you fix anything in 0.13.1 to make it work?
[GitHub] [hudi] machadoluiz commented on issue #8824: [SUPPORT] Performance and Data Integrity Issues with Hudi for Long-Term Data Retention
machadoluiz commented on issue #8824: URL: https://github.com/apache/hudi/issues/8824#issuecomment-1572195137 @ad1happy2go, the runtime increases gradually. In a specific example, it reached 2 minutes and 30 seconds at around 300 commits (or 10 months). This poses a challenge for us, given that it represents less than a year's worth of data. Is there any way to improve this performance, or is this a trade-off we must deal with? Does Hudi perform operations using actual data or just metadata in the background? Does this mean that if we expand the size of the database, the cost/runtime of managing the metadata will increase proportionally? Or is this related only to the filenames, in which case the cost will be roughly constant regardless of the size of the database?
[GitHub] [hudi] hudi-bot commented on pull request #8866: [HUDI-6293] Make HoodieClusteringJob's parallelism of clustering_task…
hudi-bot commented on PR #8866: URL: https://github.com/apache/hudi/pull/8866#issuecomment-1572191506 ## CI report: * badb098e6bd6b0ee8b317514f08eb460659a8d93 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17557)
[GitHub] [hudi] hudi-bot commented on pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
hudi-bot commented on PR #8745: URL: https://github.com/apache/hudi/pull/8745#issuecomment-1572190692 ## CI report: * 62377696531fc1d4ee2b7c0c86897d1cfb6b5de9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17556)
[GitHub] [hudi] hudi-bot commented on pull request #8865: [HUDI-6306] dynamic catalog parameter
hudi-bot commented on PR #8865: URL: https://github.com/apache/hudi/pull/8865#issuecomment-1572082961 ## CI report: * 821e287f35e93974ae28f1e1e7a513c68749c281 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17554)
[GitHub] [hudi] stream2000 commented on pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
stream2000 commented on PR #8745: URL: https://github.com/apache/hudi/pull/8745#issuecomment-1572050050 @danny0405 Hi Danny, could you help review this PR?
[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
xiarixiaoyao commented on code in PR #8851: URL: https://github.com/apache/hudi/pull/8851#discussion_r1211598407 ## hudi-spark-datasource/hudi-spark3.0.x/src/main/java/org/apache/spark/sql/execution/datasources/parquet/Spark30HoodieVectorizedParquetRecordReader.java: ## @@ -184,4 +209,62 @@ public boolean nextKeyValue() throws IOException { ++batchIdx; return true; } + + private void initializeInternal() throws IOException, UnsupportedOperationException { +// Check that the requested schema is supported. +missingColumns = new HashMap<>(); +List columns = requestedSchema.getColumns(); +List paths = requestedSchema.getPaths(); +for (int i = 0; i < requestedSchema.getFieldCount(); ++i) { + String[] colPath = paths.get(i); + if (!fileSchema.containsPath(colPath)) { +if (columns.get(i).getMaxDefinitionLevel() == 0) { + // Column is missing in data but the required data is non-nullable. This file is invalid. + throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(colPath)); +} +missingColumns.put(i, requestedSchema.getFields().get(i).getName()); + } +} +missed = schema != null && missingColumns.keySet().stream() +.allMatch(columnIndex -> Objects.nonNull(schema.findField(missingColumns.get(columnIndex)).getDefaultValue())); + } + + private void setColumnDefaultValue(int columnIndex) { Review Comment: ditto ## hudi-spark-datasource/hudi-spark3.0.x/src/main/java/org/apache/spark/sql/execution/datasources/parquet/Spark30HoodieVectorizedParquetRecordReader.java: ## @@ -184,4 +209,62 @@ public boolean nextKeyValue() throws IOException { ++batchIdx; return true; } + + private void initializeInternal() throws IOException, UnsupportedOperationException { Review Comment: Pls extract this method to reuse it for different version parquetRecordReader ## hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/SerDeHelper.java: ## @@ -144,6 +148,11 @@ private static void toJson(Types.RecordType record, Integer maxColumnId, Long ve if 
(field.doc() != null) { generator.writeStringField(DOC, field.doc()); } + // NOTE: The value of null field is JsonProperties.NULL_VALUE. + if (field.getDefaultValue() != null && field.getDefaultValue() != JsonProperties.NULL_VALUE) { +generator.writeFieldName(DEFAULT_VALUE); +generator.writeObject(field.getDefaultValue()); Review Comment: Do I need to verify the correctness of the defaultValue type before serialization, rather than just serializing it directly private static void writeDefaultValue(Types.Field field, JsonGenerator generator) throws IOException { if (field.getDefaultValue() == null || field.getDefaultValue() == JsonProperties.NULL_VALUE) { return; } switch (field.type().typeId()) { case RECORD: case ARRAY: case MAP: JsonNode defaultNode = JacksonUtils.toJsonNode(field.getDefaultValue()); generator.writeObjectField(DEFAULT_VALUE, defaultNode); break; case STRING: generator.writeStringField(DEFAULT_VALUE, field.getDefaultValue().toString()); break; case INT: generator.writeNumberField(DEFAULT_VALUE, Integer.valueOf(field.getDefaultValue().toString())); break; case LONG: generator.writeNumberField(DEFAULT_VALUE, Long.valueOf(field.getDefaultValue().toString())); break; case FLOAT: generator.writeNumberField(DEFAULT_VALUE, Double.valueOf(field.getDefaultValue().toString())); break; case DOUBLE: generator.writeNumberField(DEFAULT_VALUE, Double.valueOf(field.getDefaultValue().toString())); break; case BOOLEAN: generator.writeBooleanField(DEFAULT_VALUE, Boolean.valueOf(field.getDefaultValue().toString())); break; case DECIMAL: generator.writeBinaryField(DEFAULT_VALUE, (byte[])field.getDefaultValue()); break; case FIXED: generator.writeBinaryField(DEFAULT_VALUE, (byte[])field.getDefaultValue()); break; case BINARY: generator.writeBinaryField(DEFAULT_VALUE, (byte[])field.getDefaultValue()); break; case DATE: generator.writeNumberField(DEFAULT_VALUE, Integer.valueOf(field.getDefaultValue().toString())); break; case TIME: 
generator.writeNumberField(DEFAULT_VALUE, Long.valueOf(field.getDefaultValue().toString())); break; case TIMESTAMP: generator.writeNumberField(DEFAULT_VALUE, Long.valueOf(field.getDefaultValue().toString())); break; case UUID: generator.writeStringField(DEFAULT_VALUE,
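Editorial sketch related to the review question above (whether to verify the default value's type before serializing it): one option is to check the runtime class of the default against the declared field type and fail fast, instead of converting through `toString()`. The class, the enum, and the accepted Java classes per type are hypothetical simplifications chosen for illustration; they are not Hudi's `Types` API and are not taken from the PR.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper; type ids are an illustrative subset, not Hudi's internal schema types. */
public class DefaultValueValidator {
  public enum TypeId { STRING, INT, LONG, FLOAT, DOUBLE, BOOLEAN, BINARY, ARRAY, MAP }

  /** Returns true when the default value's runtime class fits the declared type. */
  public static boolean isCompatible(TypeId typeId, Object defaultValue) {
    if (defaultValue == null) {
      return true; // no default to serialize, so nothing to validate
    }
    switch (typeId) {
      case STRING:
        return defaultValue instanceof CharSequence;
      case INT:
        return defaultValue instanceof Integer;
      case LONG:
        // Assumption: an int default may be widened to long.
        return defaultValue instanceof Long || defaultValue instanceof Integer;
      case FLOAT:
      case DOUBLE:
        return defaultValue instanceof Number;
      case BOOLEAN:
        return defaultValue instanceof Boolean;
      case BINARY:
        return defaultValue instanceof byte[];
      case ARRAY:
        return defaultValue instanceof List;
      case MAP:
        return defaultValue instanceof Map;
      default:
        return false;
    }
  }

  /** Fails fast before serialization instead of writing a mistyped default. */
  public static Object requireCompatible(TypeId typeId, Object defaultValue) {
    if (!isCompatible(typeId, defaultValue)) {
      throw new IllegalArgumentException(
          "Default value " + defaultValue + " does not match field type " + typeId);
    }
    return defaultValue;
  }
}
```

A serializer could call `requireCompatible` once per field and then write the value with the matching typed writer, so that a string default on an INT field is rejected when the schema is written rather than when a file is read.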
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1572012983 ## CI report: * 23a574b64681c95c17db47d4c63c86d7e0215ba9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17528) * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565)
[GitHub] [hudi] hudi-bot commented on pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
hudi-bot commented on PR #8574: URL: https://github.com/apache/hudi/pull/8574#issuecomment-1572011208 ## CI report: * dacba722974aa32f506626c106d90fa86d22cd23 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17375) * f71ca7ad4339c60719c97f3d54339b6a7bd5205f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17564)
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1571998458 ## CI report: * 23a574b64681c95c17db47d4c63c86d7e0215ba9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17528) * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
hudi-bot commented on PR #8574: URL: https://github.com/apache/hudi/pull/8574#issuecomment-1571996984 ## CI report: * dacba722974aa32f506626c106d90fa86d22cd23 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17375) * f71ca7ad4339c60719c97f3d54339b6a7bd5205f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8487: [HUDI-6093] Use the correct partitionToReplacedFileIds during commit.
hudi-bot commented on PR #8487: URL: https://github.com/apache/hudi/pull/8487#issuecomment-1571996647 ## CI report: * 280515ea1c939f0afa7a4cd8a5593e55bd394648 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17553) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
KnightChess commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1571974348 before: https://github.com/apache/hudi/assets/20125927/9dc94884-347f-4284-8c9c-58d38a6c936a after: https://github.com/apache/hudi/assets/20125927/12f6bc24-42fe-4ee4-83ef-929e117331a2
[GitHub] [hudi] hudi-bot commented on pull request #8868: [HUDI-6278] Fixed the use of DynamoDBLockConfig class
hudi-bot commented on PR #8868: URL: https://github.com/apache/hudi/pull/8868#issuecomment-1571922230 ## CI report: * b6f86c770f7e35d7488cff0066d2d760453eb931 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17563) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571921986 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * e712d534d9c0a16b3027706ed394de88ff2b293d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17555) * 0328e76358dd170d62b94fd286a9ffb728516429 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17562) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build