[GitHub] [hudi] ad1happy2go commented on issue #8253: [SUPPORT]HoodieJavaWriteClientExample Process finished with exit code 137 (interrupted by signal 9: SIGKILL) with jol-core 0.16
ad1happy2go commented on issue #8253: URL: https://github.com/apache/hudi/issues/8253#issuecomment-1621063805 @Mulavar Sorry for the delay on this, but I am able to successfully run the HoodieJavaWriteClientExample with this JDK version. It looks to be a laptop issue only, so I am closing the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1621056203 ## CI report: * ef7585ba8d32d772500f31f95f3c04bfcac046e7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18326) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
hudi-bot commented on PR #9122: URL: https://github.com/apache/hudi/pull/9122#issuecomment-1621023436 ## CI report: * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18324)
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1621023467 ## CI report: * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
danny0405 commented on code in PR #9122: URL: https://github.com/apache/hudi/pull/9122#discussion_r1252517410 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -144,18 +144,7 @@ public BaseHoodieTableFileIndex(HoodieEngineContext engineContext, this.engineContext = engineContext; this.fileStatusCache = fileStatusCache; -// The `shouldListLazily` variable controls how we initialize the TableFileIndex: -// - non-lazy/eager listing (shouldListLazily=false): all partitions and file slices will be loaded eagerly during initialization. -// - lazy listing (shouldListLazily=true): partitions listing will be done lazily with the knowledge from query predicate on partition -//columns. And file slices fetching only happens for partitions satisfying the given filter. -// -// In SparkSQL, `shouldListLazily` is controlled by option `REFRESH_PARTITION_AND_FILES_IN_INITIALIZATION`. -// In lazy listing case, if no predicate on partition is provided, all partitions will still be loaded. -if (shouldListLazily) { - this.tableMetadata = createMetadataTable(engineContext, metadataConfig, basePath); Review Comment: Ignore, it is created in `doRefresh`
[GitHub] [hudi] danny0405 commented on a diff in pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
danny0405 commented on code in PR #9122: URL: https://github.com/apache/hudi/pull/9122#discussion_r1252517112 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -144,18 +144,7 @@ public BaseHoodieTableFileIndex(HoodieEngineContext engineContext, this.engineContext = engineContext; this.fileStatusCache = fileStatusCache; -// The `shouldListLazily` variable controls how we initialize the TableFileIndex: -// - non-lazy/eager listing (shouldListLazily=false): all partitions and file slices will be loaded eagerly during initialization. -// - lazy listing (shouldListLazily=true): partitions listing will be done lazily with the knowledge from query predicate on partition -//columns. And file slices fetching only happens for partitions satisfying the given filter. -// -// In SparkSQL, `shouldListLazily` is controlled by option `REFRESH_PARTITION_AND_FILES_IN_INITIALIZATION`. -// In lazy listing case, if no predicate on partition is provided, all partitions will still be loaded. -if (shouldListLazily) { - this.tableMetadata = createMetadataTable(engineContext, metadataConfig, basePath); Review Comment: The initialization of `tableMetadata` is removed?
[jira] [Closed] (HUDI-6476) Improve the performance of getAllPartitionPaths
[ https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6476. Resolution: Fixed Fixed via master branch: 72f047715fe8f2ad9ff19a31728fbfb761fbe0d9 > Improve the performance of getAllPartitionPaths > --- > > Key: HUDI-6476 > URL: https://issues.apache.org/jira/browse/HUDI-6476 > Project: Apache Hudi > Issue Type: Improvement > Components: hudi-utilities >Reporter: Wechar >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: After improvement.png, Before improvement.png > > > Currently Hudi will list all file statuses in the hudi table directory, which > can be avoided to improve the performance of getAllPartitionPaths, especially > for non-partitioned tables with many files. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6476) Improve the performance of getAllPartitionPaths
[ https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6476: - Fix Version/s: 0.14.0 > Improve the performance of getAllPartitionPaths > --- > > Key: HUDI-6476 > URL: https://issues.apache.org/jira/browse/HUDI-6476 > Project: Apache Hudi > Issue Type: Improvement > Components: hudi-utilities >Reporter: Wechar >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: After improvement.png, Before improvement.png > > > Currently Hudi will list all file statuses in the hudi table directory, which > can be avoided to improve the performance of getAllPartitionPaths, especially > for non-partitioned tables with many files.
[hudi] branch master updated: [HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 72f047715fe [HUDI-6476] Improve the performance of getAllPartitionPaths (#9121) 72f047715fe is described below commit 72f047715fe8f2ad9ff19a31728fbfb761fbe0d9 Author: Wechar Yu AuthorDate: Wed Jul 5 12:14:24 2023 +0800 [HUDI-6476] Improve the performance of getAllPartitionPaths (#9121) Currently Hudi will list all file statuses in the hudi table directory, which can be avoided to improve the performance of #getAllPartitionPaths, especially for non-partitioned tables with many files. What we change in this patch: * reduce a stage in getPartitionPathWithPathPrefix() * only check directories to find the PartitionMetadata * avoid listStatus of .hoodie/.hoodie_partition_metadata --- .../metadata/FileSystemBackedTableMetadata.java | 52 +- 1 file changed, 22 insertions(+), 30 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java index 69c237d6684..6a6f46a65ef 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java @@ -47,6 +47,7 @@ import java.util.List; import java.util.Map; import java.util.concurrent.CopyOnWriteArrayList; import java.util.stream.Collectors; +import java.util.stream.Stream; /** * Implementation of {@link HoodieTableMetadata} based file-system-backed table metadata.
@@ -106,42 +107,33 @@ public class FileSystemBackedTableMetadata implements HoodieTableMetadata { // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel + // List all directories in parallel: + // if the current directory contains PartitionMetadata, add it to the result + // if the current directory does not contain PartitionMetadata, add its subdirectories to the queue to be processed. engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> { + // result below holds a list of pairs. The first entry in the pair optionally holds a deduced partition path, + // and the second entry optionally holds a directory path to be processed further. + List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -return Arrays.stream(fileSystem.listStatus(path)); +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { + return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path)), Option.empty())); +} +return Arrays.stream(fileSystem.listStatus(path, p -> { + try { +return fileSystem.isDirectory(p) && !p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME); + } catch (IOException e) { +// noop + } + return false; +})).map(status -> Pair.of(Option.empty(), Option.of(status.getPath()))); }, listingParallelism); pathsToList.clear(); - // if the current directory contains PartitionMetadata, add it to the result - // if the current directory does not contain PartitionMetadata, add it to the queue to be processed. - int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); - if (!dirToFileListing.isEmpty()) { -// result below holds a list of pairs.
The first entry in the pair optionally holds a deduced partition path, -// and the second entry optionally holds a directory path to be processed further. -engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); -List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> { - FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); - if (fileStatus.isDirectory()) { -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { - return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty()); -} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { - return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); -} - } else if
[GitHub] [hudi] danny0405 merged pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
danny0405 merged PR #9121: URL: https://github.com/apache/hudi/pull/9121
[GitHub] [hudi] danny0405 commented on a diff in pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
danny0405 commented on code in PR #9121: URL: https://github.com/apache/hudi/pull/9121#discussion_r1252515100 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -106,42 +107,33 @@ private List<String> getPartitionPathWithPathPrefix(String relativePathPrefix) t // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel + // List all directories in parallel: + // if the current directory contains PartitionMetadata, add it to the result + // if the current directory does not contain PartitionMetadata, add its subdirectories to the queue to be processed. engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> { + // result below holds a list of pairs. The first entry in the pair optionally holds a deduced partition path, + // and the second entry optionally holds a directory path to be processed further. + List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -return Arrays.stream(fileSystem.listStatus(path)); +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { + return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path)), Option.empty())); +} +return Arrays.stream(fileSystem.listStatus(path, p -> { + try { +return fileSystem.isDirectory(p) && !p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME); + } catch (IOException e) { +// noop + } + return false; +})).map(status -> Pair.of(Option.empty(), Option.of(status.getPath()))); }, listingParallelism); pathsToList.clear(); - // if the current directory contains PartitionMetadata, add it to the result - // if the current directory does not contain PartitionMetadata, add it to the queue to be processed.
- int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); - if (!dirToFileListing.isEmpty()) { -// result below holds a list of pairs. The first entry in the pair optionally holds a deduced partition path, -// and the second entry optionally holds a directory path to be processed further. -engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); -List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> { - FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); - if (fileStatus.isDirectory()) { -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { - return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty()); -} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { - return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); -} - } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) { -String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent()); -return Pair.of(Option.of(partitionName), Option.empty()); - } - return Pair.of(Option.empty(), Option.empty()); -}, fileListingParallelism); - -partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get()) -.collect(Collectors.toList())); + partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get()) + .collect(Collectors.toList())); Review Comment: good point, the code looks much simpler!
[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
flashJd commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620990951 I'm confused why `insert overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12'` is not a dynamic partition write. The semantics should be controllable by config, and we should clarify the concepts of static partition and dynamic partition: 1) https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite Iceberg dynamic and static partition overwrite semantics 2) https://docs.databricks.com/delta/selective-overwrite.html#language-sql Delta Lake dynamic partition overwrite semantics 3) https://hudi.apache.org/cn/docs/quick-start-guide/#insert-overwrite -- insert overwrite partitioned table with dynamic partition: insert overwrite table hudi_cow_pt_tbl select 10, 'a10', 1100, '2021-12-09', '10'; -- insert overwrite partitioned table with static partition: insert overwrite hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100; @nsivabalan @yihua @XuQianJin-Stars @KnightChess
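The distinction the comment above is drawing can be stated mechanically: a static overwrite names the target partitions in a `PARTITION(...)` spec and replaces exactly those, while a dynamic overwrite derives the replaced partitions from the partition values present in the incoming rows. A small illustrative sketch of that rule (a hypothetical helper, not Hudi's implementation):

```java
import java.util.Set;

public class OverwriteSemantics {
    /**
     * Which partitions does INSERT OVERWRITE replace?
     * - static overwrite: the partitions fixed in the PARTITION(...) clause;
     * - dynamic overwrite: only the partitions that actually appear in the data.
     */
    static Set<String> partitionsToReplace(Set<String> staticSpec, Set<String> partitionsInData) {
        return (staticSpec != null && !staticSpec.isEmpty()) ? staticSpec : partitionsInData;
    }
}
```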
[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
hudi-bot commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620980724 ## CI report: * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313) * 48608a9eafa20f9fde6d414a4b4de50a2bcf6050 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18329)
[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
hudi-bot commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620976354 ## CI report: * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313) * 48608a9eafa20f9fde6d414a4b4de50a2bcf6050 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
danny0405 commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1252501322 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java: ## @@ -91,8 +108,9 @@ public static HoodieWriteConfig createMetadataWriteConfig( .withCleanConfig(HoodieCleanConfig.newBuilder() .withAsyncClean(DEFAULT_METADATA_ASYNC_CLEAN) .withAutoClean(false) -.withCleanerParallelism(parallelism) -.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS) +.withCleanerParallelism(defaultParallelism) +.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS) +.retainFileVersions(2) Review Comment: Even if Uber has been running this for 6+ months, it does not mean the config works well for OSS, because while migrating the Uber patches, many fixes and other nuances were introduced. I would suggest we move this change to the next release to keep the stability of the existing MDT workflow.
[GitHub] [hudi] danny0405 commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
danny0405 commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1252500090 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { */ @Override public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) { -if (enabled && metadata != null) { - // Is this rollback of an instant that has been synced to the metadata table? - String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0); - boolean wasSynced = metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant)); - if (!wasSynced) { -// A compaction may have taken place on metadata table which would have included this instant being rolled back. -// Revisit this logic to relax the compaction fencing : https://issues.apache.org/jira/browse/HUDI-2458 -Option<String> latestCompaction = metadata.getLatestCompactionTime(); -if (latestCompaction.isPresent()) { - wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get()); -} +// The commit which is being rolled back on the dataset +final String commitInstantTime = rollbackMetadata.getCommitsRollback().get(0); +// Find the deltacommits since the last compaction +Option<Pair<HoodieTimeline, HoodieInstant>> deltaCommitsInfo = + CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline()); +if (!deltaCommitsInfo.isPresent()) { + LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no deltacommits on MDT", commitInstantTime, instantTime)); + return; +} + +// This could be a compaction or deltacommit instant (See CompactionUtils.getDeltaCommitsSinceLatestCompaction) +HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue(); +HoodieTimeline deltacommitsSinceCompaction = deltaCommitsInfo.get().getKey(); + +// The deltacommit that will be rolled back +HoodieInstant deltaCommitInstant = new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime); + +// The commit being rolled back should not be older than the latest compaction on the MDT. Compaction on MDT only occurs when all actions +// are completed on the dataset. Hence, this case implies a rollback of a completed commit which should actually be handled using restore. +if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) { + final String compactionInstantTime = compactionInstant.getTimestamp(); + if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, compactionInstantTime)) { +throw new HoodieMetadataException(String.format("Commit being rolled back %s is older than the latest compaction %s. " ++ "There are %d deltacommits after this compaction: %s", commitInstantTime, compactionInstantTime, +deltacommitsSinceCompaction.countInstants(), deltacommitsSinceCompaction.getInstants())); } +} - Map<MetadataPartitionType, HoodieData<HoodieRecord>> records = - HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(), - rollbackMetadata, getRecordsGenerationParams(), instantTime, - metadata.getSyncedInstantTime(), wasSynced); - commit(instantTime, records, false); - closeInternal(); +if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) { + LOG.info("Rolling back MDT deltacommit " + commitInstantTime); + if (!getWriteClient().rollback(commitInstantTime, instantTime)) { +throw new HoodieMetadataException("Failed to rollback deltacommit at " + commitInstantTime); + } +} else { + LOG.info(String.format("Ignoring rollback of instant %s at %s since there are no corresponding deltacommits on MDT", + commitInstantTime, instantTime)); } + +// Rollback of MOR table may end up adding a new log file. So we need to check for added files and add them to MDT +processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, metadataMetaClient.getActiveTimeline(), +rollbackMetadata, getRecordsGenerationParams(), instantTime, +metadata.getSyncedInstantTime(), true), false); Review Comment: Discussed offline, we need to track the inflight log files for cleaning anyway, but we have no good way to fix that currently; needs thinking through ~
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620970058 ## CI report: * 984c3d691c3e7915fb1333ee823a641098774270 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318) * ef7585ba8d32d772500f31f95f3c04bfcac046e7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18326)
[GitHub] [hudi] danny0405 commented on issue #9119: [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13
danny0405 commented on issue #9119: URL: https://github.com/apache/hudi/issues/9119#issuecomment-1620968741 Sorry for the instability; we will be more conservative about code reviewing and merging in the future.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1252497920 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging { } } + private def deducePayloadClassNameLegacy(operation: String, tableType: String, insertMode: InsertMode): String = { +if (operation == UPSERT_OPERATION_OPT_VAL && + tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) { + // Validate duplicate key for COW, for MOR it will do the merge with the DefaultHoodieRecordPayload + // on reading. + // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when SparkRecordMerger is default + classOf[ValidateDuplicateKeyPayload].getCanonicalName +} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == COW_TABLE_TYPE_OPT_VAL && + insertMode == InsertMode.STRICT){ + // Validate duplicate key for inserts to COW table when using strict insert mode. + classOf[ValidateDuplicateKeyPayload].getCanonicalName +} else { + classOf[OverwriteWithLatestAvroPayload].getCanonicalName +} Review Comment: By default, should we use `DefaultHoodieRecordPayload` instead ?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
danny0405 commented on code in PR #9123: URL: https://github.com/apache/hudi/pull/9123#discussion_r1252496143 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter { if (mergedParams.contains(PRECOMBINE_FIELD.key())) { mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key())) } +if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key()) + && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) { + // enable merge allow duplicates when operation type is insert + mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true") Review Comment: I feel by default, we should never dedup for INSERT operation. That keeps the behavior in line with regular RDBMS.
[jira] [Updated] (HUDI-6475) Optimize TableNotFoundException message
[ https://issues.apache.org/jira/browse/HUDI-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6475: - Fix Version/s: 0.14.0 > Optimize TableNotFoundException message > --- > > Key: HUDI-6475 > URL: https://issues.apache.org/jira/browse/HUDI-6475 > Project: Apache Hudi > Issue Type: Improvement >Reporter: xiaoping.huang >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0
[jira] [Closed] (HUDI-6475) Optimize TableNotFoundException message
[ https://issues.apache.org/jira/browse/HUDI-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6475. Resolution: Fixed Fixed via master branch: 2322ac9d22784df2ccebcbdf898286c16fe0c211 > Optimize TableNotFoundException message > --- > > Key: HUDI-6475 > URL: https://issues.apache.org/jira/browse/HUDI-6475 > Project: Apache Hudi > Issue Type: Improvement >Reporter: xiaoping.huang >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6475] Optimize TableNotFoundException message (#9120)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 2322ac9d227 [HUDI-6475] Optimize TableNotFoundException message (#9120) 2322ac9d227 is described below commit 2322ac9d22784df2ccebcbdf898286c16fe0c211 Author: huangxiaoping <1754789...@qq.com> AuthorDate: Wed Jul 5 11:18:04 2023 +0800 [HUDI-6475] Optimize TableNotFoundException message (#9120) --- .../src/main/java/org/apache/hudi/DataSourceUtils.java| 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java index c9c10fd7c7e..47a45479c09 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java +++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java @@ -55,9 +55,11 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.io.IOException; +import java.util.Arrays; import java.util.HashMap; import java.util.List; import java.util.Map; +import java.util.stream.Collectors; import static org.apache.hudi.common.util.CommitUtils.getCheckpointValueAsString; @@ -81,7 +83,7 @@ public class DataSourceUtils { } } -throw new TableNotFoundException("Unable to find a hudi table for the user provided paths."); +throw new TableNotFoundException(Arrays.stream(userProvidedPaths).map(Path::toString).collect(Collectors.joining(","))); } /**
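The one-line change in the commit above replaces a generic "unable to find a hudi table" message with a comma-joined list of every searched path. The joining itself is plain `java.util.stream` code; a standalone sketch, using Strings in place of Hadoop `Path` objects:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class TableNotFoundMessage {
    // Mirrors the patch above: include every user-provided path in the
    // exception message. The real code maps Path::toString; plain Strings
    // stand in for Hadoop Path objects here.
    static String joinPaths(String[] userProvidedPaths) {
        return Arrays.stream(userProvidedPaths)
                .map(Object::toString)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String msg = joinPaths(new String[] {"s3://bucket/tbl1", "hdfs:///tmp/tbl2"});
        System.out.println(msg); // s3://bucket/tbl1,hdfs:///tmp/tbl2
    }
}
```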
[GitHub] [hudi] danny0405 merged pull request #9120: [HUDI-6475] Optimize TableNotFoundException message
danny0405 merged PR #9120: URL: https://github.com/apache/hudi/pull/9120 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
danny0405 commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620954248 Thanks for the contribution; it would be great if we could have details explaining the change, to help the reviewers get the context more quickly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6329) Introduce UpdateStrategy for Flink to handle conflict between clustering/resize with update
[ https://issues.apache.org/jira/browse/HUDI-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6329. Resolution: Fixed Fixed via master branch: e8b1ddd708bc2ba99144f92d7533c7200f12509f > Introduce UpdateStrategy for Flink to handle conflict between > clustering/resize with update > --- > > Key: HUDI-6329 > URL: https://issues.apache.org/jira/browse/HUDI-6329 > Project: Apache Hudi > Issue Type: Sub-task > Components: flink, index >Reporter: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6329) Introduce UpdateStrategy for Flink to handle conflict between clustering/resize with update
[ https://issues.apache.org/jira/browse/HUDI-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6329: - Fix Version/s: 0.14.0 > Introduce UpdateStrategy for Flink to handle conflict between > clustering/resize with update > --- > > Key: HUDI-6329 > URL: https://issues.apache.org/jira/browse/HUDI-6329 > Project: Apache Hudi > Issue Type: Sub-task > Components: flink, index >Reporter: Jing Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index (#9087)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new e8b1ddd708b [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index (#9087) e8b1ddd708b is described below commit e8b1ddd708bc2ba99144f92d7533c7200f12509f Author: Jing Zhang AuthorDate: Wed Jul 5 11:09:25 2023 +0800 [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index (#9087) * partitioner would detect new completed resize plan in #snapshotState * disable scheduling resize plan for insert write pipelines with consistent bucket index --- ...sistentHashingBucketClusteringPlanStrategy.java | 4 +- .../action/cluster/strategy/UpdateStrategy.java| 4 +- .../util/ConsistentHashingUpdateStrategyUtils.java | 107 +++ ...arkConsistentBucketDuplicateUpdateStrategy.java | 71 +- .../apache/hudi/configuration/OptionsResolver.java | 26 +++- .../org/apache/hudi/sink/StreamWriteFunction.java | 34 +++-- .../sink/bucket/BucketStreamWriteFunction.java | 2 +- .../sink/bucket/BucketStreamWriteOperator.java | 5 +- .../bucket/ConsistentBucketAssignFunction.java | 30 - .../ConsistentBucketStreamWriteFunction.java | 83 .../FlinkConsistentBucketUpdateStrategy.java | 150 + .../java/org/apache/hudi/sink/utils/Pipelines.java | 2 +- .../java/org/apache/hudi/util/ClusteringUtil.java | 5 +- .../org/apache/hudi/util/FlinkWriteClients.java| 8 +- .../org/apache/hudi/sink/TestWriteMergeOnRead.java | 40 ++ .../bucket/ITTestConsistentBucketStreamWrite.java | 23 +++- .../utils/BucketStreamWriteFunctionWrapper.java| 18 ++- ...ConsistentBucketStreamWriteFunctionWrapper.java | 81 +++ .../apache/hudi/sink/utils/ScalaCollector.java}| 32 +++-- .../sink/utils/StreamWriteFunctionWrapper.java | 22 --- .../test/java/org/apache/hudi/utils/TestData.java | 7 +- 21 files changed, 611 insertions(+), 143 deletions(-) diff 
--git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java index 59f9fcb81d1..49ab5f181ad 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java @@ -85,7 +85,7 @@ public abstract class BaseConsistentHashingBucketClusteringPlanStrategy p.getLeft().getPartitionPath().equals(partition)); if (isPartitionInClustering) { - LOG.info("Partition: " + partition + " is already in clustering, skip"); + LOG.info("Partition {} is already in clustering, skip.", partition); return Stream.empty(); } diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java index 4463f7887bb..1c61db4b572 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java @@ -32,8 +32,8 @@ import java.util.Set; public abstract class UpdateStrategy implements Serializable { protected final transient HoodieEngineContext engineContext; - protected final HoodieTable table; - protected final Set fileGroupsInPendingClustering; + protected HoodieTable table; + protected Set fileGroupsInPendingClustering; public UpdateStrategy(HoodieEngineContext engineContext, HoodieTable table, Set fileGroupsInPendingClustering) { this.engineContext = engineContext; diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java new file mode 100644 index 000..f8351d2fa93 --- /dev/null +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you
[GitHub] [hudi] danny0405 merged pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index
danny0405 merged PR #9087: URL: https://github.com/apache/hudi/pull/9087 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index
danny0405 commented on PR #9087: URL: https://github.com/apache/hudi/pull/9087#issuecomment-1620951822 Tests have passed: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18287=results -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6423) Incremental cleaning should consider inflight compaction instant
[ https://issues.apache.org/jira/browse/HUDI-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6423. Resolution: Fixed Fixed via master branch: 07164406c44b4092eee810710a242d092c97bd58 > Incremental cleaning should consider inflight compaction instant > > > Key: HUDI-6423 > URL: https://issues.apache.org/jira/browse/HUDI-6423 > Project: Apache Hudi > Issue Type: Improvement >Reporter: zhuanshenbsj1 >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6423] Incremental cleaning should consider inflight compaction instant (#9038)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 07164406c44 [HUDI-6423] Incremental cleaning should consider inflight compaction instant (#9038) 07164406c44 is described below commit 07164406c44b4092eee810710a242d092c97bd58 Author: zhuanshenbsj1 <34104400+zhuanshenb...@users.noreply.github.com> AuthorDate: Wed Jul 5 11:05:57 2023 +0800 [HUDI-6423] Incremental cleaning should consider inflight compaction instant (#9038) * The CleanPlanner#getEarliestCommitToRetain should consider pending compaction instants. If the pending compaction got missed under incremental cleaning mode, some files may never be cleaned when the cleaner moved to a different partition: par1 | - par2 -> dc.1 compaction.2 dc.3 | dc.4 Assumes we have 3 delta commits and 1 pending compaction commit on the timeline, if the `EarliestCommitToRetain ` was recorded to dc.3, when the dc4(or subsequent instants) triggers cleaning, the cleaner just checks the timeline with dc.3, and the compaction.2 got skipped for ever if no subsequent mutations were made to partition par1. 
- Co-authored-by: Danny Chan --- .../action/clean/CleanPlanActionExecutor.java | 1 + .../hudi/table/action/clean/CleanPlanner.java | 2 +- .../java/org/apache/hudi/table/TestCleaner.java| 183 - .../table/timeline/HoodieDefaultTimeline.java | 7 + 4 files changed, 148 insertions(+), 45 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java index ba7c71b1356..b494df42b49 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java @@ -111,6 +111,7 @@ public class CleanPlanActionExecutor extends BaseActionExecutor implements Serializable { */ public Option getEarliestCommitToRetain() { return CleanerUtils.getEarliestCommitToRetain( -hoodieTable.getMetaClient().getActiveTimeline().getCommitsTimeline(), + hoodieTable.getMetaClient().getActiveTimeline().getCommitsAndCompactionTimeline(), config.getCleanerPolicy(), config.getCleanerCommitsRetained(), Instant.now(), diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java index d1e77613691..17a12dcc7ff 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java @@ -25,6 +25,7 @@ import org.apache.hudi.avro.model.HoodieCleanerPlan; import org.apache.hudi.avro.model.HoodieRequestedReplaceMetadata; import org.apache.hudi.avro.model.HoodieRollbackMetadata; import org.apache.hudi.client.HoodieTimelineArchiver; +import org.apache.hudi.client.SparkRDDReadClient; import org.apache.hudi.client.SparkRDDWriteClient; import 
org.apache.hudi.client.WriteStatus; import org.apache.hudi.client.common.HoodieSparkEngineContext; @@ -260,6 +261,97 @@ public class TestCleaner extends HoodieCleanerTestBase { } } + /** + * Test earliest commit to retain should be earlier than first pending compaction in incremental cleaning scenarios. + * + * @throws IOException + */ + @Test + public void testEarliestInstantToRetainForPendingCompaction() throws IOException { +HoodieWriteConfig writeConfig = getConfigBuilder().withPath(basePath) +.withFileSystemViewConfig(new FileSystemViewStorageConfig.Builder() +.withEnableBackupForRemoteFileSystemView(false) +.build()) +.withCleanConfig(HoodieCleanConfig.newBuilder() +.withAutoClean(false) + .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS) +.retainCommits(1) +.build()) +.withCompactionConfig(HoodieCompactionConfig.newBuilder() +.withInlineCompaction(false) +.withMaxNumDeltaCommitsBeforeCompaction(1) +.compactionSmallFileSize(1024 * 1024 * 1024) +.build()) +.withArchivalConfig(HoodieArchivalConfig.newBuilder() +.withAutoArchive(false) +.archiveCommitsWith(2,3) +.build()) +.withEmbeddedTimelineServerEnabled(false).build(); + +
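The commit message above (the par1/par2 timeline example with dc.1, compaction.2, dc.3, dc.4) describes the core rule: the earliest commit to retain must never move past the first pending compaction, or incremental cleaning can skip that compaction's file groups forever. A minimal Java sketch of that clamping rule — an illustration of the idea only, not Hudi's `CleanerUtils` API, with a tiny stand-in for `HoodieInstant`:

```java
import java.util.List;

public class EarliestCommitToRetain {
    // Minimal illustrative instant: a sortable timestamp plus a flag marking
    // a pending (requested/inflight) compaction. Not Hudi's HoodieInstant.
    static class Instant {
        final String ts;
        final boolean pendingCompaction;
        Instant(String ts, boolean pendingCompaction) {
            this.ts = ts;
            this.pendingCompaction = pendingCompaction;
        }
    }

    // Pick the earliest commit to retain by retained-commit count, but clamp
    // it to the first pending compaction if that compaction is earlier.
    static String earliestToRetain(List<Instant> sortedTimeline, int commitsRetained) {
        int idx = Math.max(0, sortedTimeline.size() - commitsRetained);
        String byCount = sortedTimeline.get(idx).ts;
        for (Instant i : sortedTimeline) {
            if (i.pendingCompaction && i.ts.compareTo(byCount) < 0) {
                return i.ts; // clamp to the earlier pending compaction
            }
        }
        return byCount;
    }

    public static void main(String[] args) {
        // The timeline from the commit message: dc.1, compaction.2 (pending), dc.3, dc.4.
        List<Instant> tl = List.of(
                new Instant("001", false),
                new Instant("002", true),
                new Instant("003", false),
                new Instant("004", false));
        System.out.println(earliestToRetain(tl, 1)); // 002, not 004
    }
}
```

Without the clamp, retaining one commit would put the boundary at dc.4 and the pending compaction.2 would fall behind it, which is the bug the patch fixes by switching the planner to `getCommitsAndCompactionTimeline()`.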
[GitHub] [hudi] danny0405 merged pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
danny0405 merged PR #9038: URL: https://github.com/apache/hudi/pull/9038 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
hudi-bot commented on PR #9122: URL: https://github.com/apache/hudi/pull/9122#issuecomment-1620942995 ## CI report: * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18324) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1620943023 ## CI report: * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620942945 ## CI report: * 984c3d691c3e7915fb1333ee823a641098774270 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318) * ef7585ba8d32d772500f31f95f3c04bfcac046e7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Zouxxyy closed pull request #9051: [HUDI-6436] Make the function of AlterHoodieTableChangeColumnCommand …
Zouxxyy closed pull request #9051: [HUDI-6436] Make the function of AlterHoodieTableChangeColumnCommand … URL: https://github.com/apache/hudi/pull/9051 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
hudi-bot commented on PR #9122: URL: https://github.com/apache/hudi/pull/9122#issuecomment-1620937689 ## CI report: * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
hudi-bot commented on PR #9123: URL: https://github.com/apache/hudi/pull/9123#issuecomment-1620937715 ## CI report: * 7708ff75ba467e2156b6396ee2886ec645b7b44f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6479) Update release docs and quick start guide around INSERT_INTO default behavior change
sivabalan narayanan created HUDI-6479: - Summary: Update release docs and quick start guide around INSERT_INTO default behavior change Key: HUDI-6479 URL: https://issues.apache.org/jira/browse/HUDI-6479 Project: Apache Hudi Issue Type: Improvement Components: spark-sql Reporter: sivabalan narayanan With [this|https://github.com/apache/hudi/pull/9123] patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the operation underneath. Until 0.13.1, the default behavior was "upsert". In other words, if you ingest the same batch of records in commit1 and in commit2, hudi will do an upsert and will return only the latest value with snapshot read. But with this patch, we are changing the default behavior to use "insert", as the name (INSERT_INTO) signifies. So, ingesting the same batch of records in commit1 and in commit2 will result in duplicate records with snapshot read. If users override the respective config, we will honor it, but the default behavior, where none of the respective configs are overridden explicitly, will see a behavior change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6478) Simplify INSERT_INTO configs
[ https://issues.apache.org/jira/browse/HUDI-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6478: - Labels: pull-request-available (was: ) > Simplify INSERT_INTO configs > > > Key: HUDI-6478 > URL: https://issues.apache.org/jira/browse/HUDI-6478 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > We have 2 to 3 different configs in the mix for the INSERT_INTO command. Let's try to > simplify them. > > hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and > datasource.operation.type. > > Rough notes: > > hoodie.sql.bulk.insert.enable: true | false. > > hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT > STRICT: we can't re-ingest the same record again; will throw if duplicates > are found to be ingested again. > NON_STRICT: no such constraints, but has to be set along w/ bulk_insert (if > it's enabled); if not, an exception will be thrown. > UPSERT: default insert.mode (until a week back, when we switched to making > bulk_insert the default for INSERT_INTO). Will take care of de-dup; will use > OverwriteWithLatestAvroPayload (which means that we can update an existing > record across batches). > > datasource.operation.type: insert, bulk_insert, upsert > > drop.dups: Drop new incoming records if they already exist. > > Proposal: > > * We will introduce a new config named "hoodie.sql.write.operation" which > will have 3 values ("insert", "bulk_insert" and "upsert"). Default value will > be "insert" for INSERT_INTO. > ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable". > * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the operation > type is "insert", for both spark-sql and spark-ds. This will maintain > duplicates but still help w/ small file management with "insert"s.
> * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" > whose valid values are "ignore", "fail" and "drop". Make "ignore" the default. > "fail" will mimic the "STRICT" mode we support as of now. Even spark-ds users can > use the fail/STRICT behavior if need be. > ** Deprecate hoodie.datasource.insert.drop.dups. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan opened a new pull request, #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql
nsivabalan opened a new pull request, #9123: URL: https://github.com/apache/hudi/pull/9123 ### Change Logs With the intent to simplify the different config options for INSERT_INTO in spark-sql, we are doing an overhaul. We have 3 to 4 configs in the mix for INSERT_INTO: operation type, insert mode, drop duplicates, and the enable-bulk-insert config. Here is what the simplification brings in. ``` - We will introduce a new config named "hoodie.sql.write.operation" which will have 3 values ("insert", "bulk_insert" and "upsert"). Default value will be "insert" for INSERT_INTO. - Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable". - Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the operation type is "insert", for both spark-sql and spark-ds. This will maintain duplicates but still help w/ small file management with "insert"s. - Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose valid values are "ignore", "fail" and "drop". Make "ignore" the default. "fail" will mimic the "STRICT" mode we support as of now. - Deprecate hoodie.datasource.insert.drop.dups. ``` When both old and new configs are set, the new config will take effect. When only new configs are set, the new config will take effect. When neither is set, the new configs and their defaults will take effect. When only old configs are set, the old configs will take effect. Please do note that we are deprecating the use of these old configs; in 2 releases, we will completely remove them, so we recommend users migrate to the new configs. Note: "old" refers to "hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation". Behavior change: With this patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the operation underneath. Until 0.13.1, the default behavior was "upsert". In other words, if you ingest the same batch of records in commit1 and in commit2, hudi will do an upsert and will return only the latest value with snapshot read.
But with this patch, we are changing the default behavior to use "insert", as the name (INSERT_INTO) signifies. So, ingesting the same batch of records in commit1 and in commit2 will result in duplicate records with snapshot read. If users override the respective configs, we will honor them, but the default behavior, where none of the respective configs are overridden explicitly, will see a behavior change. ### Impact Usability will be improved for spark-sql users, as we have deprecated a few confusing configs and tried to align with spark datasource writes. Also, this brings in a behavior change as well. With this patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the operation underneath. Until 0.13.1, the default behavior was "upsert". In other words, if you ingest the same batch of records in commit1 and in commit2, hudi will do an upsert and will return only the latest value with snapshot read. But with this patch, we are changing the default behavior to use "insert", as the name (INSERT_INTO) signifies. So, ingesting the same batch of records in commit1 and in commit2 will result in duplicate records with snapshot read. If users override the respective configs, we will honor them, but the default behavior, where none of the respective configs are overridden explicitly, will see a behavior change. ### Risk level (write none, low medium or high below) medium ### Documentation Update We will have to call out the behavior change as part of our release docs and also update our quick start guide around the same. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
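The precedence rules spelled out in the PR description ("new config wins when both are set; the old config applies only when the new one is absent; with neither set, the new default takes effect") can be expressed as a tiny resolver. This is an illustration of the stated rules only — the mapping from legacy insert modes to operations below is an assumption for the sketch, not lifted from the patch:

```java
import java.util.Map;

public class SqlWriteOperationResolver {
    static final String NEW_KEY = "hoodie.sql.write.operation"; // new config
    static final String OLD_KEY = "hoodie.sql.insert.mode";     // deprecated config
    static final String NEW_DEFAULT = "insert";                 // new default for INSERT_INTO

    // New config wins whenever present; the deprecated insert mode is honored
    // only when the new config is absent; with neither set, the new default
    // ("insert") takes effect.
    static String resolve(Map<String, String> conf) {
        if (conf.containsKey(NEW_KEY)) {
            return conf.get(NEW_KEY);
        }
        if (conf.containsKey(OLD_KEY)) {
            // Assumed legacy mapping for illustration only:
            // UPSERT mode -> upsert operation, STRICT/NON_STRICT -> insert.
            return "UPSERT".equalsIgnoreCase(conf.get(OLD_KEY)) ? "upsert" : "insert";
        }
        return NEW_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of(NEW_KEY, "bulk_insert", OLD_KEY, "UPSERT"))); // bulk_insert
        System.out.println(resolve(Map.of(OLD_KEY, "UPSERT"))); // upsert
        System.out.println(resolve(Map.of()));                  // insert
    }
}
```

The last case is the behavior change called out in the PR: with nothing set, the resolved operation is "insert", not the pre-0.14.0 "upsert".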
[jira] [Created] (HUDI-6478) Simplify INSERT_INTO configs
sivabalan narayanan created HUDI-6478: - Summary: Simplify INSERT_INTO configs Key: HUDI-6478 URL: https://issues.apache.org/jira/browse/HUDI-6478 Project: Apache Hudi Issue Type: Improvement Components: spark-sql Reporter: sivabalan narayanan We have 2 to 3 different configs in the mix for the INSERT_INTO command. Let's try to simplify them. hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and datasource.operation.type. Rough notes: hoodie.sql.bulk.insert.enable: true | false. hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT STRICT: we can't re-ingest the same record again; will throw if duplicates are found to be ingested again. NON_STRICT: no such constraints, but has to be set along w/ bulk_insert (if it's enabled); if not, an exception will be thrown. UPSERT: default insert.mode (until a week back, when we switched to making bulk_insert the default for INSERT_INTO). Will take care of de-dup; will use OverwriteWithLatestAvroPayload (which means that we can update an existing record across batches). datasource.operation.type: insert, bulk_insert, upsert drop.dups: Drop new incoming records if they already exist. Proposal: * We will introduce a new config named "hoodie.sql.write.operation" which will have 3 values ("insert", "bulk_insert" and "upsert"). Default value will be "insert" for INSERT_INTO. ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable". * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the operation type is "insert", for both spark-sql and spark-ds. This will maintain duplicates but still help w/ small file management with "insert"s. * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose valid values are "ignore", "fail" and "drop". Make "ignore" the default. "fail" will mimic the "STRICT" mode we support as of now. Even spark-ds users can use the fail/STRICT behavior if need be. ** Deprecate hoodie.datasource.insert.drop.dups. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6477) Lazy fetching partition path & file slice when refresh in HoodieFileIndex
[ https://issues.apache.org/jira/browse/HUDI-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6477: - Labels: pull-request-available (was: ) > Lazy fetching partition path & file slice when refresh in HoodieFileIndex > - > > Key: HUDI-6477 > URL: https://issues.apache.org/jira/browse/HUDI-6477 > Project: Apache Hudi > Issue Type: Improvement >Reporter: zouxxyy >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] Zouxxyy opened a new pull request, #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…
Zouxxyy opened a new pull request, #9122: URL: https://github.com/apache/hudi/pull/9122 … HoodieFileIndex ### Change Logs Currently there is a lazy listing mechanism in `HoodieFileIndex`, but it only takes effect during initialization. We can make it take effect on refresh as well. At present, almost all spark commands in Hudi trigger a refresh, such as the DDL alter table operation, where we don't need to list files at all. ### Impact Lazy fetching of partition paths & file slices on refresh in HoodieFileIndex ### Risk level (write none, low medium or high below) medium ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6477) Lazy fetching partition path & file slice when refresh in HoodieFileIndex
zouxxyy created HUDI-6477: - Summary: Lazy fetching partition path & file slice when refresh in HoodieFileIndex Key: HUDI-6477 URL: https://issues.apache.org/jira/browse/HUDI-6477 Project: Apache Hudi Issue Type: Improvement Reporter: zouxxyy
[GitHub] [hudi] boneanxs commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
boneanxs commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620905284 You can still use dynamic partitioning, in this way:
```sql
insert overwrite hudi_cow_pt_tbl partition(dt, hh) select 13, 'a13', 1100, '2021-12-09', '12'
```
The main point is whether we consider `insert overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12'` to be a dynamic partition write. I think @leesf 's view makes sense, https://github.com/apache/hudi/pull/7365#issuecomment-1343707001

> @nsivabalan hi, here are my two cents: insert overwrite xxx values(xx,xxx) has very clear semantics, it means overwrite the entire table, insert overwrite xx partition(xx) values(xx,xxx) means insert overwrite partitions, but hudi handles overwrite partitions for overwrite table, which is a definite bug and i do not think we need to introduce a new operation for it.

Also, this change keeps the behavior consistent with Spark SQL: `insert overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12'` will overwrite the whole table.
[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620904971 ## CI report: * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN * 1d32092354e9065499631ed860a09a9c918c088d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18323)
[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering
hudi-bot commented on PR #9006: URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620904718 ## CI report: * b385ea4a4d4b7986ba27f5df352686652dc53c36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18322)
[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
flashJd commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620903128 Since we need the capability to insert overwrite the whole partitioned table, why not use a config to enable it and keep the semantics forward compatible, while not losing the dynamic partition capability?
[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
flashJd commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620899951

> @flashJd I noticed this issue before. Yes, this is a behavior change for `INSERT_OVERWRITE` without partition columns after #7365, but I think it's the right modification? if users don't specify partition columns, we'll consider it wants to overwrite all table?
>
> Spark sql also does the same way. i.e. `insert overwrite table_name values( #specify partition values)` will overwrite whole table.

1) The capability to insert overwrite a partitioned table with dynamic partitions is lost; we can only use the grammar `insert overwrite hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100` now.
2) The insert overwrite semantics are not forward compatible.
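For reference, the three grammars under discussion can be summarized as follows (table name and values reused from the examples in this thread):

```sql
-- 1) Static partition overwrite: partition values fixed in the PARTITION clause
insert overwrite hudi_cow_pt_tbl partition(dt = '2021-12-09', hh = '12') select 13, 'a13', 1100;

-- 2) Dynamic partition overwrite: partition columns listed without values,
--    resolved per row from the trailing columns of the SELECT
insert overwrite hudi_cow_pt_tbl partition(dt, hh) select 13, 'a13', 1100, '2021-12-09', '12';

-- 3) No PARTITION clause: after #7365 this overwrites the whole table
insert overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12';
```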
[GitHub] [hudi] Zouxxyy commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
Zouxxyy commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620893285

> It's great if we can add a simple test case.

done
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620864476 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18321)
[GitHub] [hudi] yihua commented on a diff in pull request #9066: [HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs
yihua commented on code in PR #9066: URL: https://github.com/apache/hudi/pull/9066#discussion_r1252417782

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadSnapshotReader.java:

```diff
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils;
+import org.apache.hudi.io.storage.HoodieFileReader;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.JobConf;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.COMPACTION_LAZY_BLOCK_READ_ENABLED_PROP;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_COMPACTION_LAZY_BLOCK_READ_ENABLED;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_MAX_DFS_STREAM_BUFFER_SIZE;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_SPILLABLE_MAP_BASE_PATH;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.MAX_DFS_STREAM_BUFFER_SIZE_PROP;
+import static org.apache.hudi.hadoop.config.HoodieRealtimeConfig.SPILLABLE_MAP_BASE_PATH_PROP;
+import static org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getBaseFileReader;
+import static org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes;
+import static org.apache.hudi.internal.schema.InternalSchema.getEmptyInternalSchema;
+
+public class HoodieMergeOnReadSnapshotReader extends AbstractRealtimeRecordReader implements Iterator, AutoCloseable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieMergeOnReadSnapshotReader.class);
+
+  private final String tableBasePath;
+  private final List logFilePaths;
+  private final String latestInstantTime;
+  private final Schema readerSchema;
+  private final JobConf jobConf;
+  private final HoodieMergedLogRecordScanner logRecordScanner;
+  private final HoodieFileReader baseFileReader;
+  private final Map logRecordsByKey;
+  private final Iterator recordsIterator;
+  private final ExternalSpillableMap mergedRecordsByKey;
+
+  public HoodieMergeOnReadSnapshotReader(String tableBasePath, String baseFilePath,
+                                         List logFilePaths,
+                                         String latestInstantTime,
+                                         Schema readerSchema,
+                                         JobConf jobConf, long start, long length, String[] hosts) throws IOException {
+    super(getRealtimeSplit(tableBasePath, baseFilePath, logFilePaths, latestInstantTime, start, length, hosts), jobConf);
+    this.tableBasePath = tableBasePath;
+    this.logFilePaths = logFilePaths;
+    this.latestInstantTime = latestInstantTime;
+    this.readerSchema = readerSchema;
+    this.jobConf = jobConf;
+    HoodieTimer timer = new HoodieTimer().startTimer();
+    this.logRecordScanner = getMergedLogRecordScanner();
+    LOG.debug("Time taken to scan log records: {}",
```
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620837762 ## CI report: * 984c3d691c3e7915fb1333ee823a641098774270 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620837706 ## CI report: * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278) * 1d32092354e9065499631ed860a09a9c918c088d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18323)
[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620833117 ## CI report: * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278) * 1d32092354e9065499631ed860a09a9c918c088d UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
nsivabalan commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1252401000

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

```diff
@@ -116,11 +134,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
         // Below config is only used if isLogCompactionEnabled is set.
         .withLogCompactionBlocksThreshold(writeConfig.getMetadataLogCompactBlocksThreshold())
         .build())
-        .withParallelism(parallelism, parallelism)
-        .withDeleteParallelism(parallelism)
-        .withRollbackParallelism(parallelism)
-        .withFinalizeWriteParallelism(parallelism)
-        .withAllowMultiWriteOnSameInstant(true)
+        .withStorageConfig(HoodieStorageConfig.newBuilder().hfileMaxFileSize(maxHFileSizeBytes)
+            .logFileMaxSize(maxLogFileSizeBytes).logFileDataBlockMaxSize(maxLogBlockSizeBytes).build())
+        .withRollbackParallelism(defaultParallelism)
+        .withFinalizeWriteParallelism(defaultParallelism)
```

Review Comment: did you remove .withAllowMultiWriteOnSameInstant(true) intentionally?

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

```diff
@@ -91,8 +108,9 @@ public static HoodieWriteConfig createMetadataWriteConfig(
         .withCleanConfig(HoodieCleanConfig.newBuilder()
             .withAsyncClean(DEFAULT_METADATA_ASYNC_CLEAN)
             .withAutoClean(false)
-            .withCleanerParallelism(parallelism)
-            .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+            .withCleanerParallelism(defaultParallelism)
+            .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS)
+            .retainFileVersions(2)
```

Review Comment: I understand it could be a larger change, but file versions make sense in general. If Uber has been running with file versions for 6+ months, we should do a round of testing on our end and can possibly proceed. But incremental cleaning may not kick in, so for large MDTs I'm wondering whether there will be any latency hit.

hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

```diff
@@ -341,7 +341,11 @@ private void ensurePartitionsLoadedCorrectly(List partitionList) {
     long beginTs = System.currentTimeMillis();
     // Not loaded yet
     try {
-      LOG.info("Building file system view for partitions " + partitionSet);
+      if (partitionSet.size() < 100) {
+        LOG.info("Building file system view for partitions: " + partitionSet);
```

Review Comment: yes, maybe we should reconsider the frequency of logging here. For example, log once every 100 partitions or so; I'm not sure we gain much by logging this for every partition.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:

```diff
@@ -537,7 +538,8 @@ public HoodieTableMetaClient getMetadataMetaClient() {
   }

   public Map stats() {
-    return metrics.map(m -> m.getStats(true, metadataMetaClient, this)).orElse(new HashMap<>());
+    Set allMetadataPartitionPaths = Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
+    return metrics.map(m -> m.getStats(true, metadataMetaClient, this, allMetadataPartitionPaths)).orElse(new HashMap<>());
```

Review Comment: HoodieMetadataMetrics.getStats(boolean detailed, HoodieTableMetaClient metaClient, HoodieTableMetadata metadata) reloads the timeline. Can we move the reload outside of the caller so that we don't reload for every MDT partition's stats?

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

```diff
@@ -176,7 +176,7 @@ private void initMetadataReader() {
   }
   try {
-    this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath());
+    this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath(), true);
```

Review Comment: The rationale is that the metadata writer itself is short-lived, just for committing one instant, so we should be good to enable re-use here? Do we even expect to see any improvement here, since this is meant just for one write to the MDT?
[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering
hudi-bot commented on PR #9006: URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620806187 ## CI report: * 775343a4b7c9d72e3476ddee84078883af27f01e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18153) * b385ea4a4d4b7986ba27f5df352686652dc53c36 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18322)
[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering
hudi-bot commented on PR #9006: URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620801120 ## CI report: * 775343a4b7c9d72e3476ddee84078883af27f01e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18153) * b385ea4a4d4b7986ba27f5df352686652dc53c36 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index
hudi-bot commented on PR #9105: URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620798243 ## CI report: * fad064d3590670a75b8f68c5eca91e059d235241 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18317)
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620798158 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 43c37c8a48763d8fdf71937fab4ccb900b313385 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18315)
[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
hudi-bot commented on PR #9121: URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620766916 ## CI report: * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18314)
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1620722497 ## CI report: * e6568126aab0b098ccaac59e137e902d7a1070c3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18309)
[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
hudi-bot commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620714045 ## CI report: * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620665909 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320) * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18321)
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620658010 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319) * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320) * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620652006 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319) * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320)
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620622515 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316) * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319) * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620617639 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * 7f04db759666f31a92888564d16216943674ac5b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312) * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316) * 035aa770c2fdeb9dcd9e91097f41904d39bca70f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
hudi-bot commented on PR #9064: URL: https://github.com/apache/hudi/pull/9064#issuecomment-1620617516 ## CI report: * b8418b74febf4551c0f79c7ebe71cf24916124e6 UNKNOWN * 3e0876320ac294a7da6c81a8b26630ed518606cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18307)
[GitHub] [hudi] GallonREX commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
GallonREX commented on issue #7925: URL: https://github.com/apache/hudi/issues/7925#issuecomment-1620580148 This is an automatic reply. Thank you for your email; I have received it and will respond as soon as possible.
[GitHub] [hudi] ad1happy2go commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
ad1happy2go commented on issue #7925: URL: https://github.com/apache/hudi/issues/7925#issuecomment-1620579931 @GallonREX The error you are getting, `Cannot resolve conflicts for overlapping writes`, normally occurs when you try to update the same file group concurrently. It should not depend on the version: even 0.12 will fail if multiple writers try to write to the same file group.
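The behavior described above — the second of two overlapping writers is rejected regardless of Hudi version — is the essence of optimistic concurrency control. The following is a toy Python sketch of that conflict check, not Hudi's actual conflict-resolution strategy; the function and field names are illustrative only:

```python
def resolve_conflicts(committed_writes, candidate):
    """Reject a candidate commit if any write committed after the
    candidate started touches one of the candidate's file groups."""
    for write in committed_writes:
        overlap = write["file_groups"] & candidate["file_groups"]
        if overlap and write["commit_time"] > candidate["start_time"]:
            raise RuntimeError(
                "Cannot resolve conflicts for overlapping writes: "
                + ", ".join(sorted(overlap)))
    return True

# Writer A committed file group fg-1 at t=5 while writer B (started at t=3)
# was still running: B passes if it only touched fg-2, but aborts on fg-1.
committed = [{"file_groups": {"fg-1"}, "commit_time": 5}]
```

Under this model, upgrading versions cannot make the conflict go away; only partitioning the writers so they touch disjoint file groups (or serializing them) can.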
[GitHub] [hudi] hudi-bot commented on pull request #8796: [HUDI-6129] Support rate limit for Spark streaming source
hudi-bot commented on PR #8796: URL: https://github.com/apache/hudi/pull/8796#issuecomment-1620576704 ## CI report: * 6c568f15e26e072d07cdb5de7e7a39fa2b9fbc6f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18308)
[GitHub] [hudi] hudi-bot commented on pull request #9120: [HUDI-6475] Optimize TableNotFoundException message
hudi-bot commented on PR #9120: URL: https://github.com/apache/hudi/pull/9120#issuecomment-1620571410 ## CI report: * ac6f163af4a9ab33b78a9304b25babc7caa90714 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18306)
[GitHub] [hudi] hudi-bot commented on pull request #9118: [HUDI-2141] Support flink write metrics
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1620520120 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * 6127808e39fcbf9e2acae98666887a455e0e926e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18304)
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620477542 ## CI report: * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298) * 984c3d691c3e7915fb1333ee823a641098774270 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620476985 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311) * 43c37c8a48763d8fdf71937fab4ccb900b313385 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18315)
[GitHub] [hudi] hudi-bot commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
hudi-bot commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620477463 ## CI report: * 72e9fc345a516c34387ba34d5fde2f8ea631b404 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18303)
[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index
hudi-bot commented on PR #9105: URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620477346 ## CI report: * fad064d3590670a75b8f68c5eca91e059d235241 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18317)
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620477273 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * 7f04db759666f31a92888564d16216943674ac5b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312) * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316)
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620467909 ## CI report: * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298) * 984c3d691c3e7915fb1333ee823a641098774270 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index
hudi-bot commented on PR #9105: URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620467764 ## CI report: * fad064d3590670a75b8f68c5eca91e059d235241 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620467693 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217) * 7f04db759666f31a92888564d16216943674ac5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312) * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620467397 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310) * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311) * 43c37c8a48763d8fdf71937fab4ccb900b313385 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
hudi-bot commented on PR #9121: URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620456769 ## CI report: * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18314)
[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
hudi-bot commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620456654 ## CI report: * 2a046240c1e7c0a18f9b57c0845298ea65b72951 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18269) * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620456458 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217) * 7f04db759666f31a92888564d16216943674ac5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312)
[GitHub] [hudi] BBency commented on issue #9094: Async Clustering failing with errors for MOR table
BBency commented on issue #9094: URL: https://github.com/apache/hudi/issues/9094#issuecomment-1620436262 Approach 1: ![image](https://github.com/apache/hudi/assets/118782050/ddd0627a-3909-4237-bbca-89965860ebb0) Approach 2: ![image](https://github.com/apache/hudi/assets/118782050/9cb6bdde-4ba2-4dc7-82dd-1bc674943da1)
[GitHub] [hudi] Alowator commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT
Alowator commented on PR #9112: URL: https://github.com/apache/hudi/pull/9112#issuecomment-1620428039 If there are no suggestions or questions, this could be merged.
[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
zhuanshenbsj1 commented on code in PR #9038: URL: https://github.com/apache/hudi/pull/9038#discussion_r1252072216 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java: ## @@ -111,14 +111,15 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) { LOG.info("Nothing to clean here. It is already clean"); return HoodieCleanerPlan.newBuilder().setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()).build(); } - LOG.info("Total Partitions to clean : " + partitionsToClean.size() + ", with policy " + config.getCleanerPolicy()); + LOG.info("Earliest commit to retain for clean : " + (earliestInstant.isPresent() ? earliestInstant.get().getTimestamp() : "null")); + LOG.info("Total partitions to clean : " + partitionsToClean.size() + ", with policy " + config.getCleanerPolicy()); int cleanerParallelism = Math.min(partitionsToClean.size(), config.getCleanerParallelism()); LOG.info("Using cleanerParallelism: " + cleanerParallelism); context.setJobStatus(this.getClass().getSimpleName(), "Generating list of file slices to be cleaned: " + config.getTableName()); Map<String, Pair<Boolean, List<CleanFileInfo>>> cleanOpsWithPartitionMeta = context - .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), cleanerParallelism) + .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean, earliestInstant)), cleanerParallelism) Review Comment: Before this change, earliestCommitToRetain was calculated twice. ![image](https://github.com/apache/hudi/assets/34104400/38b9d3bb-53bd-46c6-af9f-ebc40fce1605) Because the operation is not atomic, the partition-level calculation and the outer function's calculation can produce inconsistent results, so partition-level cleaning can exceed the outer earliestCommitToRetain. This may cause incorrect results for snapshot reads on the active timeline.
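The race zhuanshenbsj1 describes can be sketched abstractly: if the retained instant is computed once at planning time and the same value is passed to every partition-level planner, no partition can clean past what the outer plan retained. The following Python toy model (illustrative names only, not Hudi's actual CleanPlanActionExecutor; instants are modeled as integers) shows the compute-once-and-pass-down pattern the diff adopts:

```python
def plan_clean(timeline_instants, partitions, files_by_partition):
    """Compute the retained instant ONCE from a snapshot of the timeline,
    then hand that same value to every partition-level planner."""
    earliest_retained = min(timeline_instants) if timeline_instants else None
    return {p: get_delete_paths(files_by_partition[p], earliest_retained)
            for p in partitions}

def get_delete_paths(file_instants, earliest_retained):
    # Only files strictly older than the shared retained instant may be
    # cleaned; recomputing it per partition is what allowed the drift.
    if earliest_retained is None:
        return []
    return [t for t in file_instants if t < earliest_retained]
```

For example, with the timeline snapshot `[5, 6]`, partition p1 holding file versions `[1, 4, 7]` cleans only `[1, 4]`, never anything at or beyond instant 5, regardless of how the timeline advances while partitions are processed in parallel.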
[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
hudi-bot commented on PR #9121: URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620397104 ## CI report: * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
hudi-bot commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620396976 ## CI report: * 2a046240c1e7c0a18f9b57c0845298ea65b72951 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18269) * 5b52b7900c734adba70ac16da20bdc23f21b01d0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci
hudi-bot commented on PR #9095: URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620396794 ## CI report: * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217) * 7f04db759666f31a92888564d16216943674ac5b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620396492 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310) * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311)
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620380794 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * Unknown: [CANCELED](TBD) * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310) * 5b354dd07b4381c270e17001a1010141bf7086e8 UNKNOWN
[GitHub] [hudi] codope commented on pull request #9115: [HUDI-6469] Revert HUDI-6311
codope commented on PR #9115: URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620378064 > Hi @jonvex Can you elaborate a little more why to revert the changes? @danny0405 This reverts part of #8875, i.e., it reverts the behavior change that made spark-sql insert into use bulk insert. With this revert, insert into goes back to upsert. We plan to add some new configs and deprecate the existing sql insert mode config. I've fixed all the test failures; we can land this once the CI is green.
[jira] [Updated] (HUDI-6476) Improve the performance of getAllPartitionPaths
[ https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6476: Labels: pull-request-available (was: ) > Improve the performance of getAllPartitionPaths > Key: HUDI-6476 > URL: https://issues.apache.org/jira/browse/HUDI-6476 > Project: Apache Hudi > Issue Type: Improvement > Components: hudi-utilities > Reporter: Wechar > Priority: Major > Labels: pull-request-available > Attachments: After improvement.png, Before improvement.png > Currently Hudi lists the full status of every file under the Hudi table directory; this can be avoided to improve the performance of getAllPartitionPaths, especially for non-partitioned tables with many files. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] wecharyu opened a new pull request, #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths
wecharyu opened a new pull request, #9121: URL: https://github.com/apache/hudi/pull/9121 ### Change Logs Currently Hudi lists the full status of every file under the Hudi table directory; this can be avoided to improve the performance of getAllPartitionPaths, especially for non-partitioned tables with many files. What we change in this patch: - reduce a stage in `getPartitionPathWithPathPrefix()` - only check directories to find the PartitionMetadata ### Impact Performance improvement. ### Risk level (write none, low medium or high below) None. ### Documentation Update None. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
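The "only check directories" idea in the PR description can be illustrated with a small sketch: a directory is a Hudi partition iff it contains the `.hoodie_partition_metadata` marker file, so partition discovery only needs directory traversal plus a name check, never the per-file statuses of the data files themselves. This Python toy is a conceptual analogue, not the actual Java implementation in `FileSystemBackedTableMetadata`:

```python
import os

def get_all_partition_paths(base_path):
    """Discover partitions by walking directories and checking for the
    .hoodie_partition_metadata marker; data-file statuses are never fetched."""
    partitions = []
    for root, dirs, files in os.walk(base_path):
        if ".hoodie_partition_metadata" in files:
            partitions.append(os.path.relpath(root, base_path))
            dirs[:] = []  # partitions are leaf directories; stop descending
    return partitions
```

For a non-partitioned table with many files, the marker sits directly under the base path, so the walk terminates after a single directory listing instead of statting every data file.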
[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand
hudi-bot commented on PR #9116: URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620367087 ## CI report: * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298)