Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]
hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859065053

## CI report:

* 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
* a6e71f3ef494421c08b895bf9bf0aa9c0395e351 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21546)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859045969

## CI report:

* a4a30405db038272466adea3ee51c2923fcd944e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21547)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859037963

## CI report:

* 53875b1c03bc5d2c761e57c998a3fb1c2e038b01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21548)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859037954

## CI report:

* dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
* a4a30405db038272466adea3ee51c2923fcd944e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21547)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]
hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859037918

## CI report:

* 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
* c79d1211484ff1d8751cee0a01e60ca2b8cf84bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21453)
* a6e71f3ef494421c08b895bf9bf0aa9c0395e351 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21546)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859032793

## CI report:

* 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)
* 53875b1c03bc5d2c761e57c998a3fb1c2e038b01 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859032784

## CI report:

* dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
* a4a30405db038272466adea3ee51c2923fcd944e UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]
hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859032737

## CI report:

* 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
* c79d1211484ff1d8751cee0a01e60ca2b8cf84bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21453)
* a6e71f3ef494421c08b895bf9bf0aa9c0395e351 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]
hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859029231

## CI report:

* cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
* 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21544)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
(hudi) branch master updated: [HUDI-7183] Fix static insert overwrite partitions issue (#10254)
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 27ac42818bc [HUDI-7183] Fix static insert overwrite partitions issue (#10254)

27ac42818bc is described below

commit 27ac42818bcf3768c2a6742ac0edfcb79c253e52
Author: Wechar Yu
AuthorDate: Sun Dec 17 11:32:30 2023 +0800

    [HUDI-7183] Fix static insert overwrite partitions issue (#10254)
---
 .../SparkInsertOverwriteCommitActionExecutor.java  | 17 ++--
 ...setBulkInsertOverwriteCommitActionExecutor.java | 18 ++--
 .../sql/catalyst/catalog/HoodieCatalogTable.scala  |  7 +-
 .../spark/sql/hudi/ProvidesHoodieConfig.scala      | 83 ++
 .../command/InsertIntoHoodieTableCommand.scala     | 32 +--
 .../apache/spark/sql/hudi/TestInsertTable.scala    | 98 ++
 6 files changed, 177 insertions(+), 78 deletions(-)

```diff
diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
index d12efab229d..788e1040783 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
@@ -36,7 +36,7 @@ import org.apache.hudi.table.action.HoodieWriteMetadata;

 import org.apache.spark.Partitioner;

-import java.util.Collections;
+import java.util.Arrays;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
@@ -81,14 +81,15 @@ public class SparkInsertOverwriteCommitActionExecutor

   @Override
   protected Map> getPartitionToReplacedFileIds(HoodieWriteMetadata> writeMetadata) {
-    if (writeMetadata.getWriteStatuses().isEmpty()) {
-      String staticOverwritePartition = config.getStringOrDefault(HoodieInternalConfig.STATIC_OVERWRITE_PARTITION_PATHS);
-      if (StringUtils.isNullOrEmpty(staticOverwritePartition)) {
-        return Collections.emptyMap();
-      } else {
-        return Collections.singletonMap(staticOverwritePartition, getAllExistingFileIds(staticOverwritePartition));
-      }
+    String staticOverwritePartition = config.getStringOrDefault(HoodieInternalConfig.STATIC_OVERWRITE_PARTITION_PATHS);
+    if (StringUtils.nonEmpty(staticOverwritePartition)) {
+      // static insert overwrite partitions
+      List partitionPaths = Arrays.asList(staticOverwritePartition.split(","));
+      context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of matching static partitions");
+      return HoodieJavaPairRDD.getJavaPairRDD(context.parallelize(partitionPaths, partitionPaths.size()).mapToPair(
+          partitionPath -> Pair.of(partitionPath, getAllExistingFileIds(partitionPath.collectAsMap();
     } else {
+      // dynamic insert overwrite partitions
       return HoodieJavaPairRDD.getJavaPairRDD(writeMetadata.getWriteStatuses().map(status -> status.getStat().getPartitionPath()).distinct().mapToPair(partitionPath -> Pair.of(partitionPath, getAllExistingFileIds(partitionPath.collectAsMap();
     }
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
index c1fd952b106..67ba2027cbd 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
@@ -26,6 +26,7 @@ import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieInternalConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
@@ -33,7 +34,7 @@ import org.apache.hudi.data.HoodieJavaPairRDD;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;

-import java.util.Collections;
+import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
@@ -60,14 +61,15 @@ public class DatasetBulkInsertOverwriteCommitActionExecutor extends BaseDatasetB

   @Override
   protected Map> getPartitionToReplacedFileIds(HoodieData writeStatuses) {
-    if (writeStatuses.isEmpty()) {
-      String staticOverwritePartition = writeConfig.
```
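The fix above changes `getPartitionToReplacedFileIds` so that statically specified overwrite partitions are always replaced, even when the write produced no write statuses (an empty incoming batch). A minimal, self-contained Java sketch of that branch logic, with Hudi's RDD and config types replaced by plain collections and `getAllExistingFileIds` stubbed out:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StaticOverwriteSketch {

  // Stand-in for getAllExistingFileIds(partitionPath): in Hudi this lists
  // the file IDs currently present in the partition.
  static List<String> existingFileIds(String partitionPath) {
    return List.of(partitionPath + "-file-1", partitionPath + "-file-2");
  }

  // Mirrors the fixed branch: when the static overwrite partition config is
  // non-empty, every listed partition is replaced, whether or not the write
  // produced any statuses.
  static Map<String, List<String>> partitionToReplacedFileIds(String staticOverwritePartitions) {
    if (staticOverwritePartitions != null && !staticOverwritePartitions.isEmpty()) {
      return Arrays.stream(staticOverwritePartitions.split(","))
          .collect(Collectors.toMap(Function.identity(), StaticOverwriteSketch::existingFileIds));
    }
    // Dynamic case: the real code derives the partitions from the write statuses instead.
    return Map.of();
  }

  public static void main(String[] args) {
    System.out.println(partitionToReplacedFileIds("2023-10-01,2023-10-02").keySet());
  }
}
```

The old code short-circuited on empty write statuses, which is why a static `INSERT OVERWRITE` of an empty batch failed to replace the named partitions; the sketch keys the decision on the config value alone.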
Re: [PR] [HUDI-7183] Fix static insert overwrite partitions issue [hudi]
bvaradar merged PR #10254:
URL: https://github.com/apache/hudi/pull/10254
Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]
hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859021797

## CI report:

* cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
* 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21544)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]
hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859020883

## CI report:

* cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
* 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859020875

## CI report:

* dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]
hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859014631

## CI report:

* cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync
[ https://issues.apache.org/jira/browse/HUDI-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7237:
---------------------------------
    Labels: pull-request-available  (was: )

> Minor Improvements to Schema Handling in Delta Sync
> ---------------------------------------------------
>
>                 Key: HUDI-7237
>                 URL: https://issues.apache.org/jira/browse/HUDI-7237
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Timothy Brown
>            Priority: Major
>              Labels: pull-request-available
>
> There are two minor items that we have run into running DeltaStreamer in production.
> 1. The schema is fetched more often than it needs to be, which can put unnecessary load on schema providers or increase file system reads.
> 2. SchemaProviders that return null target schemas on empty batches cause null schema values in commits, leading to unexpected issues later.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]
the-other-tim-brown opened a new pull request, #10342:
URL: https://github.com/apache/hudi/pull/10342

### Change Logs

- Reduces the number of calls to the schema provider by properly using `orElseGet` in one case and reusing the existing writer config instead of recomputing it
- Reuses the existing hadoop configuration instance instead of creating a new one when possible
- Handles the case where the schema provider returns null and there is an empty batch

### Impact

- Reduces the number of calls to schema providers, which can reduce load on the provider or the cost of accessing the file system
- Fixes an edge case where an empty batch could write a commit with a null value as the schema

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
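The first change-log item rests on a standard `java.util.Optional` distinction: `orElse` always evaluates its argument, while `orElseGet` invokes its supplier only when the value is absent. A small illustrative sketch (the schema-provider call here is a stand-in, not Hudi's actual API) showing why this matters when the fallback is an expensive schema fetch:

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class LazySchemaFetch {
  static final AtomicInteger fetches = new AtomicInteger();

  // Stand-in for an expensive SchemaProvider call (network or file system read).
  static String fetchSchemaFromProvider() {
    fetches.incrementAndGet();
    return "{\"type\":\"record\"}";
  }

  public static void main(String[] args) {
    Optional<String> cached = Optional.of("cached-schema");

    // orElse evaluates its argument eagerly: the provider is hit even
    // though a cached schema is present.
    cached.orElse(fetchSchemaFromProvider());

    // orElseGet invokes the supplier only when the Optional is empty,
    // so the cached value short-circuits the fetch.
    cached.orElseGet(LazySchemaFetch::fetchSchemaFromProvider);

    System.out.println(fetches.get()); // prints 1: only the orElse call fetched
  }
}
```

When the value is usually present, switching the fallback to `orElseGet` removes one provider round-trip per call site.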
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859012003

## CI report:

* 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
[jira] [Created] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync
Timothy Brown created HUDI-7237:
-----------------------------------

             Summary: Minor Improvements to Schema Handling in Delta Sync
                 Key: HUDI-7237
                 URL: https://issues.apache.org/jira/browse/HUDI-7237
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Timothy Brown

There are two minor items that we have run into running DeltaStreamer in production.

1. The schema is fetched more often than it needs to be, which can put unnecessary load on schema providers or increase file system reads.
2. SchemaProviders that return null target schemas on empty batches cause null schema values in commits, leading to unexpected issues later.
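Item 2 describes an empty batch whose schema provider returns null, which then ends up recorded in the commit metadata. A hedged sketch of the kind of guard the issue implies (the names are hypothetical and the real Hudi code paths differ): fall back to the last known table schema rather than committing a null schema:

```java
import java.util.Optional;

public class NullSchemaGuard {
  // Hypothetical guard, not Hudi's actual code: if the provider hands back
  // null on an empty batch, fall back to the last known table schema
  // instead of recording a null schema in the commit metadata.
  static String resolveWriteSchema(String providerSchema, Optional<String> latestTableSchema) {
    if (providerSchema != null) {
      return providerSchema;
    }
    // Empty string here means "no schema recorded", never the string "null".
    return latestTableSchema.orElse("");
  }

  public static void main(String[] args) {
    System.out.println(resolveWriteSchema(null, Optional.of("prev-table-schema"))); // prints prev-table-schema
  }
}
```

The point of the guard is that downstream readers of the commit metadata never observe a literal null schema, even when a batch carried no records.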
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859003258

## CI report:

* b319a7c193a79252dd18c5df782667d96c47a64a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541)
* 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859001927

## CI report:

* b319a7c193a79252dd18c5df782667d96c47a64a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541)
* 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859001909

## CI report:

* 380da64e8a28ae34f4315f6df34367f7f0c74cdd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21537)
* dd3845421a5be07fc18939cd23537a23662f7515 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859000751

## CI report:

* 380da64e8a28ae34f4315f6df34367f7f0c74cdd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21537)
* dd3845421a5be07fc18939cd23537a23662f7515 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

* `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7236] Allow MIT to change partition path when using global index [hudi]
nsivabalan commented on code in PR #10337:
URL: https://github.com/apache/hudi/pull/10337#discussion_r1428851434

##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##########

```scala
@@ -182,6 +182,82 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
     }
   }

+
+  /**
+   * Test MIT with global index.
+   * HUDI-7131
+   */
+  test("Test HUDI-7131") {
```

Review Comment:
   lets name the test w/ some context. "Test MergeInto with global index"

##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##########

```scala
@@ -261,6 +337,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo
     })
   }

+  test("Test MergeInto with changing partition") {
+    withRecordType()(withTempDir { tmp =>
+      withSQLConf("hoodie.index.type" -> "GLOBAL_SIMPLE") {
+        val sourceTable = generateTableName
+        val targetTable = generateTableName
+        spark.sql(
+          s"""
+             | create table $sourceTable
+             | using parquet
+             | partitioned by (partition)
+             | location '${tmp.getCanonicalPath}/$sourceTable'
+             | as
+             | select
+             |   1 as id,
+             |   2 as version,
+             |   'yes' as mergeCond,
+             |   '2023-10-02' as partition
+          """.stripMargin
+        )
+        spark.sql(s"insert into $sourceTable values(2, 2, 'no', '2023-10-02')")
+        spark.sql(s"insert into $sourceTable values(3, 1, 'insert', '2023-10-01')")
+
+        spark.sql(
+          s"""
+             | create table $targetTable (
+             |   id int,
+             |   version int,
+             |   mergeCond string,
+             |   partition string
+             | ) using hudi
+             | partitioned by (partition)
+             | tblproperties (
+             |   'primaryKey' = 'id',
+             |   'type' = 'cow'
+             | )
+             | location '${tmp.getCanonicalPath}/$targetTable'
+          """.stripMargin)
+
+        spark.sql(s"insert into $targetTable values(1, 1, 'insert', '2023-10-01')")
+        spark.sql(s"insert into $targetTable values(2, 1, 'insert', '2023-10-01')")
+
+        spark.sql(
+          s"""
+             | merge into $targetTable t using
+             | (select * from $sourceTable) as s
+             | on t.id=s.id
+             | when matched and s.mergeCond = 'yes' then update set *
+             | when not matched then insert *
+          """.stripMargin)
+        checkAnswer(s"select id,version,partition from $targetTable order by id")(
+          Seq(1, 2, "2023-10-02"),
+          Seq(2, 1, "2023-10-01"),
```

Review Comment:
   why s.mergeCond is required in the match clause?

##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########

```java
@@ -243,6 +247,56 @@ private static HoodieData> getExistingRecords(
         .getMergedRecords().iterator());
   }

+  /**
+   * getExistingRecords will create records with expression payload so we overwrite the config.
+   * Additionally, we don't want to restore this value because the write will fail later on.
+   * We also need the keygenerator so we can figure out the partition path after expression payload
+   * evaluates the merge.
+   */
+  private static BaseKeyGenerator maybeGetKeygenAndUpdatePayload(HoodieWriteConfig config) {
+    if (config.getPayloadClass().equals("org.apache.spark.sql.hudi.command.payload.ExpressionPayload")) {
+      config.setValue(HoodiePayloadConfig.PAYLOAD_CLASS_NAME.key(), HoodiePayloadConfig.PAYLOAD_CLASS_NAME.defaultValue());
+      try {
+        return (BaseKeyGenerator) HoodieAvroKeyGeneratorFactory.createKeyGenerator(config.getProps());
+      } catch (IOException e) {
+        throw new RuntimeException("KeyGenerator must inherit from BaseKeyGenerator to update a records partition path using spark sql merge into", e);
+      }
+    }
+    return null;
+  }
+
+  /**
+   * Special merge handling for MIT
+   * We need to wait until after merging before we can add meta fields because
+   * ExpressionPayload does not allow rewriting
+   */
+  private static Option> mergeIncomingWithExistingRecordWithExpressionPayload(
+      HoodieRecord incoming,
+      HoodieRecord existing,
+      Schema writeSchema,
+      Schema existingSchema,
+      Schema writeSchemaWithMetaFields,
+      HoodieWriteConfig config,
+      HoodieRecordMerger recordMerger,
+      BaseKeyGenerator keyGenerator) throws IOException {
+    Option> mergeResult = recordMerger.merge(existing, existingSchema,
```

Review Comment:
   so, we don't need meta fields for this merge, is it? but for the merge in L323, we need the meta fields prepended.

##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##########

```scala
@@ -261,6 +337,65 @@ cl
```
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]
nsivabalan commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1858857129

@linliu-code: yes, we do have a fix for master here: https://github.com/apache/hudi/pull/10340
Re: [I] ceshi [hudi]
zoomake closed issue #10341: ceshi
URL: https://github.com/apache/hudi/issues/10341
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1858800572

> Overall it looks solid to me. Only concern is that this code is very critical and complex, do we have enough tests to ensure the correctness?

We have all the tests for each use case in `TestInputFormat` and `TestIncrementalInputSplits`. They are currently Flink UTs; at some point we need to refactor the tests into `hudi-common` (an engine-agnostic data set validation).
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428788123 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## @@ -0,0 +1,429 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.read; + +import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.log.InstantRange; +import org.apache.hudi.common.table.timeline.CompletionTimeQueryView; +import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.VisibleForTesting; +import org.apache.hudi.common.util.collection.Pair; + +import javax.annotation.Nullable; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.IntStream; +import java.util.stream.Stream; + +/** + * Analyzer for incremental queries. + * + * The analyzer can supply info about the incremental queries including: + * + * The archived instant candidates; + * The active instant candidates; + * The instant filtering predicate, e.g the instant range; + * Whether the query starts from the earliest; + * Whether the query ends to the latest; + * The max completion time used for fs view file slice version filtering. 
+ *
+ * Criteria for different query ranges:
+ *
+ * Query Range          | File Handles Decoding                                                     | Instant Filtering Predicate
+ * ---------------------|---------------------------------------------------------------------------|----------------------------
+ * [earliest, _]        | The latest snapshot files from table metadata                             | _
+ * [earliest, endTime]  | The latest snapshot files from table metadata                             | '_hoodie_commit_time' in setA, where setA is the set of all instants completed before or on 'endTime'
+ * [_, _]               | The latest completed instant metadata                                     | '_hoodie_commit_time' = i_n, where i_n is the latest completed instant
+ * [_, endTime]         | i) find the last completed instant i_n before or on 'endTime'; ii) read the latest snapshot from table metadata if i_n is archived, or the commit metadata if it is still active | '_hoodie_commit_time' = i_n
+ * [startTime, _]       | i) find the instant set setA of all instants completed after or on 'startTime'; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active | '_hoodie_commit_time' in setA
+ * [startTime, endTime] | i) find the instant set setA of all instants completed in the given time range; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active | '_hoodie_commit_time' in setA
+ *
+ * A range type is required for analyzing the query so that the query range boundary inclusiveness has clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the latest snapshot if there are files missing from decoding the commit metadata.
+ */ +public class IncrementalQueryAnalyzer { + public static final String START_COMMIT_EARLIEST = "earliest"; + + private final HoodieTableMetaClient metaClient; + private final Option<String> startTime; + private final Option<String> endTime; + private final InstantRange.RangeType rangeType; + private final boolean skipCompaction; + private final boolean skipClustering; + private final int limit; + + private IncrementalQueryAnalyzer( + HoodieTableMetaClient metaClient, + String startTime, + String endTime, Review Comment: It could be, and that would induce an empty dataset.
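The range-type requirement called out in the javadoc above can be illustrated with a minimal sketch. The names below (`RangeCheck`, `RangeType`, `isInRange`) are illustrative assumptions, not Hudi's actual `InstantRange` API; the point is only how a declared range type resolves boundary inclusiveness:

```java
// Minimal sketch: how a range type makes query boundary inclusiveness explicit.
// RangeCheck and its members are illustrative names, not Hudi's InstantRange API.
public class RangeCheck {
  public enum RangeType { OPEN_CLOSE, CLOSE_CLOSE }

  // Instant times are fixed-width timestamp strings, so lexicographic
  // comparison matches chronological order.
  public static boolean isInRange(String instant, String start, String end, RangeType type) {
    boolean afterStart = type == RangeType.CLOSE_CLOSE
        ? instant.compareTo(start) >= 0   // start boundary inclusive
        : instant.compareTo(start) > 0;   // start boundary exclusive
    return afterStart && instant.compareTo(end) <= 0; // end boundary inclusive either way
  }
}
```

With CLOSE_CLOSE, an instant equal to the start boundary is part of the query result; with OPEN_CLOSE it is excluded, which matters when a consumer resumes from the last instant it already read.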
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787998 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java: ## @@ -175,42 +187,111 @@ public Option<String> getCompletionTime(String startTime) { * * By default, assumes there is at most 1 day of duration for an instant to accelerate the queries. * - * @param startCompletionTime The start completion time. - * @param endCompletionTime The end completion time. + * @param readTimeline The read timeline. + * @param rangeStart The query range start completion time. + * @param rangeEnd The query range end completion time. + * @param rangeType The range type. * - * @return The instant time set. + * @return The sorted instant time list. */ - public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime) { + public List<String> getStartTimes( + HoodieTimeline readTimeline, + Option<String> rangeStart, + Option<String> rangeEnd, + InstantRange.RangeType rangeType) { // assumes any instant/transaction lasts at most 1 day to optimize the query efficiency. -return getStartTimeSet(startCompletionTime, endCompletionTime, s -> HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY)); +return getStartTimes(readTimeline, rangeStart, rangeEnd, rangeType, s -> HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY)); } /** * Queries the instant start time with given completion time range. * - * @param startCompletionTime The start completion time. - * @param endCompletionTime The end completion time. - * @param earliestStartTimeFunc The function to generate the earliest start time boundary - * with the minimum completion time {@code startCompletionTime}. + * @param rangeStart The query range start completion time. + * @param rangeEnd The query range end completion time. + * @param earliestInstantTimeFunc The function to generate the earliest start time boundary + * with the minimum completion time.
* - * @return The instant time set. + * @return The sorted instant time list. */ - public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime, Function<String, String> earliestStartTimeFunc) { -String startInstant = earliestStartTimeFunc.apply(startCompletionTime); + @VisibleForTesting + public List<String> getStartTimes( + String rangeStart, + String rangeEnd, + Function<String, String> earliestInstantTimeFunc) { +return getStartTimes(metaClient.getCommitsTimeline().filterCompletedInstants(), Option.ofNullable(rangeStart), Option.ofNullable(rangeEnd), +InstantRange.RangeType.CLOSE_CLOSE, earliestInstantTimeFunc); + } + + /** + * Queries the instant start time with given completion time range. + * + * @param readTimeline The read timeline. + * @param rangeStart The query range start completion time. + * @param rangeEnd The query range end completion time. + * @param rangeType The range type. + * @param earliestInstantTimeFunc The function to generate the earliest start time boundary + * with the minimum completion time. + * + * @return The sorted instant time list. + */ + public List<String> getStartTimes( Review Comment: The completion time is user specified; the main purpose of this class is transferring the completion time range into an instant time range. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
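The class's purpose described in the review comment, mapping a completion-time range to the matching instant (start) times, can be sketched as follows. The in-memory `TreeMap` and the class name `CompletionTimeSketch` are assumptions standing in for the real timeline lookup; only the core idea of `getStartTimes` is shown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hedged sketch of the core idea behind CompletionTimeQueryView.getStartTimes:
// translate a *completion time* range into the matching *instant start time* list.
// The in-memory TreeMap stands in for the real timeline lookup (an assumption).
public class CompletionTimeSketch {
  // completion time -> instant start time, ordered by completion time
  private final TreeMap<String, String> completionToStart = new TreeMap<>();

  public void record(String startTime, String completionTime) {
    completionToStart.put(completionTime, startTime);
  }

  // Closed-closed completion-time range -> sorted instant start-time list.
  public List<String> getStartTimes(String rangeStart, String rangeEnd) {
    List<String> result = new ArrayList<>(
        completionToStart.subMap(rangeStart, true, rangeEnd, true).values());
    result.sort(String::compareTo);
    return result;
  }
}
```

Note that an instant which *started* before `rangeStart` can still be returned if it *completed* inside the range, which is exactly why the query is keyed by completion time rather than start time.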
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787710 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## (the quoted diff repeats the IncrementalQueryAnalyzer.java excerpt shown above)
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787379 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## (the quoted diff repeats the IncrementalQueryAnalyzer.java excerpt shown above)
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787494 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## (the quoted diff repeats the IncrementalQueryAnalyzer.java excerpt shown above)
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787116 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## (the quoted diff repeats the IncrementalQueryAnalyzer.java excerpt shown above)
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786843 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java: ## @@ -166,6 +169,28 @@ public boolean isInRange(String instant) { } } + /** + * Composition of multiple instant ranges in disjunctive form. Review Comment: Yeah, `conjunctive` and `disjunctive` are terminology from relational algebra (SQL); `disjunctive` usually means `OR`.
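The disjunctive (OR) composition discussed in this comment can be sketched with plain predicates. The class and method names below (`DisjunctiveRanges`, `closedRange`, `anyOf`) are illustrative, not Hudi's actual `InstantRange` API:

```java
import java.util.List;
import java.util.function.Predicate;

// Hedged sketch of a disjunctive (OR) composition of instant ranges, mirroring
// the idea discussed for InstantRange. Names here are illustrative only.
public class DisjunctiveRanges {
  // A single closed range: instant is within [start, end], both boundaries inclusive.
  public static Predicate<String> closedRange(String start, String end) {
    return instant -> instant.compareTo(start) >= 0 && instant.compareTo(end) <= 0;
  }

  // Disjunctive composition: the instant matches if ANY member range matches.
  public static Predicate<String> anyOf(List<Predicate<String>> ranges) {
    return instant -> ranges.stream().anyMatch(range -> range.test(instant));
  }
}
```

A conjunctive (AND) variant would instead use `allMatch`, requiring the instant to satisfy every member range at once.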
Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]
danny0405 commented on code in PR #10255: URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786517 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java: ## (the quoted diff repeats the IncrementalQueryAnalyzer.java excerpt shown above) +public class IncrementalQueryAnalyzer { Review Comment: Kind of, we can rename it to a better one.
Re: [I] [SUPPORT] Spark SQL insert to write hudI table. Metadata sync failed. [hudi]
AnchorAnim commented on issue #4903: URL: https://github.com/apache/hudi/issues/4903#issuecomment-1858789575 May I ask how you solved this problem at the time?
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340: URL: https://github.com/apache/hudi/pull/10340#issuecomment-1858776223 ## CI report: * b319a7c193a79252dd18c5df782667d96c47a64a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build