[I] [SUPPORT] Duplicate data in base file of MOR table [hudi]
wqwl611 opened a new issue, #10882: URL: https://github.com/apache/hudi/issues/10882 I want to upgrade Hudi from 0.11.1 to 0.13.1, but I encountered the problem of duplicate data. I have never encountered it before with the same configuration. https://github.com/apache/hudi/assets/67826098/9f10b00b-c47f-47dd-959c-1941998614d5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DOCS] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on PR #10856: URL: https://github.com/apache/hudi/pull/10856#issuecomment-2005976263 Changed all usages of `cleaner` to `clean`.
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1529818335

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
@@ -118,28 +120,32 @@ public class HoodieCleanConfig extends HoodieConfig {
       + "the minimum number of file slices to retain in each file group, during cleaning.");

   public static final ConfigProperty CLEAN_TRIGGER_STRATEGY = ConfigProperty
-      .key("hoodie.clean.trigger.strategy")
+      .key("hoodie.cleaner.trigger.strategy")
       .defaultValue(CleaningTriggerStrategy.NUM_COMMITS.name())
+      .withAlternatives("hoodie.clean.trigger.strategy")
       .markAdvanced()
       .withDocumentation(CleaningTriggerStrategy.class);

   public static final ConfigProperty CLEAN_MAX_COMMITS = ConfigProperty
-      .key("hoodie.clean.max.commits")
+      .key("hoodie.cleaner.trigger.max.commits")
       .defaultValue("1")
+      .withAlternatives("hoodie.clean.max.commits")
       .markAdvanced()
       .withDocumentation("Number of commits after the last clean operation, before scheduling of a new clean is attempted.");

   public static final ConfigProperty CLEANER_INCREMENTAL_MODE_ENABLE = ConfigProperty
-      .key("hoodie.cleaner.incremental.mode")
+      .key("hoodie.cleaner.incremental.enabled")
       .defaultValue("true")
+      .withAlternatives("hoodie.cleaner.incremental.mode")
       .markAdvanced()
       .withDocumentation("When enabled, the plans for each cleaner service run is computed incrementally off the events "
           + "in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full "
           + "table for each planning (even with a metadata table).");

   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = ConfigProperty
-      .key("hoodie.cleaner.policy.failed.writes")
+      .key("hoodie.cleaner.failed.writes.policy")

Review Comment: It makes sense, changed all `cleaner` to `clean`, and updated the overview table in this MR description.
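The renamed keys stay backward compatible because each new key registers its old name as an alternative that is consulted when the new key is absent. Below is a minimal, self-contained sketch of that fallback-lookup pattern; the `resolve` helper and plain `Map`-based properties are illustrative stand-ins, not Hudi's actual `ConfigProperty` API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FallbackKeyDemo {
    // Resolve a config value by its current key, falling back to older
    // alternative keys, then to a default -- roughly the behavior that
    // withAlternatives() provides (sketched here, not the real API).
    static String resolve(Map<String, String> props, String key,
                          List<String> alternatives, String defaultValue) {
        if (props.containsKey(key)) {
            return props.get(key);
        }
        for (String alt : alternatives) {
            if (props.containsKey(alt)) {
                return props.get(alt);
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // A user still sets the pre-rename key:
        props.put("hoodie.clean.trigger.strategy", "NUM_COMMITS");

        // Lookup by the new key honors the old one as a fallback.
        String value = resolve(props, "hoodie.cleaner.trigger.strategy",
                List.of("hoodie.clean.trigger.strategy"), "NUM_COMMITS");
        System.out.println(value); // old key is honored
    }
}
```

The new key always wins when both are set, which is why the PR discussion above cares about which spelling becomes the primary key.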
Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]
nsivabalan commented on PR #10877: URL: https://github.com/apache/hudi/pull/10877#issuecomment-2005925712 yes, we might need to fix all bundles. @ad1happy2go : can you try out the patch w/ spark and utilities bundle and let us know if we are good here.
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
hudi-bot commented on PR #10874: URL: https://github.com/apache/hudi/pull/10874#issuecomment-2005869170

## CI report:

* 46e36c45556766aea812b45b8f4fa7aec27e9bc0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22932)
* 2b2014449a7254f2d4e733ea281b9b82bf2b5158 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22950)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005868894

## CI report:

* 054e183e0ccfaf58d0034ea76c322b5f166859b8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22949)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
hudi-bot commented on PR #10874: URL: https://github.com/apache/hudi/pull/10874#issuecomment-2005849471

## CI report:

* 46e36c45556766aea812b45b8f4fa7aec27e9bc0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22932)
* 2b2014449a7254f2d4e733ea281b9b82bf2b5158 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
danny0405 commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1529748742

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
@@ -118,28 +120,32 @@ public class HoodieCleanConfig extends HoodieConfig {
       + "the minimum number of file slices to retain in each file group, during cleaning.");

   public static final ConfigProperty CLEAN_TRIGGER_STRATEGY = ConfigProperty
-      .key("hoodie.clean.trigger.strategy")
+      .key("hoodie.cleaner.trigger.strategy")
       .defaultValue(CleaningTriggerStrategy.NUM_COMMITS.name())
+      .withAlternatives("hoodie.clean.trigger.strategy")
       .markAdvanced()
       .withDocumentation(CleaningTriggerStrategy.class);

   public static final ConfigProperty CLEAN_MAX_COMMITS = ConfigProperty
-      .key("hoodie.clean.max.commits")
+      .key("hoodie.cleaner.trigger.max.commits")
       .defaultValue("1")
+      .withAlternatives("hoodie.clean.max.commits")
       .markAdvanced()
       .withDocumentation("Number of commits after the last clean operation, before scheduling of a new clean is attempted.");

   public static final ConfigProperty CLEANER_INCREMENTAL_MODE_ENABLE = ConfigProperty
-      .key("hoodie.cleaner.incremental.mode")
+      .key("hoodie.cleaner.incremental.enabled")
       .defaultValue("true")
+      .withAlternatives("hoodie.cleaner.incremental.mode")
       .markAdvanced()
       .withDocumentation("When enabled, the plans for each cleaner service run is computed incrementally off the events "
           + "in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full "
           + "table for each planning (even with a metadata table).");

   public static final ConfigProperty FAILED_WRITES_CLEANER_POLICY = ConfigProperty
-      .key("hoodie.cleaner.policy.failed.writes")
+      .key("hoodie.cleaner.failed.writes.policy")

Review Comment: Do we prefer `clean` or `cleaner` or `cleaning`? Maybe just stay with `clean`.
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
suryaprasanna commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529733763

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan
     writeClient.lazyRollbackFailedIndexing();
   }

-  /**
-   * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents
-    //    any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    //    instant before c4 is synced with metadata table.
-    List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-    if (!pendingInstants.isEmpty()) {
-      checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-      LOG.info(String.format(
-          "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. Inflight instants in data table: %s",
-          pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray())));
-      return false;
-    }
-
-    // Check if there are any pending compaction or log compaction instants in the timeline.
-    // If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations.
-    Option pendingLogCompactionInstant =

Review Comment: As discussed offline, please create unit test around the inflight scenario and necessary followup ticket to remove logcompaction.
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
danny0405 commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529733125

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan
     writeClient.lazyRollbackFailedIndexing();
   }

-  /**
-   * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents
-    //    any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    //    instant before c4 is synced with metadata table.
-    List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-    if (!pendingInstants.isEmpty()) {
-      checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-      LOG.info(String.format(
-          "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. Inflight instants in data table: %s",
-          pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray())));
-      return false;
-    }
-
-    // Check if there are any pending compaction or log compaction instants in the timeline.
-    // If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations.
-    Option pendingLogCompactionInstant =

Review Comment: Finally discussed offline with @suryaprasanna and we are convinced:
1. The first part check for pending instants can be removed because we have NB-CC style file slicing; the instant time sequence does not really matter, only the completion time plays the role.
2. The second part check can be limited to only the log compaction scope.
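For context, the check being removed scans for pending data-table instants at or before the latest MDT delta commit. A rough, self-contained illustration of that timestamp filter follows; the plain string lists stand in for Hudi's timeline types, which are not reproduced here.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PendingInstantCheckDemo {
    // Keep only the pending instants whose time is <= the latest MDT delta
    // commit. Hudi instant times (yyyyMMddHHmmss...) sort lexicographically,
    // so plain string comparison matches chronological order.
    static List<String> pendingBeforeOrEquals(List<String> pendingInstants,
                                              String latestDeltaCommitTime) {
        return pendingInstants.stream()
                .filter(t -> t.compareTo(latestDeltaCommitTime) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pending = List.of("20240319105800", "20240319110000");
        // Only the instant at or before the MDT delta commit would block
        // compaction under the old check; the later inflight is tolerated.
        System.out.println(
            pendingBeforeOrEquals(pending, "20240319105900")); // [20240319105800]
    }
}
```

Under the NB-CC reasoning in the discussion above, this instant-time ordering stops being the deciding factor, which is why the filter-based check can go away.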
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005780416

## CI report:

* d7e56c6660139f382830dc545ab703445fba9153 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22936)
* 054e183e0ccfaf58d0034ea76c322b5f166859b8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22949)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-2005773466

## CI report:

* d7e56c6660139f382830dc545ab703445fba9153 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22936)
* 054e183e0ccfaf58d0034ea76c322b5f166859b8 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [DOCS] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on code in PR #10856: URL: https://github.com/apache/hudi/pull/10856#discussion_r1529713723

## website/docs/basic_configurations.md:
@@ -101,15 +102,14 @@ Flink jobs using the SQL can be configured through the options in WITH clause. T

| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name to register to Hive metastore `Config Param: DATABASE_NAME` |
| [hoodie.table.name](#hoodietablename) | (N/A) | Table name to register to Hive metastore `Config Param: TABLE_NAME` |
| [path](#path) | (N/A) | Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfully `Config Param: PATH` |
+| [read.commits.limit](#readcommitslimit) | (N/A) | The maximum number of commits allowed to read in each instant check, if it is streaming read, the avg read instants number per-second would be 'read.commits.limit'/'read.streaming.check-interval', by default no limit `Config Param: READ_COMMITS_LIMIT` |
| [read.end-commit](#readend-commit) | (N/A) | End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss' `Config Param: READ_END_COMMIT` |
| [read.start-commit](#readstart-commit) | (N/A) | Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read `Config Param: READ_START_COMMIT` |
| [archive.max_commits](#archivemax_commits) | 50 | Max number of commits to keep before archiving older commits into a sequential log, default 50 `Config Param: ARCHIVE_MAX_COMMITS` |
| [archive.min_commits](#archivemin_commits) | 40 | Min number of commits to keep before archiving older commits into a sequential log, default 40 `Config Param: ARCHIVE_MIN_COMMITS` |
| [cdc.enabled](#cdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode `Config Param: CDC_ENABLED`
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
hudi-bot commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005765864

## CI report:

* 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
bhat-vinay commented on code in PR #10865: URL: https://github.com/apache/hudi/pull/10865#discussion_r1529707163

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource( QueryRunner queryRunner, CloudDataFetcher cloudDataFetcher) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+      sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+      sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment: As I stated above, it did not work. Seems similar to issues like [this](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios) mentioned online.
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1529692618

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
@@ -725,40 +725,45 @@ private FlinkOptions() {
       .withDescription("Target IO in MB for per compaction (both read and write), default 500 GB");

   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-      .key("clean.async.enabled")
+      .key("hoodie.cleaner.async.enabled")

Review Comment: Ok, I see your point. Modified the corresponding commit, keeping the previous keys in `FlinkOptions` and adding the new keys as fallback keys. Rebased on the current master and force pushed.
Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]
ad1happy2go commented on issue #10873: URL: https://github.com/apache/hudi/issues/10873#issuecomment-2005742951 Sorry @xuzifu666, I didn't understand. Do you mean it should archive or not?
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
hudi-bot commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005722576

## CI report:

* 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
* 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
hudi-bot commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005717768

## CI report:

* 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906)
* 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
CTTY commented on code in PR #10872: URL: https://github.com/apache/hudi/pull/10872#discussion_r1529666064

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java:
@@ -57,6 +57,8 @@
  */
 public class JsonKafkaSource extends KafkaSource {

+  private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

Review Comment: Just being curious: do we have to use a static object mapper here, or is it fine to wrap it in the iterator as well?
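On the static-vs-local question: Jackson's `ObjectMapper` is thread-safe once configured but relatively expensive to construct, so the usual advice is to build it once and reuse it rather than per record. The sketch below illustrates only that construction-cost trade-off with a hypothetical `CountingParser` stand-in (it is not Jackson, just a counter-instrumented dummy):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class SharedParserDemo {
    // Stand-in for an expensive-to-build, thread-safe parser like ObjectMapper.
    static class CountingParser {
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();
        CountingParser() { CONSTRUCTED.incrementAndGet(); }
        String parse(String json) { return json.trim(); }
    }

    // Shared instance, mirroring a static ObjectMapper field.
    static final CountingParser SHARED = new CountingParser();

    static List<String> withShared(List<String> records) {
        // Reuses the single shared parser: zero extra constructions.
        return records.stream().map(SHARED::parse).collect(Collectors.toList());
    }

    static List<String> withPerRecord(List<String> records) {
        // Constructing inside the loop pays the build cost for every record.
        return records.stream()
                .map(r -> new CountingParser().parse(r))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int before = CountingParser.CONSTRUCTED.get();
        withPerRecord(List.of(" a ", " b ", " c "));
        System.out.println(CountingParser.CONSTRUCTED.get() - before); // 3 extra builds
        before = CountingParser.CONSTRUCTED.get();
        withShared(List.of(" a ", " b ", " c "));
        System.out.println(CountingParser.CONSTRUCTED.get() - before); // 0 extra builds
    }
}
```

A per-iterator instance (one per partition) sits between the two: it avoids the per-record cost while sidestepping any concerns about sharing state across partitions, which may be what the review question is weighing.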
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
wombatu-kun commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005715865 @hudi-bot run azure
Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]
danny0405 commented on PR #10877: URL: https://github.com/apache/hudi/pull/10877#issuecomment-2005707868 @nsivabalan Can you check this PR?
Re: [I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]
danny0405 commented on issue #10881: URL: https://github.com/apache/hudi/issues/10881#issuecomment-2005697002 So you can read the parquet file with the Spark native reader correctly? Is there any error log from the writer while doing the archiving?
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
hudi-bot commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005678158 ## CI report: * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906) * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22948) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]
xicm commented on issue #10881: URL: https://github.com/apache/hudi/issues/10881#issuecomment-2005676734 This is the parquet file ![image](https://github.com/apache/hudi/assets/36392121/c772d5f6-0fb5-4f6c-b6d1-2f69c9a4fb25)
[I] [SUPPORT] Archived parquet file length is 0 when Spark does streaming read [hudi]
xicm opened a new issue, #10881: URL: https://github.com/apache/hudi/issues/10881 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** ``` 24/03/19 10:58:34 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 1000 milliseconds, but spent 1141 milliseconds 24/03/19 10:58:38 ERROR MicroBatchExecution: Query [id = 3b464aa8-542d-4eea-b45a-4db8f2a8c8df, runId = d257c670-6a0c-49a8-af65-8cd40f32f1fc] terminated with error org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593) at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677) at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735) at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at org.apache.hudi.common.table.timeline.HoodieArchivedTimeline.loadInstants(HoodieArchivedTimeline.java:267) at org.apache.hudi.common.table.timeline.CompletionTimeQueryView.load(CompletionTimeQueryView.java:323) at 
org.apache.hudi.common.table.timeline.CompletionTimeQueryView.(CompletionTimeQueryView.java:99) at org.apache.hudi.common.table.timeline.CompletionTimeQueryView.(CompletionTimeQueryView.java:84) at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:121) at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:115) at org.apache.hudi.common.table.view.HoodieTableFileSystemView.(HoodieTableFileSystemView.java:109) at org.apache.hudi.common.table.view.HoodieTableFileSystemView.(HoodieTableFileSystemView.java:100) at org.apache.hudi.common.table.view.HoodieTableFileSystemView.(HoodieTableFileSystemView.java:179) at org.apache.hudi.MergeOnReadIncrementalRelation.collectFileSplits(MergeOnReadIncrementalRelation.scala:110) at org.apache.hudi.MergeOnReadIncrementalRelation.collectFileSplits(MergeOnReadIncrementalRelation.scala:46) at org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:371) at org.apache.spark.sql.hudi.streaming.HoodieStreamSource.getBatch(HoodieStreamSource.scala:177) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$3(MicroBatchExecution.scala:549) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:27) at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293) at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290) at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:27) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$2(MicroBatchExecution.scala:545) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:545) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:256
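One way to narrow down a report like this before involving the Spark reader at all is to look for archived files whose on-disk length is zero, which is what the issue title describes. A hedged, stdlib-only sketch (this is not a Hudi API; the directory layout and `.parquet` suffix filter are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ZeroLengthFileCheck {
  // Walks a directory and returns every regular *.parquet file whose size is 0 bytes.
  static List<Path> zeroLengthFiles(Path archiveDir) throws IOException {
    try (Stream<Path> files = Files.walk(archiveDir)) {
      return files.filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .filter(p -> {
            try { return Files.size(p) == 0; } catch (IOException e) { return false; }
          })
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("archived");
    Files.createFile(dir.resolve("empty.parquet"));              // length 0, would be unreadable
    Files.write(dir.resolve("ok.parquet"), new byte[]{1, 2, 3}); // non-empty
    System.out.println(zeroLengthFiles(dir).size()); // prints 1
  }
}
```

A zero-length file cannot contain a valid parquet footer, so any reader (including the archived-timeline loader in the stack trace above) would fail on it; finding one points at the writer/archiver side rather than the reader.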
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
hudi-bot commented on PR #10866: URL: https://github.com/apache/hudi/pull/10866#issuecomment-2005673058 ## CI report: * 538d59f11cfca29af8550391f0ce7a491279c0a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22906) * 113f4e5717cdf97aed4f91cecd965c4cec02dec0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
wombatu-kun commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529616940 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java: ## @@ -266,9 +266,9 @@ public void setupTest() { protected static void populateInvalidTableConfigFilePathProps(TypedProperties props, String dfsBasePath) { props.setProperty("hoodie.datasource.write.keygenerator.class", TestHoodieDeltaStreamer.TestGenerator.class.getName()); - props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", "MMdd"); -props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", "uber_db.dummy_table_uber"); - props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile", dfsBasePath + "/config/invalid_uber_config.properties"); +props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", "MMdd"); Review Comment: Thank you, fixed it ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java: ## @@ -279,10 +279,10 @@ protected static void populateAllCommonProps(TypedProperties props, String dfsBa protected static void populateCommonProps(TypedProperties props, String dfsBasePath) { props.setProperty("hoodie.datasource.write.keygenerator.class", TestHoodieDeltaStreamer.TestGenerator.class.getName()); - props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", "MMdd"); -props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", "short_trip_db.dummy_table_short_trip,uber_db.dummy_table_uber"); - props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile", dfsBasePath + "/config/uber_config.properties"); - props.setProperty("hoodie.deltastreamer.ingestion.short_trip_db.dummy_table_short_trip.configFile", dfsBasePath + "/config/short_trip_uber_config.properties"); +props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", "MMdd"); Review Comment: fixed 
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
wombatu-kun commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529616434 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/config/SourceTestConfig.java: ## @@ -27,22 +27,22 @@ public class SourceTestConfig { public static final ConfigProperty NUM_SOURCE_PARTITIONS_PROP = ConfigProperty - .key("hoodie.deltastreamer.source.test.num_partitions") + .key("hoodie.streamer.source.test.num_partitions") Review Comment: done
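The rename pattern in these reviews pairs a new `hoodie.streamer.*` key with the old `hoodie.deltastreamer.*` spelling kept as a deprecated alternative, so old configs keep resolving. A minimal sketch of that fallback lookup (illustrative only; Hudi's real mechanism is `ConfigProperty.withAlternatives`, not this helper):

```java
import java.util.Properties;

public class StreamerPropFallback {
  static final String NEW_PREFIX = "hoodie.streamer.";
  static final String OLD_PREFIX = "hoodie.deltastreamer.";

  // Resolve by the new key first, falling back to the deprecated spelling.
  static String resolve(Properties props, String suffix) {
    String v = props.getProperty(NEW_PREFIX + suffix);
    return v != null ? v : props.getProperty(OLD_PREFIX + suffix);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.deltastreamer.source.test.num_partitions", "4");
    // Old spelling still resolves through the fallback.
    System.out.println(resolve(props, "source.test.num_partitions")); // prints 4
    props.setProperty("hoodie.streamer.source.test.num_partitions", "8");
    // The new key wins once it is present.
    System.out.println(resolve(props, "source.test.num_partitions")); // prints 8
  }
}
```

This is why the PR can rename every test property to the `hoodie.streamer.` prefix without breaking users who still set the old keys.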
Re: [I] [SUPPORT] Change drop.partitionpath behavior [hudi]
danny0405 commented on issue #10878: URL: https://github.com/apache/hudi/issues/10878#issuecomment-2005652050 > I went deeper and found that in Hudi 0.13.x and 0.14.x clustering does not honor the flag hoodie.datasource.write.drop.partition.columns=true Is this a bug?
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
danny0405 commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529609990 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan writeClient.lazyRollbackFailedIndexing(); } - /** - * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled. - */ - protected boolean validateTimelineBeforeSchedulingCompaction(Option inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) { -// we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table. -// Whenever you want to change this logic, please ensure all below scenarios are considered. -// a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed -// b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents -// any instants before that is already synced with metadata table. -// c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every -// instant before c4 is synced with metadata table. -List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested() - .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants(); - -if (!pendingInstants.isEmpty()) { - checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending()); - LOG.info(String.format( - "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. 
Inflight instants in data table: %s", - pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray(; - return false; -} - -// Check if there are any pending compaction or log compaction instants in the timeline. -// If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations. -Option pendingLogCompactionInstant = Review Comment: > validateTimelineBeforeSchedulingCompaction method basically checks if there are inflight instants or pending compaction or logcompaction plan on the timeline. If any such instants are found it will stop compaction or logcompaction from happening. We can now lift this restriction because the MDT reader will filter out the invalid instants. > It also makes sure only one pending plan is present on the metadata table's timeline. That way it is easy to handle various cases on metadata table. Can it be in line with what we have for the DT? A pending compaction plan on MDT would block all follow-up scheduling of compactions. > The pending logcompaction check present in the HoodieLogCompactionPlanGenerator class will exclude file groups that are currently under logcompaction, which means other file groups or partitions in metadata can still create another logcompaction plan. But there is no plan on the file slice that has pending plans (compaction/log_compaction), so the plan should still be valid? > One can argue that we should be able to schedule multiple logcompaction plans on the timeline, but we have seen operational issues when running logcompaction and compaction on metadata using table services. I just want to make sure the strategy is as simple as possible and on par with the DT, so that the streaming use cases can also work well. cc @codope for helping to address some of these debated points maybe. -- This is an automated message from the Apache Git Service.
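The scheduling rule debated above can be reduced to a small model: MDT compaction may be scheduled only when no data-table instant at or before the latest MDT delta commit time is still inflight or requested, including out-of-order commits. A simplified, self-contained sketch (plain Java, not Hudi's timeline classes; names are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CompactionGate {
  // Minimal stand-in for a timeline instant: a sortable time string plus a completed flag.
  static final class Instant {
    final String time;
    final boolean completed;
    Instant(String time, boolean completed) { this.time = time; this.completed = completed; }
  }

  // MDT compaction is schedulable only when every data-table instant at or
  // before the latest MDT delta commit has completed (i.e. is synced to MDT).
  static boolean canScheduleCompaction(List<Instant> dataTimeline, String latestMdtDeltaCommit) {
    List<String> pending = dataTimeline.stream()
        .filter(i -> !i.completed)                                // inflight or requested
        .filter(i -> i.time.compareTo(latestMdtDeltaCommit) <= 0) // at or before the boundary
        .map(i -> i.time)
        .collect(Collectors.toList());
    return pending.isEmpty();
  }

  public static void main(String[] args) {
    List<Instant> dt = Arrays.asList(
        new Instant("001", true),
        new Instant("002", false),  // inflight, completed out of order relative to 003
        new Instant("003", true));
    System.out.println(canScheduleCompaction(dt, "003")); // false: 002 is not yet synced
    System.out.println(canScheduleCompaction(dt, "001")); // true: inflight 002 is after the boundary
  }
}
```

The discussion above is about whether this gate (plus the single-pending-plan rule) can be loosened once the MDT reader filters invalid instants itself; the model only captures the check being debated, not the proposed relaxation.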
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
wombatu-kun commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529606592 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java: ## @@ -73,12 +73,12 @@ public static void beforeAll() throws Exception { @BeforeEach public void setup() throws Exception { super.setup(); -PROPS.setProperty("hoodie.deltastreamer.jdbc.url", "jdbc:h2:mem:test_mem"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.driver.class", "org.h2.Driver"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.user", "test"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.password", "jdbc"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.table.name", "triprec"); -connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc"); +PROPS.setProperty("hoodie.streamer.jdbc.url", "jdbc:h2:mem:test_mem"); +PROPS.setProperty("hoodie.streamer.jdbc.driver.class", "org.h2.Driver"); +PROPS.setProperty("hoodie.streamer.jdbc.user", "sa"); +PROPS.setProperty("hoodie.streamer.jdbc.password", ""); +PROPS.setProperty("hoodie.streamer.jdbc.table.name", "triprec"); +connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", ""); Review Comment: reverted, will do it in separate PR. ## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java: ## @@ -438,7 +438,7 @@ public void testSourceWithStorageLevel() { private void writeSecretToFs() throws IOException { FileSystem fs = FileSystem.get(new Configuration()); FSDataOutputStream outputStream = fs.create(new Path("file:///tmp/hudi/config/secret")); -outputStream.writeBytes("jdbc"); +outputStream.writeBytes(""); Review Comment: reverted, will do it in separate PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
wombatu-kun commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529606423 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java: ## @@ -2418,15 +2418,15 @@ public void testSqlSourceSource() throws Exception { @Disabled @Test public void testJdbcSourceIncrementalFetchInContinuousMode() { -try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc")) { +try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "")) { Review Comment: Yes, it's from another fix. It's really better to do it in a separate PR later. For now - reverted JDBC user name and password changes.
(hudi) branch master updated (b7ccecf3205 -> 7631e0dcb89)
This is an automated email from the ASF dual-hosted git repository. stream2000 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from b7ccecf3205 [HUDI-7492] Fix the incorrect keygenerator specification for multi partition or multi primary key tables creation (#10840) add 7631e0dcb89 [MINOR] Add Hudi icon for idea (#10880) No new revisions were added by this update. Summary of changes: .gitignore | 1 + .idea/icon.png | Bin 0 -> 14245 bytes 2 files changed, 1 insertion(+) create mode 100644 .idea/icon.png
Re: [PR] [MINOR] Add Hudi icon for idea [hudi]
stream2000 merged PR #10880: URL: https://github.com/apache/hudi/pull/10880
Re: [PR] [MINOR] Add Hudi icon for idea [hudi]
hudi-bot commented on PR #10880: URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005617775 ## CI report: * 543cc32433495f4af12239307bf87a94dedacb94 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22946) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005612033 ## CI report: * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN * c15a59630c3d1def347ad50993177ce22adca790 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22945) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Add Hudi icon for idea [hudi]
hudi-bot commented on PR #10880: URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005612176 ## CI report: * 543cc32433495f4af12239307bf87a94dedacb94 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Add Hudi icon for idea [hudi]
stream2000 commented on PR #10880: URL: https://github.com/apache/hudi/pull/10880#issuecomment-2005612134 Looks good, thanks for your contribution!
[PR] Add Hudi icon for idea [hudi]
qidian99 opened a new pull request, #10880: URL: https://github.com/apache/hudi/pull/10880 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
danny0405 commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529562361 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1070,7 +1070,7 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { Map> partitionFilesToAdd = new HashMap<>(); Map> partitionFilesToDelete = new HashMap<>(); List partitionsToDelete = new ArrayList<>(); - fetchOutofSyncFilesRecordsFromMetadataTable(dirInfoMap, partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete); + fetchAutofSyncFilesRecordsFromMetadataTable(dirInfoMap, partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete); Review Comment: yes
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
suryaprasanna commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529421217 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1070,7 +1070,7 @@ public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { Map> partitionFilesToAdd = new HashMap<>(); Map> partitionFilesToDelete = new HashMap<>(); List partitionsToDelete = new ArrayList<>(); - fetchOutofSyncFilesRecordsFromMetadataTable(dirInfoMap, partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete); + fetchAutofSyncFilesRecordsFromMetadataTable(dirInfoMap, partitionFilesToAdd, partitionFilesToDelete, partitionsToDelete); Review Comment: Is this a typo? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan writeClient.lazyRollbackFailedIndexing(); } - /** - * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled. - */ - protected boolean validateTimelineBeforeSchedulingCompaction(Option inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) { -// we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table. -// Whenever you want to change this logic, please ensure all below scenarios are considered. -// a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed -// b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents -// any instants before that is already synced with metadata table. -// c. Do consider out of order commits. For eg, c4 from DT could complete before c3. 
and we can't trigger compaction in MDT with c4 as base instant time, until every -// instant before c4 is synced with metadata table. -List pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested() - .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants(); - -if (!pendingInstants.isEmpty()) { - checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending()); - LOG.info(String.format( - "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. Inflight instants in data table: %s", - pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray(; - return false; -} - -// Check if there are any pending compaction or log compaction instants in the timeline. -// If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations. -Option pendingLogCompactionInstant = Review Comment: validateTimelineBeforeSchedulingCompaction method basically checks if there are inflight instants or pending compaction or logcompaction plans on the timeline. If any such instants are found, it will stop compaction or logcompaction from happening. It also makes sure only one pending plan is present on the metadata table's timeline. That way it is easy to handle various cases on the metadata table. The pending logcompaction check present in the HoodieLogCompactionPlanGenerator class will exclude file groups that are currently under logcompaction, which means other file groups or partitions in metadata can still create another logcompaction plan. So, the checks on pending compaction and logcompaction instants are required. One can argue that we should be able to schedule multiple logcompaction plans on the timeline, but we have seen operational issues when running logcompaction and compaction on metadata using table services.
So, it is better to keep these checks.
Re: [PR] [HUDI-7510] Loosen the compaction scheduling and rollback check for MDT [hudi]
danny0405 commented on code in PR #10874: URL: https://github.com/apache/hudi/pull/10874#discussion_r1529560937 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1130,22 +1128,6 @@ public void update(HoodieRollbackMetadata rollbackMetadata, String instantTime) } } - protected void validateRollback( - String commitToRollbackInstantTime, - HoodieInstant compactionInstant, Review Comment: We can do this now because the MOR rollback is changed to be based on file deletion, similar to the COW table.
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1529551851 ## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.DummyActiveAction; +import org.apache.hudi.client.timeline.LSMTimelineWriter; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.engine.LocalTaskContextSupplier; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.testutils.HoodieCommonTestHarness; +import org.apache.hudi.common.testutils.HoodieTestTable; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.index.HoodieIndex; + +import org.apache.hadoop.conf.Configuration; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Test cases for {@link HoodieGlobalTimeline}. + */ +public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness { + @BeforeEach + public void setUp() throws Exception { +initMetaClient(); + } + + @AfterEach + public void tearDown() throws Exception { +cleanMetaClient(); + } + + /** + * The test for checking whether an instant is archived. 
+ */ + @Test + void testArchivingCheck() throws Exception { +writeArchivedTimeline(10, 1000, 50); +writeActiveTimeline(1050, 10); +HoodieGlobalTimeline globalTimeline = new HoodieGlobalTimeline(this.metaClient, Option.empty()); +assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant should be active"); Review Comment: Renames `containsOrBeforeTimelineStarts` to `isValidInstant` and `isBeforeTimelineStarts` to `isArchived` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005552101 ## CI report: * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN * 948d9ecb6dc661628b787ba800756b78d52791af Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885) * c15a59630c3d1def347ad50993177ce22adca790 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22945) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Comment Edited] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828123#comment-17828123 ] sivabalan narayanan edited comment on HUDI-7507 at 3/19/24 1:12 AM: Just trying to replay the same scenario for data table, conflict resolution could have aborted job2. and hence we may not hit the same issue. was (Author: shivnarayan): Just trying to replay the same scenario for data table, conflict resolution could have aborted job2. and hence we may not hit the same issue. > ongoing concurrent writers with smaller timestamp can cause issues with > table services > --- > > Key: HUDI-7507 > URL: https://issues.apache.org/jira/browse/HUDI-7507 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Krishen Bhan >Priority: Major > > Although HUDI operations hold a table lock when creating a .requested > instant, because HUDI writers do not generate a timestamp and create a > .requested plan in the same transaction, there can be a scenario where > # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp > (x - 1) > # Job 1 schedules and creates requested file with instant timestamp (x) > # Job 2 schedules and creates requested file with instant timestamp (x-1) > # Both jobs continue running > If one job is writing a commit and the other is a table service, this can > cause issues: > * > ** If Job 2 is an ingestion commit and Job 1 is compaction/log compaction, then > when Job 1 runs before Job 2 and can create a compaction plan for all instant > times (up to (x) ) that doesn’t include instant time (x-1) . Later Job 2 > will create instant time (x-1), but timeline will be in a corrupted state > since compaction plan was supposed to include (x-1) > ** There is a similar issue with clean.
If Job2 is a long-running commit > (that was stuck/delayed for a while before creating its .requested plan) and > Job 1 is a clean, then Job 1 can perform a clean that updates the > earliest-commit-to-retain without waiting for the inflight instant by Job 2 > at (x-1) to complete. This causes Job2 to be "skipped" by clean. > One way this can be resolved is by combining the operations of generating > instant time and creating a requested file in the same HUDI table > transaction. Specifically, executing the following steps whenever any instant > (commit, table service, etc) is scheduled > # Acquire table lock > # Look at the latest instant C on the active timeline (completed or not). > Generate a timestamp after C > # Create the plan and requested file using this new timestamp ( that is > greater than C) > # Release table lock > Unfortunately this has the following drawbacks > * Every operation must now hold the table lock when computing its plan, even > if it's an expensive operation and will take a while > * Users of HUDI cannot easily set their own instant time of an operation, > and this restriction would break any public APIs that allow this > An alternate approach (suggested by [~pwason] ) was to instead have all > operations including table services perform conflict resolution checks before > committing. For example, clean and compaction would generate their plan as > usual. But when creating a transaction to write a .requested file, right > before creating the file they should check if another lower timestamp instant > has appeared in the timeline. And if so, they should fail/abort without > creating the plan. Commit operations would also be updated/verified to have a > similar check, before creating a .requested file (during a transaction) the > commit operation will check if a table service plan (clean/compact) with a > greater instant time has been created. And if so, would abort/fail.
This > avoids the drawbacks of the first approach, but will lead to more transient > failures that users have to handle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
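The lock-based scheduling in the first approach above can be sketched in a few lines. This is a minimal illustration, not Hudi code — `TimelineSketch` and `scheduleInstant` are hypothetical names — showing why generating the timestamp and creating the requested entry inside one critical section rules out the Job 1/Job 2 timestamp inversion:

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

public class TimelineSketch {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final TreeSet<Long> requestedInstants = new TreeSet<>();

    public long scheduleInstant() {
        tableLock.lock();          // 1. acquire table lock
        try {
            // 2. look at the latest instant C on the timeline (completed or not)
            long latest = requestedInstants.isEmpty() ? 0L : requestedInstants.last();
            // 3. generate a timestamp strictly greater than C and create the
            //    "requested" entry under the same lock
            long ts = Math.max(System.currentTimeMillis(), latest + 1);
            requestedInstants.add(ts);
            return ts;
        } finally {
            tableLock.unlock();    // 4. release table lock
        }
    }

    public static void main(String[] args) {
        TimelineSketch timeline = new TimelineSketch();
        long first = timeline.scheduleInstant();
        long second = timeline.scheduleInstant();
        // timestamps are strictly monotonic, so a later writer can never
        // obtain a smaller instant time than an earlier one
        System.out.println(second > first); // prints "true"
    }
}
```

Because every scheduler observes the latest instant and writes its own entry under the same lock, no second writer can slip in with a smaller timestamp — at the cost, noted above, of holding the lock while the plan is computed.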
[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828142#comment-17828142 ] sivabalan narayanan commented on HUDI-7507: --- We have already fixed it w/ latest master (1.0) by generating the new commit times using locks. That should solve the issue. We can apply the same to 0.X branch.
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-2005544453 ## CI report: * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN * 948d9ecb6dc661628b787ba800756b78d52791af Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885) * c15a59630c3d1def347ad50993177ce22adca790 UNKNOWN
Re: [I] [SUPPORT] Spark snapshot query against MOR table data written by Flink gives an incorrect timestamp [hudi]
danny0405 commented on issue #10879: URL: https://github.com/apache/hudi/issues/10879#issuecomment-2005456434 Did you specify read options like `read.utc-timezone`? By default it is true, and recently we also added support for the write UTC timezone option in: https://github.com/apache/hudi/pull/10594
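For reference, a hypothetical Flink SQL table definition would set the option mentioned above in the `WITH` clause (table name, schema, and path are placeholders):

```sql
-- 'read.utc-timezone' defaults to 'true'; set it to 'false' to interpret
-- TIMESTAMP columns in the local time zone instead of UTC on read
CREATE TABLE hudi_source (
  id INT,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',
  'read.utc-timezone' = 'false'
);
```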
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
danny0405 commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1529509534 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ## @@ -725,40 +725,45 @@ private FlinkOptions() { .withDescription("Target IO in MB for per compaction (both read and write), default 500 GB"); public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions - .key("clean.async.enabled") + .key("hoodie.cleaner.async.enabled") Review Comment: I kind of think the SQL options should not be too long, and we keep compatibility via Hudi alternatives.
Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]
rmahindra123 commented on code in PR #10876: URL: https://github.com/apache/hudi/pull/10876#discussion_r1529460065 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java: ## @@ -90,8 +94,11 @@ public class UpsertPartitioner extends SparkHoodiePartitioner { public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext context, HoodieTable table, Review Comment: Should we add the implementation to a new class, maybe `SortedUpsertPartitioner` or something, so there is a clean separation? We can use the same config to control which one gets called.
[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828123#comment-17828123 ] sivabalan narayanan commented on HUDI-7507: --- Just trying to replay the same scenario for data table, conflict resolution could have aborted job2. and hence we may not hit the same issue.
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2005124255 ## CI report: * 629e91bc0267c0728b98326eb84072965c600205 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22928)
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2005007838 ## CI report: * 896491233f44039e8874d5a3080dd686fffd044e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22944)
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2004996708 ## CI report: * 629e91bc0267c0728b98326eb84072965c600205 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22928)
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2004984687 ## CI report: * 629e91bc0267c0728b98326eb84072965c600205 UNKNOWN
Re: [PR] [MINOR] Changing the Properties to Load From Both Default Path and Enviorment [hudi]
yihua commented on PR #10835: URL: https://github.com/apache/hudi/pull/10835#issuecomment-2004976726 @Amar1404 any update on the PR and @CTTY's suggestion?
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004847718 ## CI report: * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943) * 896491233f44039e8874d5a3080dd686fffd044e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22944)
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
yihua commented on code in PR #10865: URL: https://github.com/apache/hudi/pull/10865#discussion_r1529188995 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource( QueryRunner queryRunner, CloudDataFetcher cloudDataFetcher) { super(props, sparkContext, sparkSession, schemaProvider); + +if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) { + sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true"); + sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true"); Review Comment: See spark docs: https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files `Spark allows you to use the configuration spark.sql.files.ignoreMissingFiles or the data source option ignoreMissingFiles to ignore missing files while reading data from files.` You need to set `.option("ignoreMissingFiles")` to achieve the behavior.
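The distinction the reviewer draws can be sketched as follows. This is an illustrative fragment, not the PR's code; the format and path are placeholders. The session-wide conf (as in the diff above) affects every subsequent file-based read in the session, while the data source option is scoped to a single `load`:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IgnoreMissingFilesSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // session-wide conf: applies to every file-based read from now on
        spark.conf().set("spark.sql.files.ignoreMissingFiles", "true");

        // per-read data source option: applies only to this one load
        Dataset<Row> rows = spark.read()
                .format("json")
                .option("ignoreMissingFiles", "true")
                .option("ignoreCorruptFiles", "true")
                .load("/path/to/cloud/files");
    }
}
```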
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004824804 ## CI report: * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943) * 896491233f44039e8874d5a3080dd686fffd044e UNKNOWN
Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]
yihua commented on code in PR #10727: URL: https://github.com/apache/hudi/pull/10727#discussion_r1529152799 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java: ## @@ -202,7 +202,9 @@ private Option> composeSchemaEvolutionTrans Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName()); Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName()); boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size() - || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE; + && SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() + == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE; + Review Comment: @xiarixiaoyao This info is valuable. Basically, using a pruned schema to read Avro records is supported on Avro 1.10 and above, not on lower versions. I see that Spark 3.2 and above and all Flink versions use Avro 1.10 and above. So for these integrations and others that rely on Avro 1.10 and above, we should use the pruned schema to read log records to improve performance. I'll check the new file group reader.
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
yihua commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-2004747912 > @yihua : We need to use Hudi with Spark 3.5. Can you let me know when is Hudi 0.15.0 release planned? The 0.15.0 release branch is planned to be cut this month once we verify engine integrations.
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
yihua commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529100966 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java: ## @@ -2418,15 +2418,15 @@ public void testSqlSourceSource() throws Exception { @Disabled @Test public void testJdbcSourceIncrementalFetchInContinuousMode() { -try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc")) { +try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", "")) { Review Comment: Looks like this is from another fix. Could the JDBC user and password be put into static variables to avoid any misconfigs in tests in the future? ## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java: ## @@ -438,7 +438,7 @@ public void testSourceWithStorageLevel() { private void writeSecretToFs() throws IOException { FileSystem fs = FileSystem.get(new Configuration()); FSDataOutputStream outputStream = fs.create(new Path("file:///tmp/hudi/config/secret")); -outputStream.writeBytes("jdbc"); +outputStream.writeBytes(""); Review Comment: Similar here for the password. And should we add the password back for testing the secret (this refactoring of static variables and password should be in a separate PR)? 
## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestJdbcSource.java: ## @@ -73,12 +73,12 @@ public static void beforeAll() throws Exception { @BeforeEach public void setup() throws Exception { super.setup(); -PROPS.setProperty("hoodie.deltastreamer.jdbc.url", "jdbc:h2:mem:test_mem"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.driver.class", "org.h2.Driver"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.user", "test"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.password", "jdbc"); -PROPS.setProperty("hoodie.deltastreamer.jdbc.table.name", "triprec"); -connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc"); +PROPS.setProperty("hoodie.streamer.jdbc.url", "jdbc:h2:mem:test_mem"); +PROPS.setProperty("hoodie.streamer.jdbc.driver.class", "org.h2.Driver"); +PROPS.setProperty("hoodie.streamer.jdbc.user", "sa"); +PROPS.setProperty("hoodie.streamer.jdbc.password", ""); +PROPS.setProperty("hoodie.streamer.jdbc.table.name", "triprec"); +connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "sa", ""); Review Comment: Similar here for the JDBC user name and password. ## hudi-utilities/src/test/resources/streamer-config/invalid_hive_sync_uber_config.properties: ## @@ -18,6 +18,6 @@ include=base.properties hoodie.datasource.write.recordkey.field=_row_key hoodie.datasource.write.partitionpath.field=created_at -hoodie.deltastreamer.source.kafka.topic=test_topic -hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP -hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd \ No newline at end of file +hoodie.streamer.source.kafka.topic=test_topic +hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP Review Comment: reminder for fixing keygen config. 
## hudi-utilities/src/test/resources/streamer-config/uber_config.properties: ## @@ -18,10 +18,10 @@ include=base.properties hoodie.datasource.write.recordkey.field=_row_key hoodie.datasource.write.partitionpath.field=created_at -hoodie.deltastreamer.source.kafka.topic=topic1 -hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP -hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd HH:mm:ss.S +hoodie.streamer.source.kafka.topic=topic1 +hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP Review Comment: reminder for fixing keygen config. ## hudi-utilities/src/test/resources/streamer-config/short_trip_uber_config.properties: ## @@ -18,11 +18,11 @@ include=base.properties hoodie.datasource.write.recordkey.field=_row_key hoodie.datasource.write.partitionpath.field=created_at -hoodie.deltastreamer.source.kafka.topic=topic2 -hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP -hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd HH:mm:ss.S +hoodie.streamer.source.kafka.topic=topic2 +hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP Review Comment: reminder for fixing keygen config.
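The refactoring the reviewer suggests — centralizing the H2 test credentials in static variables — could look like the following sketch. The class and constant names (`JdbcTestConstants`, `JDBC_URL`, etc.) are hypothetical, not existing Hudi code:

```java
// Hypothetical holder so the JDBC tests cannot drift out of sync on the
// H2 URL, user, and password again.
public final class JdbcTestConstants {
    public static final String JDBC_URL = "jdbc:h2:mem:test_mem";
    public static final String JDBC_USER = "sa";       // H2 default user
    public static final String JDBC_PASSWORD = "";     // H2 default empty password

    private JdbcTestConstants() {
        // no instances; constants only
    }

    public static void main(String[] args) {
        // tests would then build connections and props from one place, e.g.:
        // DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASSWORD);
        System.out.println(JDBC_URL); // prints "jdbc:h2:mem:test_mem"
    }
}
```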
Re: [PR] [HUDI-7187] Fix integ test props to honor new streamer properties [hudi]
yihua commented on code in PR #10866: URL: https://github.com/apache/hudi/pull/10866#discussion_r1529083756 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/config/SourceTestConfig.java: ## @@ -27,22 +27,22 @@ public class SourceTestConfig { public static final ConfigProperty NUM_SOURCE_PARTITIONS_PROP = ConfigProperty - .key("hoodie.deltastreamer.source.test.num_partitions") + .key("hoodie.streamer.source.test.num_partitions") Review Comment: Let's add `.withAlternatives()` to preserve the old config naming to be backward compatible and use `STREAMER_CONFIG_PREFIX` and `DELTA_STREAMER_CONFIG_PREFIX`. Similar for other configs in this class, `SourceTestConfig`. See `HoodieStreamerConfig.CHECKPOINT_PROVIDER_PATH` for reference: ``` public static final ConfigProperty CHECKPOINT_PROVIDER_PATH = ConfigProperty .key(STREAMER_CONFIG_PREFIX + "checkpoint.provider.path") .noDefaultValue() .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "checkpoint.provider.path") .markAdvanced() .withDocumentation("The path for providing the checkpoints."); ``` ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java: ## @@ -266,9 +266,9 @@ public void setupTest() { protected static void populateInvalidTableConfigFilePathProps(TypedProperties props, String dfsBasePath) { props.setProperty("hoodie.datasource.write.keygenerator.class", TestHoodieDeltaStreamer.TestGenerator.class.getName()); - props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", "MMdd"); -props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", "uber_db.dummy_table_uber"); - props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile", dfsBasePath + "/config/invalid_uber_config.properties"); +props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", "MMdd"); Review Comment: For key generator-related configs, the config prefix is renamed from `hoodie.deltastreamer.keygen.timebased.` to 
`hoodie.keygen.timebased.` (see class `TimestampKeyGeneratorConfig`). ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java: ## @@ -279,10 +279,10 @@ protected static void populateAllCommonProps(TypedProperties props, String dfsBa protected static void populateCommonProps(TypedProperties props, String dfsBasePath) { props.setProperty("hoodie.datasource.write.keygenerator.class", TestHoodieDeltaStreamer.TestGenerator.class.getName()); - props.setProperty("hoodie.deltastreamer.keygen.timebased.output.dateformat", "MMdd"); -props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested", "short_trip_db.dummy_table_short_trip,uber_db.dummy_table_uber"); - props.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile", dfsBasePath + "/config/uber_config.properties"); - props.setProperty("hoodie.deltastreamer.ingestion.short_trip_db.dummy_table_short_trip.configFile", dfsBasePath + "/config/short_trip_uber_config.properties"); +props.setProperty("hoodie.streamer.keygen.timebased.output.dateformat", "MMdd"); Review Comment: Similar here and other places for key generator configs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
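The `.withAlternatives()` suggestion above boils down to a key lookup that falls back to the deprecated name when the new one is absent. A minimal Python sketch of that resolution logic (the helper name `resolve_config` is illustrative and this is not Hudi's actual `ConfigProperty` implementation):

```python
STREAMER_CONFIG_PREFIX = "hoodie.streamer."
DELTA_STREAMER_CONFIG_PREFIX = "hoodie.deltastreamer."

def resolve_config(props, key, alternatives=(), default=None):
    # The primary key wins; deprecated alternative keys are consulted
    # only when the primary key is absent, which keeps old configs working.
    if key in props:
        return props[key]
    for alt in alternatives:
        if alt in props:
            return props[alt]
    return default

# An old job still setting the deprecated deltastreamer key keeps working:
props = {DELTA_STREAMER_CONFIG_PREFIX + "source.test.num_partitions": "4"}
value = resolve_config(
    props,
    STREAMER_CONFIG_PREFIX + "source.test.num_partitions",
    alternatives=(DELTA_STREAMER_CONFIG_PREFIX + "source.test.num_partitions",),
)
print(value)  # 4
```

This mirrors why the review asks for `.withAlternatives(...)` on every renamed key: without the fallback, existing jobs that set only the old `hoodie.deltastreamer.*` key would silently get the default value.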
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
vinothchandar commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1529092922 ## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.DummyActiveAction; +import org.apache.hudi.client.timeline.LSMTimelineWriter; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.engine.LocalTaskContextSupplier; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.testutils.HoodieCommonTestHarness; +import org.apache.hudi.common.testutils.HoodieTestTable; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.index.HoodieIndex; + +import org.apache.hadoop.conf.Configuration; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Test cases for {@link HoodieGlobalTimeline}. + */ +public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness { + @BeforeEach + public void setUp() throws Exception { +initMetaClient(); + } + + @AfterEach + public void tearDown() throws Exception { +cleanMetaClient(); + } + + /** + * The test for checking whether an instant is archived. 
+ */ + @Test + void testArchivingCheck() throws Exception { +writeArchivedTimeline(10, 1000, 50); +writeActiveTimeline(1050, 10); +HoodieGlobalTimeline globalTimeline = new HoodieGlobalTimeline(this.metaClient, Option.empty()); +assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant should be active"); Review Comment: works for now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
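The scenario in `testArchivingCheck`, archived instants up to 1000 and active instants starting at 1050, relies on the invariant that any instant older than the first active instant must live in the archived timeline. A simplified Python model of that check (a sketch of the idea only, not Hudi's `isBeforeTimelineStarts` implementation):

```python
def is_before_timeline_starts(active_instants, instant):
    # Instant timestamps are fixed-width strings, so lexicographic
    # comparison matches chronological order; anything older than the
    # first active instant can only be found in the archived timeline.
    return bool(active_instants) and instant < min(active_instants)

active = [str(t) for t in range(1050, 1060)]      # active timeline: 1050..1059
print(is_before_timeline_starts(active, "1049"))  # True: must be archived
print(is_before_timeline_starts(active, "1055"))  # False: still active
```

The assertion in the test checks exactly this boundary: instant "1049" falls just before the first active instant "1050".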
Re: [PR] [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation [hudi]
yihua commented on code in PR #10778: URL: https://github.com/apache/hudi/pull/10778#discussion_r1529060885 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala: ## @@ -58,9 +59,19 @@ object AvroConversionUtils { */ def createInternalRowToAvroConverter(rootCatalystType: StructType, rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = { val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable) -row => serializer - .serialize(row) - .asInstanceOf[GenericRecord] +row => { + try { +serializer + .serialize(row) + .asInstanceOf[GenericRecord] + } catch { +case e: HoodieSchemaException => throw e +case e => throw new SchemaCompatibilityException("Failed to convert spark record into avro record", e) + } + Review Comment: nit: remove empty line ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala: ## @@ -58,9 +59,19 @@ object AvroConversionUtils { */ def createInternalRowToAvroConverter(rootCatalystType: StructType, rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = { val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable) -row => serializer - .serialize(row) - .asInstanceOf[GenericRecord] +row => { + try { +serializer + .serialize(row) + .asInstanceOf[GenericRecord] + } catch { +case e: HoodieSchemaException => throw e +case e => throw new SchemaCompatibilityException("Failed to convert spark record into avro record", e) + } + +} + + Review Comment: Nit: remove two empty lines ## hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java: ## @@ -159,7 +156,16 @@ public static Option> createHoodieRecords(HoodieStreamer.C * @return the representation of error record (empty {@link HoodieRecord} and the error record * String) for writing to error table. 
*/ - private static Either generateErrorRecord(GenericRecord genRec) { + private static Either generateErrorRecordOrThrowException(GenericRecord genRec, Exception e, boolean shouldErrorTable) { +if (!shouldErrorTable) { + if (e instanceof HoodieKeyException) { +throw (HoodieKeyException) e; + } else if (e instanceof HoodieKeyGeneratorException) { +throw (HoodieKeyGeneratorException) e; Review Comment: Nit: could the `if` and `else if` be merged like before? Also, there is no need to do type cast here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
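The pattern under review, re-throw exceptions that are already classified and wrap everything else, can be sketched in Python (the exception class names mirror Hudi's, but this is an illustrative analogue, not the actual Scala code from `AvroConversionUtils`):

```python
class HoodieSchemaException(Exception):
    """Already-classified schema error; propagate unchanged."""

class SchemaCompatibilityException(Exception):
    """Wrapper for unclassified conversion failures."""

def convert_row(row, serializer):
    try:
        return serializer(row)
    except HoodieSchemaException:
        raise  # already carries the right classification, so re-throw as-is
    except Exception as e:
        # Anything else is wrapped so callers see a schema-related failure
        # with the original error preserved as the cause.
        raise SchemaCompatibilityException(
            "Failed to convert spark record into avro record") from e

# An unclassified failure gets wrapped, keeping the original as the cause:
def bad_serializer(row):
    raise ValueError("field type mismatch")

try:
    convert_row({}, bad_serializer)
except SchemaCompatibilityException as e:
    print(type(e.__cause__).__name__)  # ValueError
```

The ordering of the two handlers matters: the specific, already-classified exception must be matched before the catch-all, which is also why the reviewer's nit about merging the `if`/`else if` branches does not change behavior, only readability.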
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004464696 ## CI report: * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Spark snapshot query against MOR table data written by Flink gives an incorrect timestamp [hudi]
dderjugin opened a new issue, #10879: URL: https://github.com/apache/hudi/issues/10879 I'm writing data using the Flink DataStream API to a MoR table, partitioned by a column of Long type, with the PK being a Timestamp field. A Spark SNAPSHOT query gives an incorrect value for the latest records: +21971-04-23 04:46:37 instead of 1990-01-01 07:51:09.997. A read optimized query gives the correct result.

**To Reproduce**
1. Push data to a MOR table using the Flink 1.17.2 DataStream API. The PK is a timestamp, the partitioning field is a Long.
2. Query the table data using a Spark snapshot query: `select count(*), min(ts), max(ts) from [table]`
3. The max value of the `ts` column is incorrect: +21971-04-23 04:46:37

**Expected behavior**
The max `ts` column value should be "1990-01-01 07:51:09.997"

Examples:
- correct result using read optimized query:

```
+--------+-------------------+-----------------------+
|count(1)|min(ts)            |max(ts)                |
+--------+-------------------+-----------------------+
|5166373 |1989-12-31 23:59:50|1990-01-01 05:47:39.998|
+--------+-------------------+-----------------------+
```

- incorrect result using snapshot query:

```
+--------+-------------------+---------------------+
|count(1)|min(ts)            |max(ts)              |
+--------+-------------------+---------------------+
|6033156 |1989-12-31 23:59:50|+21971-02-25 05:42:13|
+--------+-------------------+---------------------+
```

A detailed query shows that only the PK column value is incorrect; the rest of the table columns have the expected values. Data in the corresponding parquet file looks correct for all columns, including the PK column.

**Environment Description**
* Hudi version : 0.14.1
* Spark version : 3.4.2
* Hive version : 2.3.9
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : local filesystem
* Running on Docker? (yes/no) : no

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
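One plausible explanation for a timestamp landing in year +21971 (my own assumption, not confirmed anywhere in this thread) is a millisecond-vs-second unit mix-up somewhere in the snapshot read path: the epoch-millisecond value for 1990-01-01 07:51:09.997 UTC, when decoded as epoch seconds, lands in year 21971. The arithmetic:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 365.2425 * 86400  # mean Gregorian year, 31,556,952 s

ts = datetime(1990, 1, 1, 7, 51, 9, 997000, tzinfo=timezone.utc)
td = ts - EPOCH
# Exact epoch-millisecond value for this timestamp:
ts_millis = td.days * 86_400_000 + td.seconds * 1000 + td.microseconds // 1000

# Decoding that millisecond count as *seconds* overshoots by a factor of
# 1000; Python's datetime cannot even represent the result, so estimate
# the year arithmetically:
wrong_year = 1970 + int(ts_millis / SECONDS_PER_YEAR)
print(wrong_year)  # 21971
```

The estimate matches the reported garbage value (+21971-04-23), which is consistent with the millis being interpreted with the wrong unit for the PK column only.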
Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]
xuzifu666 commented on issue #10873: URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004342507 > No, it's not archiving. But why do you think they should be archived? All these commits are still valid and should be readable in this case, so they should remain active. Yes, it should not be archived. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004326001 ## CI report: * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942) * d934cf86d5375c3dec64f0bb171f455aa488e7fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22943) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004163320 ## CI report: * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942) * d934cf86d5375c3dec64f0bb171f455aa488e7fc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004144844 ## CI report: * 8a6413f86714acd2f2f31a66af42b66526c23665 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22942) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]
ad1happy2go commented on issue #10873: URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004127645 No, it's not archiving. But why do you think they should be archived? All these commits are still valid and should be readable in this case, so they should remain active. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]
xuzifu666 commented on issue #10873: URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004120534 > yes, there was no commit which was being archived. Hi @ad1happy2go, is this case also archived? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Insert overwrite with replacement instant cannot execute archive [hudi]
ad1happy2go commented on issue #10873: URL: https://github.com/apache/hudi/issues/10873#issuecomment-2004095974 @xuzifu666 In case you are suggesting to update the above code to -
```
for i in range(1, 100):
    df = df.withColumn("city", lit(i))
    (df.write.format("hudi").options(**hudi_options).mode("append").save(PATH))
```
With the above change, yes, there was no commit which was being archived. But I am wondering why they would even be archived, as all the partitions' data will still be valid, and all the commits are valid and should be active. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
vinishjail97 commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004067378 > Is it possible to get a test coverage for the source classes touched and share the report. > > Generally, GcsEventsHoodieIncrSource and S3EventsHoodieIncrSource does not have any function tests right. so, we rely heavily on mocked UTs. GcsEventsHoodieIncrSource, S3EventsHoodieIncrSource, CloudDataFetcher and CloudObjectsSelectorCommon have decent coverage: https://github.com/apache/hudi/assets/16958856/b98d1b24-9598-4cb9-b9f7-53e473cbef77 https://github.com/apache/hudi/assets/16958856/56c013a0-b0f1-4ce2-be5d-d992b207f6b5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7501] Use source profile for S3 and GCS sources [hudi]
hudi-bot commented on PR #10861: URL: https://github.com/apache/hudi/pull/10861#issuecomment-2004035486 ## CI report: * c2f8413cf59f8dd3d14ab3bc638cf7e2eb989d80 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22896) * 8a6413f86714acd2f2f31a66af42b66526c23665 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
hudi-bot commented on PR #10865: URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003639403 ## CI report: * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22940) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Change drop.partitionpath behavior [hudi]
VitoMakarevich opened a new issue, #10878: URL: https://github.com/apache/hudi/issues/10878 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** Hudi 0.12.2 version. Hello, we are now using Hudi COW with `hoodie.datasource.write.drop.partition.columns=true` (non-default). Recently we wanted to run clustering for this dataset. The problem we observe is that the `bulk_insert` used in clustering does not honor this setting, and `parquet` files are created with partition columns. As far as I checked, it's not a concern for reading via Spark, but it causes problems with upserts for those partitions where we ran clustering. If an update comes to such a file, in 0.12.x we see failures like `Parquet/Avro schema mismatch: Avro field 'partition1' not found`. [Reproduction here](https://github.com/VitoMakarevich/hudi-incremental-issue/blob/master/src/main/scala/com/example/hudi/HudiClusteringPartitionColumn.scala) I went deeper and found that in Hudi 0.13.x and 0.14.x clustering also does not honor the flag `hoodie.datasource.write.drop.partition.columns=true`, but there it does not cause problems with upserts - partitions can be updated. Given this, we probably cannot upgrade now, but would like to execute clustering. So I need help with the following questions: 1. Is it safe to change `hoodie.datasource.write.drop.partition.columns` from `true` to `false`? As far as I observed, it looks to be working. 2. How can it be changed?
Changing it in the write options does not help, because it's written into the `hoodie.properties` file. If I change the contents of this file, it works this way: a subsequent update appends the partition column value, but only for changed rows; old rows remain with `null`. However, in the read path, old rows have this field populated (some magic). And once clustering is run, all rows have the field populated. Important: we would likely not be able to cluster the whole table at once, so if we run it in batches, is it safe to have part of the files including partition columns and part not? Nice-to-answer question: 1. Do you know what might have caused the change of behavior in 0.12.3 -> 0.13.0 so that the write works after clustering? **To Reproduce** [Reproduction here](https://github.com/VitoMakarevich/hudi-incremental-issue/blob/master/src/main/scala/com/example/hudi/HudiClusteringPartitionColumn.scala) **Environment Description** * Hudi version : 0.12.2 - 0.14.x * Spark version : 3.3.0 **Additional context** Add any other context about the problem here.
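Since the value that matters here is the one persisted in `hoodie.properties` (a plain Java-properties file under the table's `.hoodie/` directory), it is worth inspecting that file before flipping anything. A minimal sketch of such an inspection (the helper name and the sample keys below are illustrative; the exact property names persisted in your table's file may differ):

```python
def read_hoodie_properties(path):
    # hoodie.properties follows the java.util.Properties text format:
    # '#'-prefixed comments and one 'key=value' pair per line.
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```

Diffing the parsed dict before and after a config change makes it obvious whether a write actually updated the persisted table config or only the per-write options.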
**Stacktrace**
```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 128.0 failed 1 times, most recent failure: Lost task 0.0 in stage 128.0 (TID 531) (192.168.0.143 executor driver): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
```
Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]
hudi-bot commented on PR #10876: URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003610845 ## CI report: * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
bhat-vinay commented on code in PR #10865: URL: https://github.com/apache/hudi/pull/10865#discussion_r1528192748 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java: ## @@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource( QueryRunner queryRunner, CloudDataFetcher cloudDataFetcher) { super(props, sparkContext, sparkSession, schemaProvider); + +if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) { + sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true"); + sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true"); Review Comment: @nsivabalan Almost all [online](https://www.waitingforcode.com/apache-spark-sql/ignoring-files-issues-apache-spark-sql/read) [resources](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios) that I found on the usage of `spark.sql.files.ignoreMissingFiles` says that it needs to be set at the `SparkSession::conf()` level. I did try setting it through the `DataFrameReader::option` using `spark.read().format(fileFormat).option("spark.sql.files.ignoreMissingFiles", "true")`, but the `reader.load(...)` fails with `FileNotFoundException` for missing files. Can trigger this behavior with the added tests (by commenting out `sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true")` in `TestCloudObjectsSelectorCommon::ignoreMissingFiles(...)`. Debugger does show that the option is set in the `DataFrameReader` object, but it does not seem to impact the outcome. Seems similar to [this issue](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios) ``` .. .. 
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
    at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:3131)
    at org.apache.hudi.utilities.sources.helpers.TestCloudObjectsSelectorCommon.ignoreMissingFiles(TestCloudObjectsSelectorCommon.java:173)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
    at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
    at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
    at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
    at org.junit.p
```
Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]
hudi-bot commented on PR #10876: URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003452778 ## CI report: * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937) * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
hudi-bot commented on PR #10865: URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003452570 ## CI report: * 870069b52e67240b93cbfa9e7184eb933f296ff6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22905) * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22940) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
hudi-bot commented on PR #10865: URL: https://github.com/apache/hudi/pull/10865#issuecomment-2003422476 ## CI report: * 870069b52e67240b93cbfa9e7184eb933f296ff6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22905) * 06e57fd2e8a4cb8276f7d718cf9614757e6e5788 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]
hudi-bot commented on PR #10876: URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003422824 ## CI report: * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937) * 5016a9c8d9daeea9f6f28f63cc090514482571a4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]
bhat-vinay commented on code in PR #10865: URL: https://github.com/apache/hudi/pull/10865#discussion_r1528192748

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:

```diff
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
       QueryRunner queryRunner,
       CloudDataFetcher cloudDataFetcher) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+      sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+      sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
```

Review Comment: @nsivabalan Almost all [online](https://www.waitingforcode.com/apache-spark-sql/ignoring-files-issues-apache-spark-sql/read) [resources](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios) that I found on the usage of `spark.sql.files.ignoreMissingFiles` say that it needs to be set at the `SparkSession::conf()` level. I did try setting it through `DataFrameReader::option` using `spark.read().format(fileFormat).option("ignoreMissingFiles", "true")`, but `reader.load(...)` still fails with `FileNotFoundException` for missing files. This behavior can be triggered with the added tests, by commenting out `sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true")` in `TestCloudObjectsSelectorCommon::ignoreMissingFiles(...)`. The debugger does show that the option is set on the `DataFrameReader` object, but it does not seem to affect the outcome. Seems similar to [this issue](https://stackoverflow.com/questions/77195180/ignoremissingfiles-option-not-working-in-certain-scenarios):

```
..
..
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:3131)
at org.apache.hudi.utilities.sources.helpers.TestCloudObjectsSelectorCommon.ignoreMissingFiles(TestCloudObjectsSelectorCommon.java:173)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
at org.junit.platform.engine.s
```
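The session-level vs. reader-level distinction described in the comment above can be sketched with a minimal, self-contained model. This is NOT Spark itself — `SessionConfModel` and its methods are hypothetical — but it mirrors the observed behavior: if the file-scan layer reads `spark.sql.files.ignoreMissingFiles` only from the session configuration, then a `DataFrameReader`-level option is stored but never consulted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not Spark) of why a per-reader option can be ignored:
// the scan layer consults only the session conf, so reader-level options
// for this flag never take effect.
public class SessionConfModel {
    private final Map<String, String> sessionConf = new HashMap<>();

    public void setSessionConf(String key, String value) {
        sessionConf.put(key, value);
    }

    // Reader options are accepted as a parameter but deliberately ignored,
    // matching the behavior reported in the PR discussion.
    public boolean scanIgnoresMissingFiles(Map<String, String> readerOptions) {
        return Boolean.parseBoolean(
            sessionConf.getOrDefault("spark.sql.files.ignoreMissingFiles", "false"));
    }

    public static void main(String[] args) {
        SessionConfModel spark = new SessionConfModel();
        Map<String, String> readerOptions = new HashMap<>();
        readerOptions.put("ignoreMissingFiles", "true"); // set, but never consulted

        // Reader-level option alone does not change the scan behavior...
        System.out.println(spark.scanIgnoresMissingFiles(readerOptions)); // false

        // ...while the session-level setting does.
        spark.setSessionConf("spark.sql.files.ignoreMissingFiles", "true");
        System.out.println(spark.scanIgnoresMissingFiles(readerOptions)); // true
    }
}
```

This is consistent with the fix in the PR, which sets the flag on `sparkSession.conf()` rather than on the reader.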
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1528084191

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:

```diff
@@ -725,40 +725,45 @@ private FlinkOptions() {
       .withDescription("Target IO in MB for per compaction (both read and write), default 500 GB");

   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-      .key("clean.async.enabled")
+      .key("hoodie.cleaner.async.enabled")
```

Review Comment: @danny0405, if it doesn't bother you, could you please clarify why the Flink options differ from the corresponding Hudi internal options? From the user's point of view, this is an unnecessary complication: you have to look up the Flink variant of a name even when you already know the Hudi internal name for the same option. Could we establish a naming convention for options? It's a nightmare now — as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738), we have 900+ options at this point. Could you please take a look at the file attached to [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738)?
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1528087176

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:

```diff
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-      .key("clean.async.enabled")
+      .key("hoodie.cleaner.async.enabled")
```

Review Comment: There are some parameters without shortcuts in `FlinkOptions`:

```
hoodie.bucket.index.hash.field
hoodie.bucket.index.num.buckets
hoodie.database.name
hoodie.datasource.merge.type
hoodie.datasource.query.type
hoodie.datasource.write.hive_style_partitioning
hoodie.datasource.write.keygenerator.class
hoodie.datasource.write.keygenerator.type
hoodie.datasource.write.partitionpath.field
hoodie.datasource.write.partitionpath.urlencode
hoodie.datasource.write.recordkey.field
hoodie.index.bucket.engine
hoodie.table.name
```
Re: [PR] [HUDI-7513] Add jackson-module-scala to spark bundle [hudi]
danny0405 commented on code in PR #10877: URL: https://github.com/apache/hudi/pull/10877#discussion_r1528081859

## pom.xml:

```diff
@@ -482,6 +482,7 @@
   org.apache.htrace:htrace-core4
   com.fasterxml.jackson.module:jackson-module-afterburner
+  com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}
```

Review Comment: cc @nsivabalan do we need to check other jackson related jars?
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
danny0405 commented on code in PR #10851: URL: https://github.com/apache/hudi/pull/10851#discussion_r1528063145

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:

```diff
   public static final ConfigOption CLEAN_ASYNC_ENABLED = ConfigOptions
-      .key("clean.async.enabled")
+      .key("hoodie.cleaner.async.enabled")
```

Review Comment: `clean.async.enabled` is a shortcut option specifically for Flink; `hoodie.cleaner.async.enabled` should be the fallback key instead.
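The pattern Danny describes — a short Flink-specific shortcut key with the long `hoodie.*` key as a fallback — can be sketched with a small, self-contained resolver. This is a hypothetical stand-in for a fallback-key mechanism like Flink's `ConfigOptions`, not the real implementation; the two key names are taken from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of fallback-key resolution: consult the shortcut key
// first, then each fallback key in order, then the default value.
public class OptionResolver {
    public static String resolve(Map<String, String> conf, String key,
                                 String defaultValue, String... fallbackKeys) {
        if (conf.containsKey(key)) {
            return conf.get(key);
        }
        for (String fallback : fallbackKeys) {
            if (conf.containsKey(fallback)) {
                return conf.get(fallback);
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();

        // A user who only knows the long Hudi key still gets their value applied:
        conf.put("hoodie.cleaner.async.enabled", "false");
        System.out.println(resolve(conf, "clean.async.enabled", "true",
            "hoodie.cleaner.async.enabled")); // false

        // The Flink shortcut, when present, takes precedence over the fallback:
        conf.put("clean.async.enabled", "true");
        System.out.println(resolve(conf, "clean.async.enabled", "true",
            "hoodie.cleaner.async.enabled")); // true
    }
}
```

This illustrates why keeping `clean.async.enabled` as the primary key with `hoodie.cleaner.async.enabled` as the fallback preserves both spellings for users.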
[jira] [Closed] (HUDI-7492) When using Flinkcatalog to create hudi multiple partitions or multiple primary keys, the keygenerator generation is incorrect
[ https://issues.apache.org/jira/browse/HUDI-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7492.
Resolution: Fixed

Fixed via master branch: b7ccecf32051e1f880826ee21a4b97cfd3a06f87

> When using FlinkCatalog to create a Hudi table with multiple partitions or multiple primary keys, the key generator is generated incorrectly
>
> Key: HUDI-7492
> URL: https://issues.apache.org/jira/browse/HUDI-7492
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: 陈磊
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> When using FlinkCatalog to create a Hudi table with multiple partitions or multiple primary keys, the key generator is generated incorrectly.
>
> {code:sql}
> create table if not exists hudi_catalog.source1.hudi_key_generator_test
> (
>   id int,
>   name varchar,
>   age varchar,
>   sex int,
>   primary key (id) not enforced
> ) partitioned by (`sex`, `age`) with (
>   'connector' = 'hudi'
> );
> {code}
>
> .hoodie/hoodie.properties
> {panel:title=hoodie.properties}
> hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleAvroKeyGenerator
> {panel}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7492) When using Flinkcatalog to create hudi multiple partitions or multiple primary keys, the keygenerator generation is incorrect
[ https://issues.apache.org/jira/browse/HUDI-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7492:
Fix Version/s: 0.15.0
               1.0.0

> When using FlinkCatalog to create a Hudi table with multiple partitions or multiple primary keys, the key generator is generated incorrectly
>
> Key: HUDI-7492
> URL: https://issues.apache.org/jira/browse/HUDI-7492
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: 陈磊
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
(hudi) branch master updated: [HUDI-7492] Fix the incorrect keygenerator specification for multi partition or multi primary key tables creation (#10840)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new b7ccecf3205  [HUDI-7492] Fix the incorrect keygenerator specification for multi partition or multi primary key tables creation (#10840)

commit b7ccecf32051e1f880826ee21a4b97cfd3a06f87
Author: empcl <1515827...@qq.com>
AuthorDate: Mon Mar 18 16:27:09 2024 +0800

    [HUDI-7492] Fix the incorrect keygenerator specification for multi partition or multi primary key tables creation (#10840)

---
 .../org/apache/hudi/table/HoodieTableFactory.java |  7 +---
 .../apache/hudi/table/catalog/HoodieCatalog.java  |  4 ++
 .../hudi/table/catalog/HoodieHiveCatalog.java     |  3 ++
 .../java/org/apache/hudi/util/StreamerUtil.java   | 12 ++
 .../hudi/table/catalog/TestHoodieCatalog.java     | 43 +
 .../hudi/table/catalog/TestHoodieHiveCatalog.java | 45 ++
 6 files changed, 108 insertions(+), 6 deletions(-)

```diff
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
index 68642b39da8..65f0199ae80 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
@@ -28,7 +28,6 @@ import org.apache.hudi.configuration.HadoopConfigurations;
 import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieValidationException;
 import org.apache.hudi.index.HoodieIndex;
-import org.apache.hudi.keygen.ComplexAvroKeyGenerator;
 import org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator;
 import org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator;
 import org.apache.hudi.util.AvroSchemaConverter;
@@ -318,11 +317,7 @@ public class HoodieTableFactory implements DynamicTableSourceFactory, DynamicTab
       }
     }
     boolean complexHoodieKey = pks.length > 1 || partitions.length > 1;
-    if (complexHoodieKey && FlinkOptions.isDefaultValueDefined(conf, FlinkOptions.KEYGEN_CLASS_NAME)) {
-      conf.setString(FlinkOptions.KEYGEN_CLASS_NAME, ComplexAvroKeyGenerator.class.getName());
-      LOG.info("Table option [{}] is reset to {} because record key or partition path has two or more fields",
-          FlinkOptions.KEYGEN_CLASS_NAME.key(), ComplexAvroKeyGenerator.class.getName());
-    }
+    StreamerUtil.checkKeygenGenerator(complexHoodieKey, conf);
   }

   /**
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
index d25db7d82fa..f9088b4096c 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
@@ -343,6 +343,10 @@ public class HoodieCatalog extends AbstractCatalog {
       final String partitions = String.join(",", resolvedTable.getPartitionKeys());
       conf.setString(FlinkOptions.PARTITION_PATH_FIELD, partitions);
       options.put(TableOptionProperties.PARTITION_COLUMNS, partitions);
+
+      final String[] pks = conf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
+      boolean complexHoodieKey = pks.length > 1 || resolvedTable.getPartitionKeys().size() > 1;
+      StreamerUtil.checkKeygenGenerator(complexHoodieKey, conf);
     } else {
       conf.setString(FlinkOptions.KEYGEN_CLASS_NAME.key(), NonpartitionedAvroKeyGenerator.class.getName());
     }
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
index 3e409d11f5d..ce0230e6939 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
@@ -502,6 +502,9 @@ public class HoodieHiveCatalog extends AbstractCatalog {
     if (catalogTable.isPartitioned() && !flinkConf.contains(FlinkOptions.PARTITION_PATH_FIELD)) {
       final String partitions = String.join(",", catalogTable.getPartitionKeys());
       flinkConf.setString(FlinkOptions.PARTITION_PATH_FIELD, partitions);
+      final String[] pks = flinkConf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
+      boolean complexHoodieKey = pks.length > 1 || catalogTable.getPartitionKeys().size() > 1;
+      StreamerUtil.checkKeygenGenerator(complexHoodieKey, flinkConf);
     }

     if (!catalogTable.isPartitioned()) {
diff --gi
```
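The core decision the commit above centralizes in `StreamerUtil.checkKeygenGenerator` — use the complex key generator whenever there is more than one record-key field or more than one partition field — can be sketched as a small, self-contained model. Note this is a simplification: the real method operates on a Flink `Configuration` rather than returning a class name, and `KeygenChooser` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of the keygen choice behind HUDI-7492: more than one
// record-key field or more than one partition field requires the complex
// key generator; a single simple key can use the simple one.
public class KeygenChooser {
    static final String SIMPLE = "org.apache.hudi.keygen.SimpleAvroKeyGenerator";
    static final String COMPLEX = "org.apache.hudi.keygen.ComplexAvroKeyGenerator";

    public static String choose(String recordKeyField, List<String> partitionKeys) {
        String[] pks = recordKeyField.split(",");
        boolean complexHoodieKey = pks.length > 1 || partitionKeys.size() > 1;
        return complexHoodieKey ? COMPLEX : SIMPLE;
    }

    public static void main(String[] args) {
        // The table from HUDI-7492: primary key (id), partitioned by (sex, age).
        // The bug wrote SimpleAvroKeyGenerator into hoodie.properties here.
        System.out.println(choose("id", Arrays.asList("sex", "age")));
        // -> org.apache.hudi.keygen.ComplexAvroKeyGenerator

        // A single key and single partition stays on the simple generator.
        System.out.println(choose("id", Collections.singletonList("sex")));
        // -> org.apache.hudi.keygen.SimpleAvroKeyGenerator
    }
}
```

Running `choose` on the reproduction table from the Jira issue shows why the original `hoodie.properties` content (`SimpleAvroKeyGenerator`) was wrong for a two-field partition path.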