Re: [PR] [HUDI-6851] Fixing Spark quick start guide [hudi]
bhasudha merged PR #9712: URL: https://github.com/apache/hudi/pull/9712 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1334928958 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic Hudi use cases, users partition Hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly; we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to Hudi; users can configure the strategies directly through table configs or via call commands. With the proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+This proposal introduces a Partition TTL Management service to Hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will help find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategy, find expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it is becoming increasingly important to implement a user-friendly TTL management mechanism for Hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job, or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategy definition is similar to existing table service strategies. We can define a TTL strategy like defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (a subclass of `PartitionTTLManagementStrategy`) that determines the expired partition paths to delete. `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`, meaning we will expire partitions that have not been modified for this many days. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this:
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition TTL management strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` and Hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than the configured `hoodie.partition.ttl.days.retain`, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as expired.
We use days as the unit of expiration time, since it is commonly used for data lakes. Open to ideas on this.
+
+We will use the largest commit time of committed file groups in the partition as the partition's `lastModifiedTime`. So any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, this may not reveal the real insert/update time for the file group. However, we can assume that we won't cluster a partition that has had no new writes for a long time when using this strategy. In the future, we may introduce a more accurate mechanism to get the `lastModifiedTime` of a partition, for example using the metadata table.
+
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies to different partitions. For example, they may have multiple partition fields (productId, day). For partitions und
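As a rough illustration of the extension point described above, here is a minimal, self-contained sketch. It mirrors the RFC's abstract class (the `List<String>` return type is an assumption, since the quoted snippet lost its generics), and the `KeepByTimeStrategy` name plus the in-memory `lastModifiedTimes` map are hypothetical stand-ins for what the real implementation would derive from the Hudi timeline:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.*;
import java.util.stream.Collectors;

// Mirrors the abstract class quoted in the RFC; List<String> is an assumption.
abstract class PartitionTTLManagementStrategy {
  public abstract List<String> getExpiredPartitionPaths();
}

// Hypothetical KEEP_BY_TIME-style strategy: a partition expires once it has not
// been modified for more than the configured number of retention days.
class KeepByTimeStrategy extends PartitionTTLManagementStrategy {
  private final Map<String, Instant> lastModifiedTimes; // partitionPath -> lastModifiedTime
  private final long daysRetain;                        // hoodie.partition.ttl.days.retain
  private final Instant now;

  KeepByTimeStrategy(Map<String, Instant> lastModifiedTimes, long daysRetain, Instant now) {
    this.lastModifiedTimes = lastModifiedTimes;
    this.daysRetain = daysRetain;
    this.now = now;
  }

  @Override
  public List<String> getExpiredPartitionPaths() {
    // Keep partitions modified within the retention window; expire the rest.
    return lastModifiedTimes.entrySet().stream()
        .filter(e -> ChronoUnit.DAYS.between(e.getValue(), now) > daysRetain)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

In the real service, the `lastModifiedTimes` map would be computed from the largest commit time per partition, as the RFC describes.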
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1334257222 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@
[quoted RFC text identical to the previous comment omitted]
Review Comment: In the current implementation, `HoodiePartitionMetadata` provides only the `commitTime` (partition-creation commit time) and `partitionDepth` properties. We could add a new `lastModifiedTime` property to `.hoodie_partition_metadata`, updated on every commit/deltacommit to the corresponding partition. We only need to think about migration from a version without partition-level TTL to a new one with this feature.
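The comment above proposes adding a `lastModifiedTime` property to `.hoodie_partition_metadata`. That file might then look roughly like the sketch below (the `lastModifiedTime` key name and the timestamp values are hypothetical; only `commitTime` and `partitionDepth` exist today):

```properties
#partition metadata
#Wed Oct 04 09:30:00 UTC 2023
commitTime=20231003120000000
partitionDepth=1
# Hypothetical new property, refreshed on every commit/deltacommit to this partition
lastModifiedTime=20231004093000000
```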
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]
geserdugarov commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1334248487 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@
[quoted RFC text identical to the previous comment omitted]
Review Comment: Sorry to be difficult; it's just that providing TTL functionality through a custom implementation of `PartitionTTLManagementStrategy` is not user friendly. We want to automate the detection of outdated partitions and the calling of `delete_partition`. Could we just allow the user to set a partition path specification with a TTL value, and implement everything internally? From my point of view, there are two main entities in TTL:
- object: in our case it's a partition, which we define using `spec`.
- definition of outdating: it should be time or something time-dependent.
In our case, we could compare the difference between the current time and `_hoodie_commit_time` against a user-defined delta value. This is the main scope of TTL, and we shouldn't allow more flexibility than that. A customized implementation of `PartitionTTLManagementStrategy` would allow doing anything with partitions. It could still be a `PartitionManagementStrategy`, but then we shouldn't name it with the `TTL` part.
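A spec-plus-TTL configuration in the spirit of this comment might look as follows. These key names are purely illustrative assumptions, not part of any released Hudi version:

```properties
# Hypothetical: expire any partition matching the spec once it is older than the TTL
hoodie.partition.ttl.spec=productId=*/day=*
hoodie.partition.ttl.retain.days=30
```

The appeal of this shape is that the user declares only the object (the partition `spec`) and the outdating rule (a time delta), and everything else stays internal to Hudi.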
Re: [PR] [WIP][DO NOT MERGE][DOCS] Add release notes for 0.14.0 [hudi]
codope commented on code in PR #9790: URL: https://github.com/apache/hudi/pull/9790#discussion_r1345180240 ## website/releases/release-0.14.0.md: ## @@ -0,0 +1,339 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) ([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new functionalities and enhancements. These include the introduction of Record Level Index, automatic generation of record keys, the `hudi_table_changes` function for incremental reads, and more. Notably, this release also incorporates support for Spark 3.4. On the Flink front, version 0.14.0 brings several exciting features such as consistent hashing index support, Flink 1.17 support, and Update and Delete statement support. Additionally, this release upgrades the Hudi table version, prompting users to consult the Migration Guide provided below. We encourage users to review the [release highlights](#release-highlights), [breaking changes](#breaking-changes), and [behavior changes](#behavior-changes) before adopting the 0.14.0 release.
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans from the ".aux" folder and the introduction of a new log block version. As part of this release, the table version is updated to version `6`. When running a Hudi job with version 0.14.0 on a table with an older table version, an automatic upgrade process is triggered to bring the table up to version `6`. This upgrade is a one-time occurrence for each Hudi table, as the `hoodie.table.version` is updated in the property file upon completion of the upgrade.
+Additionally, a command-line tool for downgrading has been included, allowing users to move from table version `6` to `5`, or revert from Hudi 0.14.0 to a version prior to 0.14.0. To use this tool, execute it from a 0.14.0 environment. For more details, refer to the [hudi-cli](/docs/cli/#upgrade-and-downgrade-table).
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade instructions from each older release in sequence.
+:::
+
+### Bundle Updates
+
+#### New Spark Bundles
+In this release, we've expanded our support to include bundles for both Spark 3.4 ([hudi-spark3.4-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.4-bundle_2.12)) and Spark 3.0 ([hudi-spark3.0-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.0-bundle_2.12)). Please note that support for Spark 3.0 had been discontinued after Hudi version 0.10.1, but due to strong community interest, it has been reinstated in this release.
+
+### Breaking Changes
+
+#### INSERT INTO behavior with Spark SQL
+Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL followed the upsert flow, where multiple versions of a record would be merged into one. However, starting from 0.14.0, we've changed the default behavior of `INSERT INTO` to use the `insert` flow internally. This change significantly enhances write performance as it bypasses index lookups.
+
+If a table is created with a *preCombine* key, the default operation for `INSERT INTO` remains `upsert`. Conversely, if no *preCombine* key is set, the underlying write operation for `INSERT INTO` defaults to `insert`. Users have the flexibility to override this behavior by explicitly setting values for the config [`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation) as per their requirements.
+Possible values for this config include `insert`, `bulk_insert`, and `upsert`.
+
+Additionally, in version 0.14.0, we have **deprecated** two related older configs:
+- `hoodie.sql.insert.mode`
+- `hoodie.sql.bulk.insert.enable`
+
+### Behavior changes
+
+#### Simplified duplicates handling with Inserts in Spark SQL
+In cases where the operation type is configured as `insert` for the Spark SQL `INSERT INTO` flow, users now have the option to enforce a duplicate policy using the configuration setting [`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy). This policy determines the action taken when incoming records being ingested already exist in storage. The available values for this configuration are as follows:
+
+- `none`: No specific action is taken, allowing duplicates to exist in the Hudi table if the incoming records contain duplicates.
+- `drop`: Matching records from the incoming writes will be dropped, and the remaining ones will be ingested.
+- `fail`: The write operation will fail if the same records are re-ingested. In essence, a given record, as determined by the
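For illustration, the two configs described in these release notes could be combined in a Spark SQL session roughly as follows. The table and column names are hypothetical; the config keys and values are the ones named above:

```sql
-- Use the plain insert flow (skips index lookup) and drop incoming duplicates
SET hoodie.spark.sql.insert.into.operation = insert;
SET hoodie.datasource.insert.dup.policy = drop;

-- Hypothetical tables: records already present in hudi_orders are dropped
-- from the incoming batch, the rest are ingested
INSERT INTO hudi_orders
SELECT order_id, amount, order_date FROM staging_orders;
```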
Re: [I] commits_.archive is not move to archived folder [hudi]
ad1happy2go commented on issue #9812: URL: https://github.com/apache/hudi/issues/9812#issuecomment-1746106261 @njalan Looks like it's more of an upgrade issue. I don't see many active commits in the .hoodie file list you pasted in the other ticket. Archival will kick in once there are more than 20-30 commits. Let us know if the removal of archived commits from the .hoodie directory also helped in reducing S3 operations (your other issue). Please let us know if you don't see archival happening when you have more than 30 commits in the timeline.
Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]
ad1happy2go commented on issue #9805: URL: https://github.com/apache/hudi/issues/9805#issuecomment-1746102499 @noahtaite Thanks for all the effort. Yes, it should be supported. I saw you did the same `Partition delete with glue sync` as part of your solution. Did you face any issues when you tried that?
Re: [PR] [WIP][DO NOT MERGE][DOCS] Add release notes for 0.14.0 [hudi]
codope commented on code in PR #9790: URL: https://github.com/apache/hudi/pull/9790#discussion_r1345164763 ## website/releases/release-0.14.0.md: ## @@ -0,0 +1,339 @@
[quoted release-notes text identical to the previous comment omitted]
Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]
danny0405 commented on issue #9804: URL: https://github.com/apache/hudi/issues/9804#issuecomment-1746061222 It should be like this: `--hoodie-conf k1=v1,k2=v2`; for your option, it should be `--hoodie-conf hadoop.parquet.avro.readInt96AsFixed=true`
Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]
danny0405 commented on issue #9804: URL: https://github.com/apache/hudi/issues/9804#issuecomment-1746059316 Did you check your `.hoodie/hoodie.properties` file to see whether there is a table schema option?
[jira] [Updated] (HUDI-6914) String type partition value returned for a query on table partitioned by integer
[ https://issues.apache.org/jira/browse/HUDI-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6914: -- Description: Partition a table on a non-string field. If I query a table partitioned on the partition field, the value is returned as a string even though the type is int in the schema. Happens only when using ComplexKeyGenerator and CustomKeyGenerator. (was: Partition a table on a non-string field. If I query a table partitioned on the partition field, the value is returned as a string even though the type is int in the schema) > String type partition value returned for a query on table partitioned by > integer > > > Key: HUDI-6914 > URL: https://issues.apache.org/jira/browse/HUDI-6914 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Assignee: Jonathan Vexler >Priority: Major > Fix For: 0.14.1 > > > Partition a table on a non-string field. If I query a table partitioned on > the partition field, the value is returned as a string even though the type > is int in the schema. Happens only when using ComplexKeyGenerator and > CustomKeyGenerator. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
codope commented on code in PR #9581: URL: https://github.com/apache/hudi/pull/9581#discussion_r1344501029
## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDeleteBlock.java: ## @@ -65,17 +69,44 @@ public class HoodieDeleteBlock extends HoodieLogBlock { private static final Lazy<HoodieDeleteRecord.Builder> HOODIE_DELETE_RECORD_BUILDER_STUB = Lazy.lazily(HoodieDeleteRecord::newBuilder); + private final boolean writeRecordPositions; + // Records to delete, sorted based on the record position if writing record positions to the log block header private DeleteRecord[] recordsToDelete; - public HoodieDeleteBlock(DeleteRecord[] recordsToDelete, Map<HeaderMetadataType, String> header) { -this(Option.empty(), null, false, Option.empty(), header, new HashMap<>()); -this.recordsToDelete = recordsToDelete; + public HoodieDeleteBlock(List<Pair<DeleteRecord, Long>> recordsToDelete, + boolean writeRecordPositions, + Map<HeaderMetadataType, String> header) { +this(Option.empty(), null, false, Option.empty(), header, new HashMap<>(), writeRecordPositions); +if (writeRecordPositions) { + recordsToDelete.sort((o1, o2) -> { +long v1 = o1.getRight(); +long v2 = o2.getRight(); +return Long.compare(v1, v2); + }); + if (recordsToDelete.get(0).getRight() > -1L) { +addRecordPositionsToHeader( Review Comment: record position can be invalid (-1) when: 1. the current location is not known (new inserts going to a log file); 2. the base format is HFile (record position not supported for HFile).
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -89,10 +90,11 @@ public class HoodieAppendHandle extends HoodieWriteHandle records, + boolean writeRecordPositions, Map header, Map footer, String keyFieldName) { super(header, footer, Option.empty(), Option.empty(), null, false); +if (writeRecordPositions) { + records.sort((o1, o2) -> { Review Comment: I've enabled writing record positions for some of the existing tests, which includes the following scenarios: 1. MOR table with compaction. 2. MOR table with clustering. 3.
MOR table with clustering and no base file (before clustering). 4. COW/MOR table with different index types.
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -173,18 +173,18 @@ public static HoodieRecord tagRecord(HoodieRecord record, HoodieRecord * * @param filePath - File to filter keys from * @param candidateRecordKeys - Candidate keys to filter - * @return List of candidate keys that are available in the file + * @return List of pairs of candidate keys and positions that are available in the file */ - public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys, Configuration configuration) throws HoodieIndexException { + public static List<Pair<String, Long>> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys, Configuration configuration) throws HoodieIndexException { ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath)); -List<String> foundRecordKeys = new ArrayList<>(); +List<Pair<String, Long>> foundRecordKeys = new ArrayList<>(); try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO) .getFileReader(configuration, filePath)) { // Load all rowKeys from the file, to double-confirm if (!candidateRecordKeys.isEmpty()) { HoodieTimer timer = HoodieTimer.start(); -Set<String> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys)); +Set<Pair<String, Long>> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys)); Review Comment: Keys should be sorted for the HFile reader. For other formats, it doesn't matter as long as it is a set (for an efficient contains check). Incorporated the above suggestion - here we just pass a set and init a `SortedSet` in `HoodieHFileAvroReader` only, instead of building a `TreeSet` for every other format.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
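The review points above — delete records sorted by their `Long` position before the positions go into the log block header, positions treated as invalid when `-1`, and candidate keys needing to be sorted only for the HFile reader — can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (`KeyedPosition` etc.), not Hudi's actual classes:

```java
import java.util.*;

class DeletePositionSketch {
    // Hypothetical stand-in for Hudi's Pair<DeleteRecord, Long>: a record key
    // plus its position in the base file (-1 when the position is unknown,
    // e.g. new inserts going to a log file, or an HFile base file).
    record KeyedPosition(String recordKey, long position) {}

    // Mirrors the comparator in the PR: sort by position so positions can be
    // written to the log block header in ascending order.
    static void sortByPosition(List<KeyedPosition> records) {
        records.sort(Comparator.comparingLong(KeyedPosition::position));
    }

    // After sorting, it suffices to check the first element: positions are
    // only written when the smallest one is > -1.
    static boolean allPositionsValid(List<KeyedPosition> sorted) {
        return !sorted.isEmpty() && sorted.get(0).position() > -1L;
    }

    // Candidate keys need to be sorted only for the HFile reader; a plain
    // HashSet gives the same efficient contains() check for other formats.
    static Set<String> candidateKeySet(Collection<String> keys, boolean isHFile) {
        return isHFile ? new TreeSet<>(keys) : new HashSet<>(keys);
    }
}
```

This matches the suggestion incorporated in the review: pass a plain set everywhere and build the sorted set only inside the HFile reader, instead of paying the `TreeSet` construction cost for every format.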
[jira] [Commented] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query
[ https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771687#comment-17771687 ] Lin Liu commented on HUDI-6786: --- I ran into some serialization issues and am trying to fix them. > Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR > Snapshot Query > -- > > Key: HUDI-6786 > URL: https://issues.apache.org/jira/browse/HUDI-6786 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > > Goal: When `NewHoodieParquetFileFormat` is enabled with > `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR > Snapshot query should use HoodieFileGroupReader. All relevant tests on basic > MOR snapshot query should pass (except for the caveats in the current > HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in > this EPIC). > The query logic is implemented in > `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the > following code for MOR snapshot query: > {code:java} > else { > if (logFiles.nonEmpty) { > val baseFile = createPartitionedFile(InternalRow.empty, > hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen) > buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, > filePath.getParent, requiredSchemaWithMandatory, > requiredSchemaWithMandatory, outputSchema, partitionSchema, > partitionValues, broadcastedHadoopConf.value.value) > } else { > throw new IllegalStateException("should not be here since file slice > should not have been broadcasted since it has no log or data files") > //baseFileReader(baseFile) > } {code} > `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, > with a new config `hoodie.read.use.new.file.group.reader`, by passing in the > correct base and log file list. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path
[ https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6795: - Assignee: Sagar Sumit (was: Ethan Guo) > Implement generation of record_positions for updates and deletes on write path > -- > > Key: HUDI-6795 > URL: https://issues.apache.org/jira/browse/HUDI-6795 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT] Hoodie MAGIC was written twice to a log file [hudi]
danny0405 commented on issue #8887: URL: https://github.com/apache/hudi/issues/8887#issuecomment-1746012903 @dat-vikash Thanks for the feedback, let's see whether #8526 can solve this problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6902) Detect flaky tests
[ https://issues.apache.org/jira/browse/HUDI-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6902: -- Description: Step 1: Create a dummy PR and try to trigger the errors if possible. 1. The integration test constantly fails. 2. Some random failures: [https://github.com/apache/hudi/actions/runs/6396038672] was:Step 1: Create a dummy PR and try to trigger the errors if possible. > Detect flaky tests > -- > > Key: HUDI-6902 > URL: https://issues.apache.org/jira/browse/HUDI-6902 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > > Step 1: Create a dummy PR and try to trigger the errors if possible. > 1. The integration test constantly fails. > 2. Some random failures: > [https://github.com/apache/hudi/actions/runs/6396038672] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query
[ https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771669#comment-17771669 ] Lin Liu commented on HUDI-6786: --- So far I have created the `HoodieFileGroupReader` object inside `NewHoodieParquetFileFormat` for MOR tables with log files. I am running `TestNewHoodieParquetFileFormat` to surface any hidden bugs. > Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR > Snapshot Query > -- > > Key: HUDI-6786 > URL: https://issues.apache.org/jira/browse/HUDI-6786 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > > Goal: When `NewHoodieParquetFileFormat` is enabled with > `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR > Snapshot query should use HoodieFileGroupReader. All relevant tests on basic > MOR snapshot query should pass (except for the caveats in the current > HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in > this EPIC). 
> The query logic is implemented in > `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the > following code for MOR snapshot query: > {code:java} > else { > if (logFiles.nonEmpty) { > val baseFile = createPartitionedFile(InternalRow.empty, > hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen) > buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, > filePath.getParent, requiredSchemaWithMandatory, > requiredSchemaWithMandatory, outputSchema, partitionSchema, > partitionValues, broadcastedHadoopConf.value.value) > } else { > throw new IllegalStateException("should not be here since file slice > should not have been broadcasted since it has no log or data files") > //baseFileReader(baseFile) > } {code} > `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, > with a new config `hoodie.read.use.new.file.group.reader`, by passing in the > correct base and log file list. -- This message was sent by Atlassian Jira (v8.20.10#820010)
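The config-gated switch described in the ticket — replacing `buildMergeOnReadIterator` with `HoodieFileGroupReader` behind the proposed `hoodie.read.use.new.file.group.reader` config — can be sketched like this. The reader names are returned as plain strings here purely for illustration; the real change happens inside the Spark file format's reader-building code path:

```java
import java.util.Properties;

class ReaderSelectionSketch {
    // Config key proposed in the ticket for gating the new reader.
    static final String USE_FILE_GROUP_READER = "hoodie.read.use.new.file.group.reader";

    // Returns which reader the MOR snapshot path would build. A file slice
    // with no log files should never have been broadcast to this branch,
    // matching the IllegalStateException in the quoted Scala code.
    static String chooseReader(Properties props, boolean hasLogFiles) {
        if (!hasLogFiles) {
            throw new IllegalStateException(
                "file slice without log or data files should not have been broadcast here");
        }
        boolean useNewReader =
            Boolean.parseBoolean(props.getProperty(USE_FILE_GROUP_READER, "false"));
        return useNewReader ? "HoodieFileGroupReader" : "buildMergeOnReadIterator";
    }
}
```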
Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]
linliu-code commented on PR #9809: URL: https://github.com/apache/hudi/pull/9809#issuecomment-1745861430 The test failures seem unrelated; I ran the failing tests locally and they passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1745804741 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * 56aa98d5988a61597e76208f7d16018671e989bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20209) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1745750678 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * f627d9bb49a3bb4038b411caebbe31791a194073 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20198) * 56aa98d5988a61597e76208f7d16018671e989bc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]
noahtaite commented on issue #9805: URL: https://github.com/apache/hudi/issues/9805#issuecomment-1745641435 I will also do one more experiment to try a manual glue sync in between to see if that fixes the partitions as expected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]
noahtaite commented on issue #9805: URL: https://github.com/apache/hudi/issues/9805#issuecomment-1745640608 Hey @ad1happy2go Reproduced and found a workaround in my dev environment as follows:

```
Reproduce:
- Generate table with glue sync
- Partition delete without glue sync
- Partitions aren’t removed from glue
- Bulk insert with glue sync
- Partitions aren’t there

The following works as expected:
- Generate table with glue sync
- Partition delete with glue sync
- Partitions are removed from glue
- Bulk insert with glue sync
- Partitions are there
```

So it seems I can get the expected behaviour by configuring my DELETE_PARTITION write to use AWS Glue sync as well. I assume the next bulk_insert runs glue sync across both the replacecommit and the deltacommit, and therefore drops those incoming partitions. Maybe this is just a documentation issue more than anything? There is not much documentation on how to use the DELETE_PARTITION operation end to end, with the best example (IMO) being this video by @soumilshah1995 : https://www.youtube.com/watch?v=QqCiycIgSFk&t=387s In this video he has Glue sync disabled for DELETE_PARTITION, which I thought must be necessary for delete_partition to work. Is enabling glue sync for the DELETE_PARTITION operation supported? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
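The workaround described in that comment amounts to enabling meta sync on the DELETE_PARTITION write itself. A rough sketch of the relevant write options follows; the key names are from memory and should be verified against your Hudi version, and the Glue sync tool class in particular is an assumption based on the hudi-aws module:

```
hoodie.datasource.write.operation=delete_partition
hoodie.datasource.write.partitions.to.delete=<comma-separated partition paths>
hoodie.datasource.meta.sync.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
```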
[jira] [Closed] (HUDI-6907) E2E support HoodieSparkRecord
[ https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu closed HUDI-6907. - Resolution: Done > E2E support HoodieSparkRecord > -- > > Key: HUDI-6907 > URL: https://issues.apache.org/jira/browse/HUDI-6907 > Project: Apache Hudi > Issue Type: Test >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > As title. > > We have confirmed that the `HoodieSparkRecord` payload, that is, the `InternalRow`, > is written to the disk. > Though I have traced through the execution and created a reasonable workflow, and > did not find any transformation happening, I cannot 100% guarantee it did not > happen at all. We need a better way to confirm that. But it should be OK > for now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6908) Verify if any gaps exists for the e2e support
[ https://issues.apache.org/jira/browse/HUDI-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu closed HUDI-6908. - Resolution: Done > Verify if any gaps exists for the e2e support > - > > Key: HUDI-6908 > URL: https://issues.apache.org/jira/browse/HUDI-6908 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6907) E2E support HoodieSparkRecord
[ https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6907: -- Description: As title. We have confirmed that the `HoodieSparkRecord` payload, that is, the `InternalRow`, is written to the disk. Though I have traced through the execution and created a reasonable workflow, and did not find any transformation happening, I cannot 100% guarantee it did not happen at all. We need a better way to confirm that. But it should be OK for now. was: As title. We have confirmed that the `HoodieSparkRecord` is written to the disk. But we haven't confirmed whether the `HoodieSparkRecord` was transformed during the process; though I have traced through the execution and did not find any transformation happening, I cannot guarantee it did not happen at all. We need a better way to confirm that. But it should be OK for now. > E2E support HoodieSparkRecord > -- > > Key: HUDI-6907 > URL: https://issues.apache.org/jira/browse/HUDI-6907 > Project: Apache Hudi > Issue Type: Test >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > As title. > > We have confirmed that the `HoodieSparkRecord` payload, that is, the `InternalRow`, > is written to the disk. > Though I have traced through the execution and created a reasonable workflow, and > did not find any transformation happening, I cannot 100% guarantee it did not > happen at all. We need a better way to confirm that. But it should be OK > for now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6907) E2E support HoodieSparkRecord
[ https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu resolved HUDI-6907. --- > E2E support HoodieSparkRecord > -- > > Key: HUDI-6907 > URL: https://issues.apache.org/jira/browse/HUDI-6907 > Project: Apache Hudi > Issue Type: Test >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > As title. > > We have confirmed that the `HoodieSparkRecord` is written to the disk. > But we haven't confirmed whether the `HoodieSparkRecord` was transformed > during the process; though I have traced through the execution and did not find > any transformation happening, I cannot guarantee it did not happen at > all. We need a better way to confirm that. But it should be OK for now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6907) E2E support HoodieSparkRecord
[ https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771594#comment-17771594 ] Lin Liu commented on HUDI-6907: --- We will close this task for now. We may need to revisit it later. > E2E support HoodieSparkRecord > -- > > Key: HUDI-6907 > URL: https://issues.apache.org/jira/browse/HUDI-6907 > Project: Apache Hudi > Issue Type: Test >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > As title. > > We have confirmed that the `HoodieSparkRecord` is written to the disk. > But we haven't confirmed whether the `HoodieSparkRecord` was transformed > during the process; though I have traced through the execution and did not find > any transformation happening, I cannot guarantee it did not happen at > all. We need a better way to confirm that. But it should be OK for now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6907) E2E support HoodieSparkRecord
[ https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6907: -- Description: As title. We have confirmed that the `HoodieSparkRecord` is written to the disk. But we haven't confirmed whether the `HoodieSparkRecord` was transformed during the process; though I have traced through the execution and did not find any transformation happening, I cannot guarantee it did not happen at all. We need a better way to confirm that. But it should be OK for now. was: As title. > E2E support HoodieSparkRecord > -- > > Key: HUDI-6907 > URL: https://issues.apache.org/jira/browse/HUDI-6907 > Project: Apache Hudi > Issue Type: Test >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > As title. > > We have confirmed that the `HoodieSparkRecord` is written to the disk. > But we haven't confirmed whether the `HoodieSparkRecord` was transformed > during the process; though I have traced through the execution and did not find > any transformation happening, I cannot guarantee it did not happen at > all. We need a better way to confirm that. But it should be OK for now. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745488652 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * 1a11ff678d2345105879a6faa951c18d94dfa1ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20208) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] NotSerializableException using SparkRDDWriteClient with OCC and DynamoDBBasedLockProvider [hudi]
ehurheap commented on issue #9807: URL: https://github.com/apache/hudi/issues/9807#issuecomment-1745464144 Yes. I can write a dataframe to the same table, for example: ``` data.write .format("org.apache.hudi.Spark32PlusDefaultSource") .options(writeWithLocking) .mode("append") .save(tablePath) ``` where writeWithLocking options are: ``` (hoodie.bulkinsert.shuffle.parallelism,2) (hoodie.bulkinsert.sort.mode,NONE) (hoodie.clean.async,false) (hoodie.clean.automatic,false) (hoodie.cleaner.policy.failed.writes,LAZY) (hoodie.combine.before.insert,false) (hoodie.compact.inline,false) (hoodie.compact.schedule.inline,false) (hoodie.datasource.compaction.async.enable,false) (hoodie.datasource.write.hive_style_partitioning,true) (hoodie.datasource.write.keygenerator.class,org.apache.spark.sql.hudi.command.UuidKeyGenerator) (hoodie.datasource.write.operation,bulk_insert) (hoodie.datasource.write.partitionpath.field,env_id,week) (hoodie.datasource.write.precombine.field,schematized_at) (hoodie.datasource.write.recordkey.field,env_id,user_id) (hoodie.datasource.write.row.writer.enable,false) (hoodie.datasource.write.table.type,MERGE_ON_READ) (hoodie.metadata.enable,false) (hoodie.table.name,users_changes) (hoodie.write.concurrency.mode,OPTIMISTIC_CONCURRENCY_CONTROL) (hoodie.write.lock.dynamodb.endpoint_url,http://localhost:8000) (hoodie.write.lock.dynamodb.partition_key,users_changes-us-east-1-local) (hoodie.write.lock.dynamodb.region,us-east-1) (hoodie.write.lock.dynamodb.table,datalake-locks) (hoodie.write.lock.provider,org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider) ``` These locking configs are also in our production ingestion which writes to hudi using spark structured streaming without error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6796: -- Status: Patch Available (was: In Progress) > Implement position-based deletes in FileGroupReader > --- > > Key: HUDI-6796 > URL: https://issues.apache.org/jira/browse/HUDI-6796 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6797) Implement position-based updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6797: - Assignee: Sagar Sumit > Implement position-based updates in FileGroupReader > --- > > Key: HUDI-6797 > URL: https://issues.apache.org/jira/browse/HUDI-6797 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6797) Implement position-based updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6797: -- Status: In Progress (was: Open) > Implement position-based updates in FileGroupReader > --- > > Key: HUDI-6797 > URL: https://issues.apache.org/jira/browse/HUDI-6797 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6796: - Labels: pull-request-available (was: ) > Implement position-based deletes in FileGroupReader > --- > > Key: HUDI-6796 > URL: https://issues.apache.org/jira/browse/HUDI-6796 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6796][WIP] Use position-based deletes in FileGroupReader [hudi]
codope opened a new pull request, #9818: URL: https://github.com/apache/hudi/pull/9818 ### Change Logs Stacked on top of #9581 Main changes in this PR: - Add a new implementation `HoodiePositionBasedMergedLogRecordReader` that uses record positions to do merging. - Add following methods in `BaseHoodieLogRecordReader`: `processNextDeletePosition(long position)` and `processNextRecord(T record, Map metadata, Option position)`. These are used for position based merging. Positions are available from the log block headers. ### Impact Improved log record reader performance, better than key-based merging. ### Risk level (write none, low medium or high below) low The new reader is used only when `shouldUseRecordPositions` is set to true. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
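The position-based merging this PR introduces can be illustrated with a minimal sketch — a hypothetical simplification, not the `HoodiePositionBasedMergedLogRecordReader` implementation. Deletes arrive as base-file positions read from the log block headers, so the merge skips rows by ordinal position instead of comparing record keys:

```java
import java.util.*;

class PositionDeleteSketch {
    // Applies position-based deletes: rows are identified by their ordinal
    // position in the base file, so no key extraction or comparison is
    // needed while merging.
    static List<String> applyPositionDeletes(List<String> baseRows,
                                             Set<Long> deletePositions) {
        List<String> survivors = new ArrayList<>();
        for (int pos = 0; pos < baseRows.size(); pos++) {
            if (!deletePositions.contains((long) pos)) {
                survivors.add(baseRows.get(pos));
            }
        }
        return survivors;
    }
}
```

This is why the approach can outperform key-based merging: the per-row work reduces to a set lookup on a long rather than a key decode and compare.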
[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6796: -- Status: In Progress (was: Open) > Implement position-based deletes in FileGroupReader > --- > > Key: HUDI-6796 > URL: https://issues.apache.org/jira/browse/HUDI-6796 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6796) Implement position-based deletes in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6796: - Assignee: Sagar Sumit > Implement position-based deletes in FileGroupReader > --- > > Key: HUDI-6796 > URL: https://issues.apache.org/jira/browse/HUDI-6796 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745398875 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * 20921d8e6034d38eaefb739c021f4324b94f6803 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20207) * 1a11ff678d2345105879a6faa951c18d94dfa1ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20208) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]
linliu-code commented on code in PR #9809: URL: https://github.com/apache/hudi/pull/9809#discussion_r1344415595 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + + /** + * In some cases a business logic does some checks before flushing a merged record to the disk. + * This method does the check and the returned value contains two boolean variables. + * + * The first variable indicates if the merged record should be flushed to the disk or not. + * The second variable takes effect only when the first one is false, and it indicates if + * the old record should be kept or not. That is, + * (1) (true, _): the merged one is flushed to the disk; the old record is skipped. + * (2) (false, false): both records skipped, a delete operation. + * (3) (false, true): only the old record flushed to the disk. + * + * @param record the merged record. + * @param schema the schema of the merged record. + * @return a pair of boolean variables to indicate the flush decision. + * + * This interface is experimental and might be evolved in the future. + **/ + default Pair shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException { Review Comment: @danny0405 , saw your comment in the slack, we can use the simple signature right now since we can evolve it when we need in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]
linliu-code commented on code in PR #9809: URL: https://github.com/apache/hudi/pull/9809#discussion_r1344411636 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java: ## @@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable { */ Option> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException; + + /** + * In some cases a business logic does some checks before flushing a merged record to the disk. + * This method does the check and the returned value contains two boolean variables. + * + * The first variable indicates if the merged record should be flushed to the disk or not. + * The second variable takes effect only when the first one is false, and it indicates if + * the old record should be kept or not. That is, + * (1) (true, _): the merged one is flushed to the disk; the old record is skipped. + * (2) (false, false): both records skipped, a delete operation. + * (3) (false, true): only the old record flushed to the disk. + * + * @param record the merged record. + * @param schema the schema of the merged record. + * @return a pair of boolean variables to indicate the flush decision. + * + * This interface is experimental and might be evolved in the future. + **/ + default Pair shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException { Review Comment: > > This question could be very critical, > > I didn't see such request from any user, even for the contributor from Kuaishou, they just want to keep the merged record or drop it totally. Let's not introduce new semantics if there is no real use case as back-up. > > We can evolve the returned value as a `Pair` or `Enum` if there are more feedbacks, at this time point, the behavior for keeping the old record seems not clear to me. 
Even in the current implementation of `HoodieMergeHandle`, we still face this problem: when `shouldFlush` returns false, should the `writeRecord` function return true or false? Returning true means skipping the old record; returning false means keeping it. Whichever default we pick in advance, some users may still want the opposite behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
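The three cases in the proposed `shouldFlush` contract can be made explicit with a small sketch (illustrative only; the proposed API returns a `Pair` of booleans rather than an enum):

```java
class ShouldFlushSketch {
    enum Decision { FLUSH_MERGED, DELETE_BOTH, KEEP_OLD }

    // Interprets the Pair<Boolean, Boolean> contract from the javadoc:
    // the left value says whether to flush the merged record; the right
    // value is consulted only when the left is false, and says whether
    // to keep the old record instead.
    static Decision interpret(boolean flushMerged, boolean keepOld) {
        if (flushMerged) {
            return Decision.FLUSH_MERGED;   // (true, _)
        }
        return keepOld
            ? Decision.KEEP_OLD             // (false, true)
            : Decision.DELETE_BOTH;         // (false, false): a delete
    }
}
```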
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745336244 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206) * 20921d8e6034d38eaefb739c021f4324b94f6803 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20207) * 1a11ff678d2345105879a6faa951c18d94dfa1ba UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745317180 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206) * 20921d8e6034d38eaefb739c021f4324b94f6803 UNKNOWN
Re: [I] [SUPPORT] cleaner blocked due to HoodieRollbackException and FileAlreadyExistsException [hudi]
ehurheap commented on issue #9796: URL: https://github.com/apache/hudi/issues/9796#issuecomment-1745307682 Just noting that targeting fewer commits per cleaner run worked: the cleaner has completed successfully for several runs.
Re: [I] [SUPPORT] Hoodie MAGIC was written twice to a log file [hudi]
dat-vikash commented on issue #8887: URL: https://github.com/apache/hudi/issues/8887#issuecomment-1745309477 Bump on this. Also observing this issue with 0.13.1 and flink 1.16.1. The flink job continues to fail unless we manually delete those compaction.requested files.
[jira] [Closed] (HUDI-6892) ExternalSpillableMap may cause data duplication when flink compaction
[ https://issues.apache.org/jira/browse/HUDI-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Linleicheng closed HUDI-6892. - Fix Version/s: 1.0.0 Resolution: Fixed > ExternalSpillableMap may cause data duplication when flink compaction > - > > Key: HUDI-6892 > URL: https://issues.apache.org/jira/browse/HUDI-6892 > Project: Apache Hudi > Issue Type: Bug > Reporter: Linleicheng > Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > > To reproduce: > 1. Fill the in-memory map with records such that this.inMemoryMap.size() % NUMBER_OF_RECORDS_TO_ESTIMATE_PAYLOAD_SIZE == 0. > 2. Insert a record with key1 into ExternalSpillableMap. This triggers a payload size estimate and keeps currentInMemoryMapSize greater than or equal to maxInMemorySizeInBytes, so the record is spilled to disk. > 3. Insert key1 again with a smaller record, so that currentInMemoryMapSize falls below maxInMemorySizeInBytes; this time the record is put into the in-memory map. > > The key now exists in both maps, so it is duplicated when the map is finally iterated. -- This message was sent by Atlassian Jira (v8.20.10#820010)
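The reproduction steps in the Jira boil down to one key being routed to disk once and to memory later, with neither copy evicting the other. A toy model of just that routing logic (this is an illustrative sketch, not Hudi's actual `ExternalSpillableMap`; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpillRoutingToy {
    final Map<String, String> inMemory = new HashMap<>();
    final Map<String, String> onDisk = new HashMap<>();
    final long maxInMemoryBytes;
    long estimatedPayloadSize;   // periodically re-estimated, as in the bug report

    public SpillRoutingToy(long maxInMemoryBytes, long initialPayloadEstimate) {
        this.maxInMemoryBytes = maxInMemoryBytes;
        this.estimatedPayloadSize = initialPayloadEstimate;
    }

    // Routing depends on the *current* size estimate, and a key spilled earlier
    // is never removed from disk when a later put lands in memory -- that is the
    // duplication the report describes.
    public void put(String key, String value) {
        estimatedPayloadSize = value.length();  // crude stand-in for re-estimation
        long currentInMemoryBytes = (long) inMemory.size() * estimatedPayloadSize;
        if (currentInMemoryBytes + estimatedPayloadSize >= maxInMemoryBytes) {
            onDisk.put(key, value);             // spill to disk
        } else {
            inMemory.put(key, value);           // stale disk copy survives
        }
    }

    // Iteration concatenates both maps, so a key present in both appears twice.
    public List<String> keys() {
        List<String> all = new ArrayList<>(inMemory.keySet());
        all.addAll(onDisk.keySet());
        return all;
    }
}
```

Putting a large value for key1 (spills) followed by a small value for the same key (fits in memory) yields key1 twice on iteration, mirroring steps 2 and 3 of the report.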
[jira] [Updated] (HUDI-6892) ExternalSpillableMap may cause data duplication when flink compaction
[ https://issues.apache.org/jira/browse/HUDI-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Linleicheng updated HUDI-6892: -- Affects Version/s: (was: 0.14.0) Priority: Critical (was: Major)
Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]
li-ang-666 commented on issue #9804: URL: https://github.com/apache/hudi/issues/9804#issuecomment-1745007844 > parquet.avro.readInt96AsFixed Now I have changed the pom to version 0.14.0, but how can I use this option (`parquet.avro.readInt96AsFixed`) in online compaction?
Re: [I] commits_.archive is not move to archived folder [hudi]
njalan commented on issue #9812: URL: https://github.com/apache/hudi/issues/9812#issuecomment-1744988210 Metadata is not enabled on hudi 0.9. I listed the details for one table in ticket https://github.com/apache/hudi/issues/9751. Around 10% of my tables have archived commit files in the archived folder. I am using default parameters. But I haven't faced any data issues for the tables without newly archived commits.
Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]
noahtaite commented on issue #9805: URL: https://github.com/apache/hudi/issues/9805#issuecomment-1744956625 Hey @ad1happy2go, thank you for the response. I will try that experiment in my dev environment today and report back. Steps: - Bulk insert a table with multiple partitions (datasource=[1-2], year=[2000-2023], month=[1-9]) - Run delete_partition on datasource=1/* - Run glue sync; verify the partitions are removed from Glue. - Re-ingest the good partitions datasource=1, year=[2010-2023], month=[1-10] - Run glue sync; hopefully the partitions can be added back.
Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]
hudi-bot commented on PR #9776: URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744851626 ## CI report: * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN * 268d48b2f47310fd490b80052eff3a5d01aea5c9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20205)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744751092 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path
[ https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6795: -- Reviewers: Lin Liu, Vinoth Chandar (was: sivabalan narayanan, Vinoth Chandar) > Implement generation of record_positions for updates and deletes on write path > -- > > Key: HUDI-6795 > URL: https://issues.apache.org/jira/browse/HUDI-6795 > Project: Apache Hudi > Issue Type: New Feature > Reporter: Ethan Guo > Assignee: Ethan Guo > Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744672834 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204) * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]
hudi-bot commented on PR #9776: URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744658477 ## CI report: * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN * 06eb344da3e6c4ce270a2b63e34908a507aac786 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20202) * 268d48b2f47310fd490b80052eff3a5d01aea5c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20205)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744656800 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204) * b1581950ca129d2753c53ec77c0bf046701b2c92 UNKNOWN
Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]
hudi-bot commented on PR #9776: URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744589641 ## CI report: * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN * 06eb344da3e6c4ce270a2b63e34908a507aac786 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20202) * 268d48b2f47310fd490b80052eff3a5d01aea5c9 UNKNOWN
Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]
danny0405 commented on code in PR #9809: URL: https://github.com/apache/hudi/pull/9809#discussion_r1343793710 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java (same `shouldFlush` hunk as quoted above) Review Comment: > This question could be very critical, I didn't see such request from any user, even for the contributor from Kuaishou, they just want to keep the merged record or drop it totally. Let's not introduce new semantics if there is no real use case as back-up. We can evolve the returned value as a `Pair` or `Enum` if there are more feedbacks, at this time point, the behavior for keeping the old record seems not clear to me.
Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]
linliu-code commented on code in PR #9809: URL: https://github.com/apache/hudi/pull/9809#discussion_r1343534449 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java (same `shouldFlush` hunk as quoted above) Review Comment: > I kind of agree, we can simplify the returned value as a true/false. But maybe @linliu-code has some other considerations here, @linliu-code can you clarify. @danny0405, @codope, the reason that we need a pair of boolean variables is that: if a merger decides not to flush the combined record, it faces the question of whether the old record (the record in the base file) should be kept or not. This question could be very critical, so we should not guess it for the developer who implements their custom merger.
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744573109 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744555123 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * b71be9bbde91fb6afeb52dfc102ba789963b66ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20203) * a3ece047efd3ef23736d5162211c4176f31468e1 UNKNOWN
Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]
ad1happy2go commented on issue #9751: URL: https://github.com/apache/hudi/issues/9751#issuecomment-1744541419 @njalan Do you also see similar behaviour for tables written only with later versions of hudi (0.13) and not 0.9?
Re: [I] [SUPPORT] Is this the expected number of S3 calls? [hudi]
ad1happy2go commented on issue #9612: URL: https://github.com/apache/hudi/issues/9612#issuecomment-1744535879 There are quite a few after 0.11. Examples - https://github.com/apache/hudi/pull/7404 https://github.com/apache/hudi/pull/7436
Re: [I] [SUPPORT] NotSerializableException using SparkRDDWriteClient with OCC and DynamoDBBasedLockProvider [hudi]
ad1happy2go commented on issue #9807: URL: https://github.com/apache/hudi/issues/9807#issuecomment-1744517639 @ehurheap Did you try the same lock configuration with a normal insert on a test table, to ensure the configurations are good?
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744462777 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587) * b71be9bbde91fb6afeb52dfc102ba789963b66ff Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20203)
Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-173328 ## CI report: * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587) * b71be9bbde91fb6afeb52dfc102ba789963b66ff UNKNOWN