Re: [PR] [HUDI-6851] Fixing Spark quick start guide [hudi]

2023-10-03 Thread via GitHub


bhasudha merged PR #9712:
URL: https://github.com/apache/hudi/pull/9712


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]

2023-10-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1334928958


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period
+of time. The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi. Users can configure the strategies directly through
+table configs or through call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means that users have to define and implement some kind of TTL strategy, find
+expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it becomes
+increasingly important to implement a user-friendly TTL management mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy the same way we define a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides the strategy class (a subclass of
+`PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain`
+is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more
+than this number of days are considered expired. We will cover the `KeepByTimePartitionTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
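To make the extension point concrete, here is a minimal, illustrative sketch of a custom strategy. Only the abstract `getExpiredPartitionPaths()` method comes from this RFC; the constructor inputs and the time bookkeeping below are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Illustrative sketch only. A real strategy would obtain the partition paths and their
// modification times from the table and write configs rather than from the constructor.
public class KeepRecentPartitionsTTLStrategy extends PartitionTTLManagementStrategy {

  private final List<String> allPartitionPaths;         // all partition paths of the table (assumed input)
  private final Map<String, Long> lastModifiedEpochMs;  // partition path -> last modified time in epoch millis (assumed input)
  private final int daysToRetain;                       // e.g. the value of hoodie.partition.ttl.days.retain

  public KeepRecentPartitionsTTLStrategy(List<String> allPartitionPaths,
                                         Map<String, Long> lastModifiedEpochMs,
                                         int daysToRetain) {
    this.allPartitionPaths = allPartitionPaths;
    this.lastModifiedEpochMs = lastModifiedEpochMs;
    this.daysToRetain = daysToRetain;
  }

  @Override
  public List<String> getExpiredPartitionPaths() {
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(daysToRetain);
    // A partition is expired when it has not been modified within the retention window.
    return allPartitionPaths.stream()
        .filter(path -> lastModifiedEpochMs.getOrDefault(path, Long.MAX_VALUE) < cutoff)
        .collect(Collectors.toList());
  }
}
```

The strategy only decides which partitions are expired; the actual deletion would still be driven by Hudi (for example via `delete_partition`).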
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the
+duration between now and `lastModifiedTime` for a partition is larger than the configured
+`hoodie.partition.ttl.days.retain`, `KeepByTimePartitionTTLManagementStrategy` will mark the partition as expired.
+We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this. 
+
+We will use the largest commit time of committed file groups in the partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.
+
+For file groups generated by a replace commit, the commit time may not reveal the real insert/update time of the file
+group. However, when using this strategy we can assume that we won't cluster a partition that has had no new writes for
+a long time. In the future, we may introduce a more accurate mechanism to get the `lastModifiedTime` of a partition,
+for example using the metadata table. 
+
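As a rough sketch of the largest-commit-time rule above (assumptions: the caller supplies the instant times of the committed file groups in the partition, and instant times start with a `yyyyMMddHHmmss` prefix), the expiration check could look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Collection;

public final class PartitionExpirationCheck {
  // Assumption: instant times begin with a yyyyMMddHHmmss timestamp (possibly followed by millis).
  private static final DateTimeFormatter INSTANT_PREFIX = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  // commitTimes: instant times of committed file groups in one partition (assumed to be supplied by the caller)
  static boolean isExpired(Collection<String> commitTimes, long daysToRetain) {
    // The largest commit time acts as the partition's lastModifiedTime.
    return commitTimes.stream()
        .max(String::compareTo)
        .map(lastModifiedTime -> {
          Instant lastModified = LocalDateTime
              .parse(lastModifiedTime.substring(0, 14), INSTANT_PREFIX)
              .toInstant(ZoneOffset.UTC);
          return Duration.between(lastModified, Instant.now()).toDays() > daysToRetain;
        })
        .orElse(false); // no committed file groups: do not expire the partition
  }
}
```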
+### Apply different strategies for different partitions
+
+Some users may want to apply different strategies to different partitions. For example, they may have multiple partition
+fields (productId, day). For partitions und

Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]

2023-10-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1334257222


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period
+of time. The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi. Users can configure the strategies directly through
+table configs or through call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means that users have to define and implement some kind of TTL strategy, find
+expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it becomes
+increasingly important to implement a user-friendly TTL management mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy the same way we define a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides the strategy class (a subclass of
+`PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain`
+is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more
+than this number of days are considered expired. We will cover the `KeepByTimePartitionTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the
+duration between now and `lastModifiedTime` for a partition is larger than the configured
+`hoodie.partition.ttl.days.retain`, `KeepByTimePartitionTTLManagementStrategy` will mark the partition as expired.
+We use days as the unit of expiration time since it is commonly used for data lakes. Open to ideas for this. 
+
+We will use the largest commit time of committed file groups in the partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering, etc.) with a larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   In the current implementation, `HoodiePartitionMetadata` provides only the `commitTime` (partition creation commit time)
and `partitionDepth` properties. We can add a new `lastModifiedTime` property to `.hoodie_partition_metadata`, which is
updated on every commit/deltacommit to the corresponding partition.
   
   We only need to think about migration from a version without partition-level TTL to a new one with this feature.
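For illustration only: `.hoodie_partition_metadata` is a Java properties file, and under this suggestion it might carry one extra key. The `lastModifiedTime` entry below is hypothetical and does not exist today; the timestamp values are made up.

```properties
#partition metadata
#Tue Oct 03 12:00:00 UTC 2023
commitTime=20231001120000000
partitionDepth=2
# proposed new property, rewritten on every commit/deltacommit that touches this partition
lastModifiedTime=20231003120000000
```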




Re: [PR] [HUDI-5823][RFC-65] RFC for Partition TTL Management [hudi]

2023-10-03 Thread via GitHub


geserdugarov commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1334248487


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period
+of time. The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi. Users can configure the strategies directly through
+table configs or through call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+
+
+This proposal introduces a Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering.
+Users can configure their TTL strategies through write configs, and Hudi will find expired partitions and delete them automatically.
+
+## Background
+
+A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means that users have to define and implement some kind of TTL strategy, find
+expired partitions, and call `delete_partition` by themselves. As the scale of installations grows, it becomes
+increasingly important to implement a user-friendly TTL management mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Provide an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job or as a synchronous or asynchronous table service.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can define a TTL strategy the same way we define a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` provides the strategy class (a subclass of
+`PartitionTTLManagementStrategy`) that returns the expired partition paths to delete. `hoodie.partition.ttl.days.retain`
+is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`: partitions that have not been modified for more
+than this number of days are considered expired. We will cover the `KeepByTimePartitionTTLManagementStrategy` in detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.

Review Comment:
   Sorry to be difficult, but providing TTL functionality through a custom implementation of
`PartitionTTLManagementStrategy` is not user friendly.
   We want to automate detecting outdated partitions and calling `delete_partition`. Could we just allow users to set a
partition path specification with a TTL value, and implement everything internally?
   
   From my point of view, there are two main entities in TTL:
   - object
   In our case, it's the partition, which we define using `spec`.
   - definition of outdating
   It should be time or something time-dependent. In our case, we could compare the difference between the current time
and `_hoodie_commit_time` with a user-defined delta value.
   
   This is the main scope for TTL, and we shouldn't allow more flexibility than that.
   A customized implementation of `PartitionTTLManagementStrategy` would allow doing anything with partitions. It could
still be a `PartitionManagementStrategy`, but then we shouldn't name it with the `TTL` part.






Re: [PR] [WIP][DO NOT MERGE][DOCS] Add release notes for 0.14.0 [hudi]

2023-10-03 Thread via GitHub


codope commented on code in PR #9790:
URL: https://github.com/apache/hudi/pull/9790#discussion_r1345180240


##
website/releases/release-0.14.0.md:
##
@@ -0,0 +1,339 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) 
([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new 
functionalities and enhancements. 
+These include the introduction of Record Level Index, automatic generation of 
record keys, the `hudi_table_changes` 
+function for incremental reads, and more. Notably, this release also 
incorporates support for Spark 3.4. On the Flink 
+front, version 0.14.0 brings several exciting features such as consistent hashing index support, Flink 1.17 support,
+and Update and Delete statement support. Additionally, this release upgrades the 
Hudi table version, prompting users to consult
+the Migration Guide provided below. We encourage users to review the [release 
highlights](#release-highlights),
+[breaking changes](#breaking-changes), and [behavior 
changes](#behavior-changes) before 
+adopting the 0.14.0 release.
+
+
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans 
from the ".aux" folder and the introduction
+of a new log block version. As part of this release, the table version is 
updated to version `6`. When running a Hudi job 
+with version 0.14.0 on a table with an older table version, an automatic 
upgrade process is triggered to bring the table 
+up to version `6`. This upgrade is a one-time occurrence for each Hudi table, 
as the `hoodie.table.version` is updated in
+the property file upon completion of the upgrade. Additionally, a command-line 
tool for downgrading has been included, 
+allowing users to move from table version `6` to `5`, or revert from Hudi 
0.14.0 to a version prior to 0.14.0. To use this 
+tool, execute it from a 0.14.0 environment. For more details, refer to the 
+[hudi-cli](/docs/cli/#upgrade-and-downgrade-table).
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade 
instructions from each older release in
+sequence.
+:::
+
+### Bundle Updates
+
+#### New Spark Bundles
+In this release, we've expanded our support to include bundles for both Spark 
3.4 
+([hudi-spark3.4-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.4-bundle_2.12))
 
+and Spark 3.0 
([hudi-spark3.0-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.0-bundle_2.12)).
+Please note that support for Spark 3.0 had been discontinued after Hudi version 0.10.1, but due to strong community
+interest, it has been reinstated in this release.
+
+### Breaking Changes
+
+#### INSERT INTO behavior with Spark SQL
+Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL 
followed the upsert flow, where multiple versions 
+of records would be merged into one version. However, starting from 0.14.0, 
we've altered the default behavior of 
+`INSERT INTO` to utilize the `insert` flow internally. This change 
significantly enhances write performance as it 
+bypasses index lookups.
+
+If a table is created with a *preCombine* key, the default operation for 
`INSERT INTO` remains as `upsert`. Conversely, 
+if no *preCombine* key is set, the underlying write operation for `INSERT 
INTO` defaults to `insert`. Users have the 
+flexibility to override this behavior by explicitly setting values for the 
config 
+[`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation)
 
+as per their requirements. Possible values for this config include `insert`, 
`bulk_insert`, and `upsert`.
+
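As a small, illustrative sketch (the table and view names are hypothetical, and the session is assumed to already be configured with the Hudi Spark bundle and SQL extensions), the operation can be overridden per session before running `INSERT INTO`:

```java
import org.apache.spark.sql.SparkSession;

public class InsertIntoOperationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-insert-into-operation")
        .getOrCreate();

    // Override the default write operation used by INSERT INTO for this session.
    spark.sql("SET hoodie.spark.sql.insert.into.operation = bulk_insert");

    // hudi_tbl and staging_view are hypothetical; hudi_tbl is an existing Hudi table.
    spark.sql("INSERT INTO hudi_tbl SELECT * FROM staging_view");
  }
}
```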
+Additionally, in version 0.14.0, we have **deprecated** two related older 
configs:
+- `hoodie.sql.insert.mode`
+- `hoodie.sql.bulk.insert.enable`.
+
+### Behavior changes
+
+#### Simplified duplicates handling with Inserts in Spark SQL
+In cases where the operation type is configured as `insert` for the Spark SQL 
`INSERT INTO` flow, users now have the 
+option to enforce a duplicate policy using the configuration setting 
+[`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy).
 
+This policy determines the action taken when incoming records being ingested 
already exist in storage. The available 
+values for this configuration are as follows:
+
+- `none`: No specific action is taken, allowing duplicates to exist in the 
Hudi table if the incoming records contain duplicates.
+- `drop`: Matching records from the incoming writes will be dropped, and the 
remaining ones will be ingested.
+- `fail`: The write operation will fail if the same records are re-ingested. 
In essence, a given record, as determined 
+by the

Re: [I] commits_.archive is not move to archived folder [hudi]

2023-10-03 Thread via GitHub


ad1happy2go commented on issue #9812:
URL: https://github.com/apache/hudi/issues/9812#issuecomment-1746106261

   @njalan Looks like it's more of an upgrade issue. I don't see many active commits in the hoodie file list you pasted
in another ticket. Archival will kick in for more than 20-30 commits. Let us know if removing the archived commits from
the .hoodie directory also helped in reducing S3 operations (your other issue).
   
   Please let us know if you don't see archival happening when you have more than 30 commits in the timeline.





Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]

2023-10-03 Thread via GitHub


ad1happy2go commented on issue #9805:
URL: https://github.com/apache/hudi/issues/9805#issuecomment-1746102499

   @noahtaite Thanks for all the effort. Yes, it should be supported. I saw you did the same `Partition delete with glue
sync` as part of your solution. Did you face any issues when you tried that?





Re: [PR] [WIP][DO NOT MERGE][DOCS] Add release notes for 0.14.0 [hudi]

2023-10-03 Thread via GitHub


codope commented on code in PR #9790:
URL: https://github.com/apache/hudi/pull/9790#discussion_r1345164763


##
website/releases/release-0.14.0.md:
##
@@ -0,0 +1,339 @@
+---
+title: "Release 0.14.0"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
0.14.0](https://github.com/apache/hudi/releases/tag/release-0.14.0) 
([docs](/docs/quick-start-guide))
+Apache Hudi 0.14.0 marks a significant milestone with a range of new 
functionalities and enhancements. 
+These include the introduction of Record Level Index, automatic generation of 
record keys, the `hudi_table_changes` 
+function for incremental reads, and more. Notably, this release also 
incorporates support for Spark 3.4. On the Flink 
+front, version 0.14.0 brings several exciting features such as consistent hashing index support, Flink 1.17 support,
+and Update and Delete statement support. Additionally, this release upgrades the 
Hudi table version, prompting users to consult
+the Migration Guide provided below. We encourage users to review the [release 
highlights](#release-highlights),
+[breaking changes](#breaking-changes), and [behavior 
changes](#behavior-changes) before 
+adopting the 0.14.0 release.
+
+
+
+## Migration Guide
+In version 0.14.0, we've made changes such as the removal of compaction plans 
from the ".aux" folder and the introduction
+of a new log block version. As part of this release, the table version is 
updated to version `6`. When running a Hudi job 
+with version 0.14.0 on a table with an older table version, an automatic 
upgrade process is triggered to bring the table 
+up to version `6`. This upgrade is a one-time occurrence for each Hudi table, 
as the `hoodie.table.version` is updated in
+the property file upon completion of the upgrade. Additionally, a command-line 
tool for downgrading has been included, 
+allowing users to move from table version `6` to `5`, or revert from Hudi 
0.14.0 to a version prior to 0.14.0. To use this 
+tool, execute it from a 0.14.0 environment. For more details, refer to the 
+[hudi-cli](/docs/cli/#upgrade-and-downgrade-table).
+
+:::caution
+If migrating from an older release (pre 0.14.0), please also check the upgrade 
instructions from each older release in
+sequence.
+:::
+
+### Bundle Updates
+
+#### New Spark Bundles
+In this release, we've expanded our support to include bundles for both Spark 
3.4 
+([hudi-spark3.4-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.4-bundle_2.12))
 
+and Spark 3.0 
([hudi-spark3.0-bundle_2.12](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.0-bundle_2.12)).
+Please note that support for Spark 3.0 had been discontinued after Hudi version 0.10.1, but due to strong community
+interest, it has been reinstated in this release.
+
+### Breaking Changes
+
+#### INSERT INTO behavior with Spark SQL
+Before version 0.14.0, data ingested through `INSERT INTO` in Spark SQL 
followed the upsert flow, where multiple versions 
+of records would be merged into one version. However, starting from 0.14.0, 
we've altered the default behavior of 
+`INSERT INTO` to utilize the `insert` flow internally. This change 
significantly enhances write performance as it 
+bypasses index lookups.
+
+If a table is created with a *preCombine* key, the default operation for 
`INSERT INTO` remains as `upsert`. Conversely, 
+if no *preCombine* key is set, the underlying write operation for `INSERT 
INTO` defaults to `insert`. Users have the 
+flexibility to override this behavior by explicitly setting values for the 
config 
+[`hoodie.spark.sql.insert.into.operation`](https://hudi.apache.org/docs/configurations#hoodiesparksqlinsertintooperation)
 
+as per their requirements. Possible values for this config include `insert`, 
`bulk_insert`, and `upsert`.
+
+Additionally, in version 0.14.0, we have **deprecated** two related older 
configs:
+- `hoodie.sql.insert.mode`
+- `hoodie.sql.bulk.insert.enable`.
+
+### Behavior changes
+
+#### Simplified duplicates handling with Inserts in Spark SQL
+In cases where the operation type is configured as `insert` for the Spark SQL 
`INSERT INTO` flow, users now have the 
+option to enforce a duplicate policy using the configuration setting 
+[`hoodie.datasource.insert.dup.policy`](https://hudi.apache.org/docs/configurations#hoodiedatasourceinsertduppolicy).
 
+This policy determines the action taken when incoming records being ingested 
already exist in storage. The available 
+values for this configuration are as follows:
+
+- `none`: No specific action is taken, allowing duplicates to exist in the 
Hudi table if the incoming records contain duplicates.
+- `drop`: Matching records from the incoming writes will be dropped, and the 
remaining ones will be ingested.
+- `fail`: The write operation will fail if the same records are re-ingested. 
In essence, a given record, as determined 
+by the

Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]

2023-10-03 Thread via GitHub


danny0405 commented on issue #9804:
URL: https://github.com/apache/hudi/issues/9804#issuecomment-1746061222

   It should be like this: `--hoodie-conf k1=v1,k2=v2`; for your option, it should be `--hoodie-conf hadoop.parquet.avro.readInt96AsFixed=true`.





Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]

2023-10-03 Thread via GitHub


danny0405 commented on issue #9804:
URL: https://github.com/apache/hudi/issues/9804#issuecomment-1746059316

   Did you check your `.hoodie/hoodie.properties` file to see whether there is 
a table schema option?





[jira] [Updated] (HUDI-6914) String type partition value returned for a query on table partitioned by integer

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6914:
--
Description: Partition a table on a non-string field. If I query a table 
partitioned on the partition field, the value is returned as a string even 
though the type is int in the schema. Happens only when using 
ComplexKeyGenerator and CustomKeyGenerator.  (was: Partition a table on a 
non-string field. If I query a table partitioned on the partition field, the 
value is returned as a string even though the type is int in the schema)

> String type partition value returned for a query on table partitioned by 
> integer
> 
>
> Key: HUDI-6914
> URL: https://issues.apache.org/jira/browse/HUDI-6914
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 0.14.1
>
>
> Partition a table on a non-string field. If I query a table partitioned on 
> the partition field, the value is returned as a string even though the type 
> is int in the schema. Happens only when using ComplexKeyGenerator and 
> CustomKeyGenerator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


codope commented on code in PR #9581:
URL: https://github.com/apache/hudi/pull/9581#discussion_r1344501029


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDeleteBlock.java:
##
@@ -65,17 +69,44 @@ public class HoodieDeleteBlock extends HoodieLogBlock {
   private static final Lazy<HoodieDeleteRecord.Builder> HOODIE_DELETE_RECORD_BUILDER_STUB =
       Lazy.lazily(HoodieDeleteRecord::newBuilder);
 
+  private final boolean writeRecordPositions;
+  // Records to delete, sorted based on the record position if writing record positions to the log block header
   private DeleteRecord[] recordsToDelete;
 
-  public HoodieDeleteBlock(DeleteRecord[] recordsToDelete, Map<HeaderMetadataType, String> header) {
-    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>());
-    this.recordsToDelete = recordsToDelete;
+  public HoodieDeleteBlock(List<Pair<DeleteRecord, Long>> recordsToDelete,
+                           boolean writeRecordPositions,
+                           Map<HeaderMetadataType, String> header) {
+    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>(), writeRecordPositions);
+    if (writeRecordPositions) {
+      recordsToDelete.sort((o1, o2) -> {
+        long v1 = o1.getRight();
+        long v2 = o2.getRight();
+        return Long.compare(v1, v2);
+      });
+      if (recordsToDelete.get(0).getRight() > -1L) {
+        addRecordPositionsToHeader(

Review Comment:
   Record position can be invalid (-1) when:
   
   1. the current location is not known (new inserts going to a log file).
   2. the base format is HFile (record position is not supported for HFile).



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -89,10 +90,11 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle records,
+ boolean writeRecordPositions,
  Map header,
  Map footer,
  String keyFieldName) {
 super(header, footer, Option.empty(), Option.empty(), null, false);
+if (writeRecordPositions) {
+  records.sort((o1, o2) -> {

Review Comment:
   I've enabled writing record positions for some of the existing tests, which include the following scenarios:
   1. MOR table with compaction.
   2. MOR table with clustering.
   3. MOR table with clustering and no base file (before clustering).
   4. COW/MOR table with different index types.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -173,18 +173,18 @@ public static  HoodieRecord 
tagRecord(HoodieRecord record, HoodieRecord
    *
    * @param filePath            - File to filter keys from
    * @param candidateRecordKeys - Candidate keys to filter
-   * @return List of candidate keys that are available in the file
+   * @return List of pairs of candidate keys and positions that are available in the file
    */
-  public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
-                                                Configuration configuration) throws HoodieIndexException {
+  public static List<Pair<String, Long>> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
+                                                             Configuration configuration) throws HoodieIndexException {
     ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath));
-    List<String> foundRecordKeys = new ArrayList<>();
+    List<Pair<String, Long>> foundRecordKeys = new ArrayList<>();
     try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
         .getFileReader(configuration, filePath)) {
       // Load all rowKeys from the file, to double-confirm
       if (!candidateRecordKeys.isEmpty()) {
         HoodieTimer timer = HoodieTimer.start();
-        Set<String> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));
+        Set<Pair<String, Long>> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));

Review Comment:
   Keys should be sorted for the HFile reader. For others, it doesn't matter as long as it is a set (for an efficient
contains check). Incorporated the above suggestion: here we just pass a set and init a `SortedSet` only in
`HoodieHFileAvroReader`, instead of building a `TreeSet` for every other format.






[jira] [Commented] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query

2023-10-03 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771687#comment-17771687
 ] 

Lin Liu commented on HUDI-6786:
---

I met some serializable issues and am trying to fix them.

> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR 
> Snapshot Query
> --
>
> Key: HUDI-6786
> URL: https://issues.apache.org/jira/browse/HUDI-6786
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Goal: When `NewHoodieParquetFileFormat` is enabled with 
> `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
> Snapshot query should use HoodieFileGroupReader.  All relevant tests on basic 
> MOR snapshot query should pass (except for the caveats in the current 
> HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in 
> this EPIC).
> The query logic is implemented in 
> `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the 
> following code for MOR snapshot query:
> {code:java}
> else {
>   if (logFiles.nonEmpty) {
> val baseFile = createPartitionedFile(InternalRow.empty, 
> hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
> buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, 
> filePath.getParent, requiredSchemaWithMandatory,
>   requiredSchemaWithMandatory, outputSchema, partitionSchema, 
> partitionValues, broadcastedHadoopConf.value.value)
>   } else {
> throw new IllegalStateException("should not be here since file slice 
> should not have been broadcasted since it has no log or data files")
> //baseFileReader(baseFile)
>   } {code}
> `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, 
> with a new config `hoodie.read.use.new.file.group.reader`, by passing in the 
> correct base and log file list.





[jira] [Assigned] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6795:
-

Assignee: Sagar Sumit  (was: Ethan Guo)

> Implement generation of record_positions for updates and deletes on write path
> --
>
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






Re: [I] [SUPPORT] Hoodie MAGIC was written twice to a log file [hudi]

2023-10-03 Thread via GitHub


danny0405 commented on issue #8887:
URL: https://github.com/apache/hudi/issues/8887#issuecomment-1746012903

   @dat-vikash Thanks for the feedback, let's see whether #8526 can solve this 
problem.





[jira] [Updated] (HUDI-6902) Detect flaky tests

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6902:
--
Description: 
Step 1: Create a dummy PR and try to trigger the errors if possible.

1. The integration test constantly fails.

2. Some random failures: 
[https://github.com/apache/hudi/actions/runs/6396038672]

  was:Step 1: Create a dummy PR and try to trigger the errors if possible.


> Detect flaky tests
> --
>
> Key: HUDI-6902
> URL: https://issues.apache.org/jira/browse/HUDI-6902
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> Step 1: Create a dummy PR and try to trigger the errors if possible.
> 1. The integration test constantly fails.
> 2. Some random failures: 
> [https://github.com/apache/hudi/actions/runs/6396038672]





[jira] [Commented] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query

2023-10-03 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771669#comment-17771669
 ] 

Lin Liu commented on HUDI-6786:
---

So far I have created the `HoodieFileGroupReader` object inside `NewHoodieParquetFileFormat` for MOR tables with log
files. I am running the tests in `TestNewHoodieParquetFileFormat` to detect any hidden bugs.

> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR 
> Snapshot Query
> --
>
> Key: HUDI-6786
> URL: https://issues.apache.org/jira/browse/HUDI-6786
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Goal: When `NewHoodieParquetFileFormat` is enabled with 
> `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
> Snapshot query should use HoodieFileGroupReader.  All relevant tests on basic 
> MOR snapshot query should pass (except for the caveats in the current 
> HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in 
> this EPIC).
> The query logic is implemented in 
> `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the 
> following code for MOR snapshot query:
> {code:java}
> else {
>   if (logFiles.nonEmpty) {
> val baseFile = createPartitionedFile(InternalRow.empty, 
> hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
> buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, 
> filePath.getParent, requiredSchemaWithMandatory,
>   requiredSchemaWithMandatory, outputSchema, partitionSchema, 
> partitionValues, broadcastedHadoopConf.value.value)
>   } else {
> throw new IllegalStateException("should not be here since file slice 
> should not have been broadcasted since it has no log or data files")
> //baseFileReader(baseFile)
>   } {code}
> `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, 
> with a new config `hoodie.read.use.new.file.group.reader`, by passing in the 
> correct base and log file list.





Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]

2023-10-03 Thread via GitHub


linliu-code commented on PR #9809:
URL: https://github.com/apache/hudi/pull/9809#issuecomment-1745861430

   Test failures seem unrelated. Ran the failing tests locally, and they passed.





Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1745804741

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 56aa98d5988a61597e76208f7d16018671e989bc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20209)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Test out of box schema evolution for deltastreamer [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1745750678

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * f627d9bb49a3bb4038b411caebbe31791a194073 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20198)
 
   * 56aa98d5988a61597e76208f7d16018671e989bc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]

2023-10-03 Thread via GitHub


noahtaite commented on issue #9805:
URL: https://github.com/apache/hudi/issues/9805#issuecomment-1745641435

   I will also do one more experiment to try a manual glue sync in between to 
see if that fixes the partitions as expected.





Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]

2023-10-03 Thread via GitHub


noahtaite commented on issue #9805:
URL: https://github.com/apache/hudi/issues/9805#issuecomment-1745640608

   Hey @ad1happy2go
   
   Reproduced and workaround found in my dev environment as follows:
   ```
   Reproduce
   - Generate table with glue sync
   - Partition delete without glue sync
   - Partitions aren’t removed from glue
   - Bulk insert with glue sync
   - Partitions aren’t there
   
   Following works as expected:
- Generate table with glue sync
   - Partition delete with glue sync
   - Partitions are removed from glue
   - Bulk insert with glue sync
   - Partitions are there
   ```
   
   So it seems I can get the expected behaviour by configuring my DELETE_PARTITION write to use AWS Glue sync as well.
I assume the next bulk_insert does glue sync across the replacecommit + deltacommit, and so drops those incoming partitions.
   
   Maybe this is just a documentation issue more than anything? There is not much documentation on how to use the
DELETE_PARTITION operation as a whole, with the best example (IMO) being this video by @soumilshah1995 :
https://www.youtube.com/watch?v=QqCiycIgSFk&t=387s
   
   In this video he has Glue sync disabled for DELETE_PARTITION, which I thought must be necessary for delete_partition
to work. Is enabling glue sync for the DELETE_PARTITION operation supported?





[jira] [Closed] (HUDI-6907) E2E support HoodieSparkRecord

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu closed HUDI-6907.
-
Resolution: Done

> E2E support HoodieSparkRecord 
> --
>
> Key: HUDI-6907
> URL: https://issues.apache.org/jira/browse/HUDI-6907
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> As title.
>  
> We have confirmed that the `HoodieSparkRecord` payload, that is `InternalRow` 
> is written to the disk.
> Though I have traced through the execution and created a reasonable workflow, and
> did not find any transformation happening, I cannot 100% guarantee it did not
> happen at all. We need a better way to confirm that. But it should be OK
> for now.





[jira] [Closed] (HUDI-6908) Verify if any gaps exists for the e2e support

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu closed HUDI-6908.
-
Resolution: Done

> Verify if any gaps exists for the e2e support
> -
>
> Key: HUDI-6908
> URL: https://issues.apache.org/jira/browse/HUDI-6908
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6907) E2E support HoodieSparkRecord

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6907:
--
Description: 
As title.

 

We have confirmed that the `HoodieSparkRecord` payload, that is `InternalRow` 
is written to the disk.

Though I have traced through the execution and created a reasonable workflow, and
did not find any transformation happening, I cannot 100% guarantee it did not
happen at all. We need a better way to confirm that. But it should be OK for
now.

  was:
As title.

 

We have confirmed that the `HoodieSparkRecord` is written to the disk.

But we haven't confirmed whether the `HoodieSparkRecord` has been transformed during
the process, though I have traced through the execution and did not find any
transformation happening. But I cannot guarantee it did not happen at all. We
need a better way to confirm that. But it should be OK for now.


> E2E support HoodieSparkRecord 
> --
>
> Key: HUDI-6907
> URL: https://issues.apache.org/jira/browse/HUDI-6907
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> As title.
>  
> We have confirmed that the `HoodieSparkRecord` payload, that is `InternalRow` 
> is written to the disk.
> Though I have traced through the execution and created a reasonable workflow, and
> did not find any transformation happening, I cannot 100% guarantee it did not
> happen at all. We need a better way to confirm that. But it should be OK
> for now.





[jira] [Resolved] (HUDI-6907) E2E support HoodieSparkRecord

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu resolved HUDI-6907.
---

> E2E support HoodieSparkRecord 
> --
>
> Key: HUDI-6907
> URL: https://issues.apache.org/jira/browse/HUDI-6907
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> As title.
>  
> We have confirmed that the `HoodieSparkRecord` is written to the disk.
> But we haven't confirmed whether the `HoodieSparkRecord` has been transformed
> during the process, though I have traced through the execution and did not find
> any transformation happening. But I cannot guarantee it did not happen at
> all. We need a better way to confirm that. But it should be OK for now.





[jira] [Commented] (HUDI-6907) E2E support HoodieSparkRecord

2023-10-03 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771594#comment-17771594
 ] 

Lin Liu commented on HUDI-6907:
---

We will close this task for now. We may need to revisit it later.

> E2E support HoodieSparkRecord 
> --
>
> Key: HUDI-6907
> URL: https://issues.apache.org/jira/browse/HUDI-6907
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> As title.
>  
> We have confirmed that the `HoodieSparkRecord` is written to the disk.
> But we haven't confirmed whether the `HoodieSparkRecord` has been transformed
> during the process, though I have traced through the execution and did not find
> any transformation happening. But I cannot guarantee it did not happen at
> all. We need a better way to confirm that. But it should be OK for now.





[jira] [Updated] (HUDI-6907) E2E support HoodieSparkRecord

2023-10-03 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6907:
--
Description: 
As title.

 

We have confirmed that the `HoodieSparkRecord` is written to the disk.

But we haven't confirmed whether the `HoodieSparkRecord` has been transformed during
the process, though I have traced through the execution and did not find any
transformation happening. But I cannot guarantee it did not happen at all. We
need a better way to confirm that. But it should be OK for now.

  was:As title.


> E2E support HoodieSparkRecord 
> --
>
> Key: HUDI-6907
> URL: https://issues.apache.org/jira/browse/HUDI-6907
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> As title.
>  
> We have confirmed that the `HoodieSparkRecord` is written to the disk.
> But we haven't confirmed whether the `HoodieSparkRecord` has been transformed
> during the process, though I have traced through the execution and did not find
> any transformation happening. But I cannot guarantee it did not happen at
> all. We need a better way to confirm that. But it should be OK for now.





Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745488652

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * 1a11ff678d2345105879a6faa951c18d94dfa1ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] NotSerializableException using SparkRDDWriteClient with OCC and DynamoDBBasedLockProvider [hudi]

2023-10-03 Thread via GitHub


ehurheap commented on issue #9807:
URL: https://github.com/apache/hudi/issues/9807#issuecomment-1745464144

   Yes. I can write a dataframe to the same table, for example:
   ```
   data.write
 .format("org.apache.hudi.Spark32PlusDefaultSource")
 .options(writeWithLocking)
 .mode("append")
 .save(tablePath)
   ```
   where writeWithLocking options are:
   ```
   (hoodie.bulkinsert.shuffle.parallelism,2)
   (hoodie.bulkinsert.sort.mode,NONE)
   (hoodie.clean.async,false)
   (hoodie.clean.automatic,false)
   (hoodie.cleaner.policy.failed.writes,LAZY)
   (hoodie.combine.before.insert,false)
   (hoodie.compact.inline,false)
   (hoodie.compact.schedule.inline,false)
   (hoodie.datasource.compaction.async.enable,false)
   (hoodie.datasource.write.hive_style_partitioning,true)
   
(hoodie.datasource.write.keygenerator.class,org.apache.spark.sql.hudi.command.UuidKeyGenerator)
   (hoodie.datasource.write.operation,bulk_insert)
   (hoodie.datasource.write.partitionpath.field,env_id,week)
   (hoodie.datasource.write.precombine.field,schematized_at)
   (hoodie.datasource.write.recordkey.field,env_id,user_id)
   (hoodie.datasource.write.row.writer.enable,false)
   (hoodie.datasource.write.table.type,MERGE_ON_READ)
   (hoodie.metadata.enable,false)
   (hoodie.table.name,users_changes)
   (hoodie.write.concurrency.mode,OPTIMISTIC_CONCURRENCY_CONTROL)
   (hoodie.write.lock.dynamodb.endpoint_url,http://localhost:8000)
   (hoodie.write.lock.dynamodb.partition_key,users_changes-us-east-1-local)
   (hoodie.write.lock.dynamodb.region,us-east-1)
   (hoodie.write.lock.dynamodb.table,datalake-locks)
   
(hoodie.write.lock.provider,org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider)
   ```
   
   These locking configs are also in our production ingestion which writes to 
hudi using spark structured streaming without error. 





[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6796:
--
Status: Patch Available  (was: In Progress)

> Implement position-based deletes in FileGroupReader
> ---
>
> Key: HUDI-6796
> URL: https://issues.apache.org/jira/browse/HUDI-6796
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6797) Implement position-based updates in FileGroupReader

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6797:
-

Assignee: Sagar Sumit

> Implement position-based updates in FileGroupReader
> ---
>
> Key: HUDI-6797
> URL: https://issues.apache.org/jira/browse/HUDI-6797
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6797) Implement position-based updates in FileGroupReader

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6797:
--
Status: In Progress  (was: Open)

> Implement position-based updates in FileGroupReader
> ---
>
> Key: HUDI-6797
> URL: https://issues.apache.org/jira/browse/HUDI-6797
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader

2023-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6796:
-
Labels: pull-request-available  (was: )

> Implement position-based deletes in FileGroupReader
> ---
>
> Key: HUDI-6796
> URL: https://issues.apache.org/jira/browse/HUDI-6796
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-6796][WIP] Use position-based deletes in FileGroupReader [hudi]

2023-10-03 Thread via GitHub


codope opened a new pull request, #9818:
URL: https://github.com/apache/hudi/pull/9818

   ### Change Logs
   
   Stacked on top of #9581 
   
   Main changes in this PR:
   
   - Add a new implementation `HoodiePositionBasedMergedLogRecordReader` that 
uses record positions to do merging.
   - Add the following methods in `BaseHoodieLogRecordReader`: `processNextDeletePosition(long position)` and `processNextRecord(T record, Map metadata, Option position)`. These are used for position-based merging; positions are available from the log block headers.
   
   ### Impact
   
   Improved log record reader performance: position-based merging is faster than key-based merging.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   The new reader is used only when `shouldUseRecordPositions` is set to true.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6796) Implement position-based deletes in FileGroupReader

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6796:
--
Status: In Progress  (was: Open)

> Implement position-based deletes in FileGroupReader
> ---
>
> Key: HUDI-6796
> URL: https://issues.apache.org/jira/browse/HUDI-6796
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6796) Implement position-based deletes in FileGroupReader

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6796:
-

Assignee: Sagar Sumit

> Implement position-based deletes in FileGroupReader
> ---
>
> Key: HUDI-6796
> URL: https://issues.apache.org/jira/browse/HUDI-6796
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745398875

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * 20921d8e6034d38eaefb739c021f4324b94f6803 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20207)
 
   * 1a11ff678d2345105879a6faa951c18d94dfa1ba Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]

2023-10-03 Thread via GitHub


linliu-code commented on code in PR #9809:
URL: https://github.com/apache/hudi/pull/9809#discussion_r1344415595


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##
@@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable {
*/
   Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException;
 
+
+  /**
+   * In some cases a business logic does some checks before flushing a merged 
record to the disk.
+   * This method does the check and the returned value contains two boolean 
variables.
+   * 
+   * The first variable indicates if the merged record should be flushed to 
the disk or not.
+   * The second variable takes effect only when the first one is false, and it 
indicates if
+   * the old record should be kept or not. That is,
+   * (1) (true, _):   the merged one is flushed to the disk; the old record is 
skipped.
+   * (2) (false, false):  both records skipped, a delete operation.
+   * (3) (false, true):   only the old record flushed to the disk.
+   *
+   * @param record  the merged record.
+   * @param schema  the schema of the merged record.
+   * @return a pair of boolean variables to indicate the flush decision.
+   *
+   *  This interface is experimental and might be evolved in the future.
+   **/
+  default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {

Review Comment:
   @danny0405 , saw your comment in Slack; we can use the simple signature for now since we can evolve it when we need to in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]

2023-10-03 Thread via GitHub


linliu-code commented on code in PR #9809:
URL: https://github.com/apache/hudi/pull/9809#discussion_r1344411636


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##
@@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable {
*/
   Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException;
 
+
+  /**
+   * In some cases a business logic does some checks before flushing a merged 
record to the disk.
+   * This method does the check and the returned value contains two boolean 
variables.
+   * 
+   * The first variable indicates if the merged record should be flushed to 
the disk or not.
+   * The second variable takes effect only when the first one is false, and it 
indicates if
+   * the old record should be kept or not. That is,
+   * (1) (true, _):   the merged one is flushed to the disk; the old record is 
skipped.
+   * (2) (false, false):  both records skipped, a delete operation.
+   * (3) (false, true):   only the old record flushed to the disk.
+   *
+   * @param record  the merged record.
+   * @param schema  the schema of the merged record.
+   * @return a pair of boolean variables to indicate the flush decision.
+   *
+   *  This interface is experimental and might be evolved in the future.
+   **/
+  default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {

Review Comment:
   > > This question could be very critical,
   > 
   > I haven't seen such a request from any user; even the contributor from Kuaishou just wants to keep the merged record or drop it entirely. Let's not introduce new semantics if there is no real use case to back them up.
   > 
   > We can evolve the returned value into a `Pair` or `Enum` if there is more feedback; at this point, the behavior of keeping the old record is not clear to me.
   
   Even in the current implementation of `HoodieMergeHandle`, we still face this problem: when `shouldFlush` returns false, should `writeRecord` return true or false? Returning true means skipping the old record; returning false means keeping it. No matter which one we choose in advance, we still face the possibility that a user wants the other behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745336244

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
 
   * 20921d8e6034d38eaefb739c021f4324b94f6803 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20207)
 
   * 1a11ff678d2345105879a6faa951c18d94dfa1ba UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1745317180

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
 
   * 20921d8e6034d38eaefb739c021f4324b94f6803 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] cleaner blocked due to HoodieRollbackException and FileAlreadyExistsException [hudi]

2023-10-03 Thread via GitHub


ehurheap commented on issue #9796:
URL: https://github.com/apache/hudi/issues/9796#issuecomment-1745307682

   Just noting that targeting fewer commits per cleaner run worked: the cleaner has completed successfully for several runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hoodie MAGIC was written twice to a log file [hudi]

2023-10-03 Thread via GitHub


dat-vikash commented on issue #8887:
URL: https://github.com/apache/hudi/issues/8887#issuecomment-1745309477

   Bump on this. We are also observing this issue with 0.13.1 and Flink 1.16.1. The Flink job continues to fail unless we manually delete those compaction.requested files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6892) ExternalSpillableMap may cause data duplication when flink compaction

2023-10-03 Thread Linleicheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linleicheng closed HUDI-6892.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

> ExternalSpillableMap may cause data duplication when flink compaction
> -
>
> Key: HUDI-6892
> URL: https://issues.apache.org/jira/browse/HUDI-6892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Linleicheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Reproduce:
> 1. Fill the in-memory map with records so that this.inMemoryMap.size() % 
> NUMBER_OF_RECORDS_TO_ESTIMATE_PAYLOAD_SIZE == 0.
> 2. Insert a record with key1 into ExternalSpillableMap. This triggers a size 
> estimate; make sure currentInMemoryMapSize is still greater than or equal to 
> maxInMemorySizeInBytes, so the record is spilled to disk.
> 3. Reduce the size of the record for key1 so that currentInMemoryMapSize is 
> less than maxInMemorySizeInBytes when it is put into ExternalSpillableMap; 
> the record is then put into the in-memory map.
>
> The key now exists in both the in-memory map and the spilled map, so it is 
> duplicated when finally iterating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6892) ExternalSpillableMap may cause data duplication when flink compaction

2023-10-03 Thread Linleicheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linleicheng updated HUDI-6892:
--
Affects Version/s: (was: 0.14.0)
 Priority: Critical  (was: Major)

> ExternalSpillableMap may cause data duplication when flink compaction
> -
>
> Key: HUDI-6892
> URL: https://issues.apache.org/jira/browse/HUDI-6892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Linleicheng
>Priority: Critical
>  Labels: pull-request-available
>
> Reproduce:
> 1. Fill the in-memory map with records so that this.inMemoryMap.size() % 
> NUMBER_OF_RECORDS_TO_ESTIMATE_PAYLOAD_SIZE == 0.
> 2. Insert a record with key1 into ExternalSpillableMap. This triggers a size 
> estimate; make sure currentInMemoryMapSize is still greater than or equal to 
> maxInMemorySizeInBytes, so the record is spilled to disk.
> 3. Reduce the size of the record for key1 so that currentInMemoryMapSize is 
> less than maxInMemorySizeInBytes when it is put into ExternalSpillableMap; 
> the record is then put into the in-memory map.
>
> The key now exists in both the in-memory map and the spilled map, so it is 
> duplicated when finally iterating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT]hudi[0.13.1] on flink[1.16.2], after bulk_insert & bucket_index, get int96 exception when flink trigger compaction [hudi]

2023-10-03 Thread via GitHub


li-ang-666 commented on issue #9804:
URL: https://github.com/apache/hudi/issues/9804#issuecomment-1745007844

   > parquet.avro.readInt96AsFixed
   
   Now I have changed the pom to version 0.14.0, but how can I use this option (`parquet.avro.readInt96AsFixed`) in online compaction?
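   
   As an untested sketch (assuming Hudi's Flink `hadoop.`-prefixed option pass-through also reaches the parquet reader used by online compaction, which I have not verified on 0.14.0), something like this might work:
   ```
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;

   // Untested sketch: table name, schema and path are placeholders; the key
   // point is the 'hadoop.'-prefixed option, which is assumed to be forwarded
   // to the Hadoop/parquet configuration used by the online compactor.
   public class Int96CompactionSketch {
     public static void main(String[] args) {
       TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
       tEnv.executeSql(
           "CREATE TABLE hudi_tbl (\n"
               + "  id BIGINT PRIMARY KEY NOT ENFORCED,\n"
               + "  ts TIMESTAMP(3)\n"
               + ") WITH (\n"
               + "  'connector' = 'hudi',\n"
               + "  'path' = 'hdfs:///tmp/hudi_tbl',\n"
               + "  'table.type' = 'MERGE_ON_READ',\n"
               + "  'compaction.async.enabled' = 'true',\n"
               + "  'hadoop.parquet.avro.readInt96AsFixed' = 'true'\n"
               + ")");
     }
   }
   ```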


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] commits_.archive is not move to archived folder [hudi]

2023-10-03 Thread via GitHub


njalan commented on issue #9812:
URL: https://github.com/apache/hudi/issues/9812#issuecomment-1744988210

   Metadata is not enabled on Hudi 0.9. I listed the details for one table in ticket https://github.com/apache/hudi/issues/9751. Around 10% of my tables have archive commit files in the archived folder. I am using the default parameters, but I haven't faced any data issues for the tables that have no newly archived commits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] AWS Glue Sync bug with "delete_partition" operation [hudi]

2023-10-03 Thread via GitHub


noahtaite commented on issue #9805:
URL: https://github.com/apache/hudi/issues/9805#issuecomment-1744956625

   Hey @ad1happy2go thank you for the response. I will give that experiment a 
try in my dev environment today and let you know.
   
   I will do the following and let you know the result:
   - Bulk insert table with multiple partitions 
(datasource=[1-2],year=[2000-2023],month=[1-9])
   - Run delete_partition on datasource=1/*
   - Run glue sync. Verify partitions are removed from Glue.
   - Re-ingest good partitions datasource=1,year=[2010-2023],month=[1-10]
   - Run glue sync. Hopefully partitions can be added back.
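   
   For the delete_partition step, a rough sketch of what I plan to run (path, table name and the helper class are placeholders):
   ```
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   // Sketch only: issue delete_partition via the Spark datasource. The
   // dataframe contents should not matter for this operation; the configured
   // partition list drives which partitions are deleted.
   public class DeletePartitionSketch {
     public static void dropDatasourceOne(Dataset<Row> anyDf, String tablePath) {
       anyDf.limit(0).write()
           .format("hudi")
           .option("hoodie.table.name", "my_table")                        // placeholder
           .option("hoodie.datasource.write.operation", "delete_partition")
           .option("hoodie.datasource.write.partitions.to.delete", "datasource=1/*")
           .mode(SaveMode.Append)
           .save(tablePath);
     }
   }
   ```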


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9776:
URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744851626

   
   ## CI report:
   
   * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN
   * 268d48b2f47310fd490b80052eff3a5d01aea5c9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20205)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744751092

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path

2023-10-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6795:
--
Reviewers: Lin Liu, Vinoth Chandar  (was: sivabalan narayanan, Vinoth 
Chandar)

> Implement generation of record_positions for updates and deletes on write path
> --
>
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744672834

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204)
 
   * b1581950ca129d2753c53ec77c0bf046701b2c92 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9776:
URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744658477

   
   ## CI report:
   
   * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN
   * 06eb344da3e6c4ce270a2b63e34908a507aac786 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20202)
 
   * 268d48b2f47310fd490b80052eff3a5d01aea5c9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20205)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744656800

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204)
 
   * b1581950ca129d2753c53ec77c0bf046701b2c92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6642] Use completion time based file slicing [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9776:
URL: https://github.com/apache/hudi/pull/9776#issuecomment-1744589641

   
   ## CI report:
   
   * 6b730068fa6ca60dfdd04f720334a49fa19a8b31 UNKNOWN
   * 06eb344da3e6c4ce270a2b63e34908a507aac786 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20202)
 
   * 268d48b2f47310fd490b80052eff3a5d01aea5c9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]

2023-10-03 Thread via GitHub


danny0405 commented on code in PR #9809:
URL: https://github.com/apache/hudi/pull/9809#discussion_r1343793710


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##
@@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable {
*/
   Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException;
 
+
+  /**
+   * In some cases a business logic does some checks before flushing a merged 
record to the disk.
+   * This method does the check and the returned value contains two boolean 
variables.
+   * 
+   * The first variable indicates if the merged record should be flushed to 
the disk or not.
+   * The second variable takes effect only when the first one is false, and it 
indicates if
+   * the old record should be kept or not. That is,
+   * (1) (true, _):   the merged one is flushed to the disk; the old record is 
skipped.
+   * (2) (false, false):  both records skipped, a delete operation.
+   * (3) (false, true):   only the old record flushed to the disk.
+   *
+   * @param record  the merged record.
+   * @param schema  the schema of the merged record.
+   * @return a pair of boolean variables to indicate the flush decision.
+   *
+   *  This interface is experimental and might be evolved in the future.
+   **/
+  default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {

Review Comment:
   > This question could be very critical,
   
   I haven't seen such a request from any user; even the contributor from Kuaishou just wants to keep the merged record or drop it entirely. Let's not introduce new semantics if there is no real use case to back them up.
   
   We can evolve the returned value into a `Pair` or `Enum` if there is more feedback; at this point, the behavior of keeping the old record is not clear to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6702][RFC-46] Support customized logic [hudi]

2023-10-03 Thread via GitHub


linliu-code commented on code in PR #9809:
URL: https://github.com/apache/hudi/pull/9809#discussion_r1343534449


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##
@@ -46,6 +46,28 @@ public interface HoodieRecordMerger extends Serializable {
*/
   Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord older, Schema oldSchema, HoodieRecord newer, Schema newSchema, TypedProperties props) throws IOException;
 
+
+  /**
+   * In some cases a business logic does some checks before flushing a merged 
record to the disk.
+   * This method does the check and the returned value contains two boolean 
variables.
+   * 
+   * The first variable indicates if the merged record should be flushed to 
the disk or not.
+   * The second variable takes effect only when the first one is false, and it 
indicates if
+   * the old record should be kept or not. That is,
+   * (1) (true, _):   the merged one is flushed to the disk; the old record is 
skipped.
+   * (2) (false, false):  both records skipped, a delete operation.
+   * (3) (false, true):   only the old record flushed to the disk.
+   *
+   * @param record  the merged record.
+   * @param schema  the schema of the merged record.
+   * @return a pair of boolean variables to indicate the flush decision.
+   *
+   *  This interface is experimental and might be evolved in the future.
+   **/
+  default Pair<Boolean, Boolean> shouldFlush(HoodieRecord record, Schema schema, TypedProperties props) throws IOException {

Review Comment:
   > I kind of agree, we can simplify the returned value as a true/false. But 
maybe @linliu-code has some other considerations here, @linliu-code can you 
clarify.
   
   @danny0405, @codope, the reason we need a pair of boolean variables is that, if a merger decides not to flush the combined record, it faces the question of whether the old record (the record in the base file) should be kept or not. This question could be very critical, so we should not guess the answer for the developer who implements a custom merger.
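   
   To make the three cases concrete, here is a tiny self-contained sketch of how a write handle could act on the pair (names here are illustrative, not an actual Hudi API):
   ```
   // Illustrative only: maps the (flushMerged, keepOld) pair documented on
   // shouldFlush() to the three write-path outcomes. Not an actual Hudi API.
   public class FlushDecisionSketch {
     enum Action { WRITE_MERGED, WRITE_OLD, SKIP_BOTH }

     static Action interpret(boolean flushMerged, boolean keepOld) {
       if (flushMerged) {
         return Action.WRITE_MERGED;                  // (true, _): flush the merged record
       }
       return keepOld ? Action.WRITE_OLD              // (false, true): keep the old record
                      : Action.SKIP_BOTH;             // (false, false): a delete
     }

     public static void main(String[] args) {
       System.out.println(interpret(true, false));    // WRITE_MERGED
       System.out.println(interpret(false, true));    // WRITE_OLD
       System.out.println(interpret(false, false));   // SKIP_BOTH
     }
   }
   ```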
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744573109

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * a3ece047efd3ef23736d5162211c4176f31468e1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20204)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744555123

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * b71be9bbde91fb6afeb52dfc102ba789963b66ff Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20203)
 
   * a3ece047efd3ef23736d5162211c4176f31468e1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]

2023-10-03 Thread via GitHub


ad1happy2go commented on issue #9751:
URL: https://github.com/apache/hudi/issues/9751#issuecomment-1744541419

   @njalan Do you also see similar behaviour for tables that were written only with later versions of Hudi (0.13) and not with 0.9?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Is this the expected number of S3 calls? [hudi]

2023-10-03 Thread via GitHub


ad1happy2go commented on issue #9612:
URL: https://github.com/apache/hudi/issues/9612#issuecomment-1744535879

   There are quite a few after 0.11. Examples:
   https://github.com/apache/hudi/pull/7404
   https://github.com/apache/hudi/pull/7436


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] NotSerializableException using SparkRDDWriteClient with OCC and DynamoDBBasedLockProvider [hudi]

2023-10-03 Thread via GitHub


ad1happy2go commented on issue #9807:
URL: https://github.com/apache/hudi/issues/9807#issuecomment-1744517639

   @ehurheap Did you try the same lock configuration with a normal insert on a test table to ensure the configurations are good?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1744462777

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
 
   * b71be9bbde91fb6afeb52dfc102ba789963b66ff Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20203)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes [hudi]

2023-10-03 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-173328

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
 
   * b71be9bbde91fb6afeb52dfc102ba789963b66ff UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org