Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2088019252

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * da5bbcce94223f796d6e40c2a20daeff43794993 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23588)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-4372] Enable matadata table by default for flink [hudi]

2024-04-30 Thread via GitHub


danny0405 commented on code in PR #11124:
URL: https://github.com/apache/hudi/pull/11124#discussion_r1585900571


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -554,8 +554,7 @@ protected void postCommit(HoodieTable table, HoodieCommitMetadata metadata, Stri
    */
   protected void mayBeCleanAndArchive(HoodieTable table) {
     autoCleanOnCommit();
-    // reload table to that timeline reflects the clean commit
-    autoArchiveOnCommit(createTable(config, hadoopConf));
+    autoArchiveOnCommit(table);

Review Comment:
   This is unnecessary; we just need to refresh the table's metadata active timeline.






Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087986187

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * 86f618e91d63ed5da3b16dbe5e71c00e5546e8cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23585)
 
   * da5bbcce94223f796d6e40c2a20daeff43794993 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23588)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-4372] Enable matadata table by default for flink [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11124:
URL: https://github.com/apache/hudi/pull/11124#issuecomment-2087986151

   
   ## CI report:
   
   * 33909835f589e444771c8c9c6e5bdec15785e397 UNKNOWN
   * 13d4b2235ffd4671b6573996b0f7ac3052226ad0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23586)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2087985985

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * de51f5efb052c32725b5eeb97773133d8c98498f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
 
   * b0a72b4e20ebaa87727b44642d5d11f5f78021bf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23587)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087981370

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * 86f618e91d63ed5da3b16dbe5e71c00e5546e8cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23585)
 
   * da5bbcce94223f796d6e40c2a20daeff43794993 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2087981153

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * de51f5efb052c32725b5eeb97773133d8c98498f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
 
   * b0a72b4e20ebaa87727b44642d5d11f5f78021bf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] The Hive run_sync_tool's Logged Command & The Actual Command Do Not Match [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11029:
URL: https://github.com/apache/hudi/issues/11029#issuecomment-2087977490

   @samserpoosh Were you able to work on this PR? Do let us know.





Re: [I] Recovering job from checkpoint, reporting NoSuchElementException and data exception [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11023:
URL: https://github.com/apache/hudi/issues/11023#issuecomment-2087976183

   @jack1234smith Were you able to figure out the issue here? Please let us know 
in case you still need help.





Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2087975711

   @juice411 Do you need any other help on this? Please let us know if you are 
all set. Thanks.





Re: [I] [SUPPORT] Spark job relying over Hudi are blocked after one or zero commit [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11011:
URL: https://github.com/apache/hudi/issues/11011#issuecomment-2087975030

   @pontisa95 Were you able to get it resolved? If yes, please share the issue 
and resolution; otherwise, let us know if you still need help here.





Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2087971292

   @liangchen-datanerd That's a good suggestion. Created a tracking JIRA too 
- https://issues.apache.org/jira/browse/HUDI-7698
   
   We can think of introducing a reader-side config to enable this. Since the 
original data is present in the parquet files, it should not be a challenge. 
Feel free to contribute if you are interested. Thanks.





[jira] [Created] (HUDI-7698) Introduce config to Return the original partition value from parquet when using CustomKeyGenerator

2024-04-30 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-7698:
---

 Summary: Introduce config to Return the original partition value 
from parquet when using CustomKeyGenerator
 Key: HUDI-7698
 URL: https://issues.apache.org/jira/browse/HUDI-7698
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: Aditya Goenka
 Fix For: 1.1.0


Github Issue - [https://github.com/apache/hudi/issues/11002]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7697) Add branch protection in GitHub

2024-04-30 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7697:
---

 Summary: Add branch protection in GitHub
 Key: HUDI-7697
 URL: https://issues.apache.org/jira/browse/HUDI-7697
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7697) Add branch protection in GitHub

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7697:

Description: Only allow PR merging when all CI checks pass.

> Add branch protection in GitHub
> ---
>
> Key: HUDI-7697
> URL: https://issues.apache.org/jira/browse/HUDI-7697
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> Only allow PR merging when all CI checks pass.





[jira] [Assigned] (HUDI-7473) Rebalance CI

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7473:
---

Assignee: Ethan Guo

> Rebalance CI
> 
>
> Key: HUDI-7473
> URL: https://issues.apache.org/jira/browse/HUDI-7473
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7473) Rebalance CI

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7473:

Epic Link: HUDI-4302

> Rebalance CI
> 
>
> Key: HUDI-7473
> URL: https://issues.apache.org/jira/browse/HUDI-7473
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Closed] (HUDI-7473) Rebalance CI

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7473.
---
Resolution: Fixed

> Rebalance CI
> 
>
> Key: HUDI-7473
> URL: https://issues.apache.org/jira/browse/HUDI-7473
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7473) Rebalance CI

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7473:

Fix Version/s: 0.15.0
   1.0.0

> Rebalance CI
> 
>
> Key: HUDI-7473
> URL: https://issues.apache.org/jira/browse/HUDI-7473
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






Re: [PR] [HUDI-4372] Enable matadata table by default for flink [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11124:
URL: https://github.com/apache/hudi/pull/11124#issuecomment-2087946878

   
   ## CI report:
   
   * 33909835f589e444771c8c9c6e5bdec15785e397 UNKNOWN
   * d18ce474faa16547a8969cd56f67dfed5b80891a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23573)
 
   * 13d4b2235ffd4671b6573996b0f7ac3052226ad0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23586)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-4372] Enable matadata table by default for flink [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11124:
URL: https://github.com/apache/hudi/pull/11124#issuecomment-2087942409

   
   ## CI report:
   
   * 33909835f589e444771c8c9c6e5bdec15785e397 UNKNOWN
   * d18ce474faa16547a8969cd56f67dfed5b80891a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23573)
 
   * 13d4b2235ffd4671b6573996b0f7ac3052226ad0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087938312

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * 86f618e91d63ed5da3b16dbe5e71c00e5546e8cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23585)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6712:
-
Status: Open  (was: Patch Available)

> Implement optimized keyed lookup on parquet files
> -
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records based 
> on a single key lookup column. 
> e.g: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g: List lookup(parquetFile, Set keys) (code) 
> Let's implement a reader that is optimized for this pattern, by scanning the 
> least amount of data. 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to 
> minimize data read. 
> 5. Must do the minimum amount of RPC calls to cloud storage.
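The lookup pattern in the requirements above can be sketched in miniature. This is a hedged illustration only, not the proposed parquet reader: `sorted_records` is a hypothetical stand-in for a key-sorted file, and binary search finds each key's contiguous run, which covers requirements 1–3 (duplicate keys, sorted input, absent keys).

```python
import bisect

def lookup(sorted_records, keys):
    """Return all records whose key is in `keys`, given records sorted by key.
    Supports multiple records per key and silently skips absent keys."""
    ks = [k for k, _ in sorted_records]       # key column, already sorted
    out = []
    for key in sorted(set(keys)):
        lo = bisect.bisect_left(ks, key)      # first position where key could be
        hi = bisect.bisect_right(ks, key)     # one past the last matching position
        out.extend(sorted_records[lo:hi])     # empty slice when key is absent
    return out

records = [("a", 1), ("a", 2), ("b", 3), ("d", 4)]
print(lookup(records, {"a", "c"}))            # [('a', 1), ('a', 2)]
```

A real implementation would replace the in-memory list with parquet row-group pruning via bloom filters and column indexes (requirement 4), batching reads to minimize storage RPCs (requirement 5).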





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087911762

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23582)
 
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * 86f618e91d63ed5da3b16dbe5e71c00e5546e8cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23585)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087906539

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23582)
 
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   * 86f618e91d63ed5da3b16dbe5e71c00e5546e8cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087901993

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23582)
 
   * b1598f5861c2b90da91ad33dc360533728ef7163 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch master updated (a29fe277df8 -> f553ba25fe3)

2024-04-30 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from a29fe277df8 [HUDI-7694] Unify bijection-avro dependency version 
(#11132)
 add f553ba25fe3 [HUDI-7144] Build storage partition stats index and use it 
for data skipping (#10352)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   8 +
 .../metadata/HoodieBackedTableMetadataWriter.java  |  87 ++---
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   3 +-
 .../hudi/common/config/HoodieMetadataConfig.java   |  48 +
 .../org/apache/hudi/common/util/BaseFileUtils.java |  54 ++
 .../org/apache/hudi/common/util/ParquetUtils.java  |  19 +-
 .../hudi/metadata/HoodieMetadataPayload.java   |  43 -
 .../hudi/metadata/HoodieTableMetadataUtil.java | 184 +-
 .../hudi/metadata/MetadataPartitionType.java   |  11 ++
 .../apache/hudi/common/util/TestBaseFileUtils.java |  84 
 .../hudi/metadata/TestHoodieMetadataPayload.java   |  45 +
 .../hudi/metadata/TestHoodieTableMetadataUtil.java |  74 +++
 .../hudi/metadata/TestMetadataPartitionType.java   |   4 +
 .../scala/org/apache/hudi/HoodieFileIndex.scala|   1 +
 .../apache/hudi/PartitionStatsIndexSupport.scala   |  65 +++
 .../analysis/HoodiePruneFileSourcePartitions.scala |   4 +-
 .../TestDataSkippingWithMORColstats.java   |   3 +
 .../org/apache/hudi/TestHoodieSparkUtils.scala |  10 +
 ...ase.scala => PartitionStatsIndexTestBase.scala} | 178 -
 .../hudi/functional/TestPartitionStatsIndex.scala  | 214 +
 .../TestPartitionStatsIndexWithSql.scala   | 106 ++
 .../hudi/dml/TestHoodieTableValuedFunction.scala   |  69 ++-
 22 files changed, 1120 insertions(+), 194 deletions(-)
 create mode 100644 
hudi-common/src/test/java/org/apache/hudi/common/util/TestBaseFileUtils.java
 create mode 100644 
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionStatsIndexSupport.scala
 copy 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/{RecordLevelIndexTestBase.scala
 => PartitionStatsIndexTestBase.scala} (58%)
 create mode 100644 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala
 create mode 100644 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


codope merged PR #10352:
URL: https://github.com/apache/hudi/pull/10352





[jira] [Updated] (HUDI-6700) Archiving should be time based, not this min-max and not per instant. Lets treat it like a log (Phase 2)

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6700:
-
Status: Open  (was: In Progress)

> Archiving should be time based, not this min-max and not per instant. Lets 
> treat it like a log (Phase 2)
> 
>
> Key: HUDI-6700
> URL: https://issues.apache.org/jira/browse/HUDI-6700
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>






(hudi) branch asf-site updated: [DOCS] Add Daft read example (#11133)

2024-04-30 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6c29efa83d9 [DOCS] Add Daft read example (#11133)
6c29efa83d9 is described below

commit 6c29efa83d95f46206510c32ac050f93c3e10e60
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Apr 30 21:08:16 2024 -0500

[DOCS] Add Daft read example (#11133)
---
 website/docs/reading_tables_batch_reads.md | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/website/docs/reading_tables_batch_reads.md 
b/website/docs/reading_tables_batch_reads.md
index 2055a9685df..d247fd4c3d0 100644
--- a/website/docs/reading_tables_batch_reads.md
+++ b/website/docs/reading_tables_batch_reads.md
@@ -18,3 +18,18 @@ val tripsDF = spark.read.
 load(basePath)
 tripsDF.where(tripsDF.fare > 20.0).show()
 ```
+
+## Daft
+
+[Daft](https://www.getdaft.io/) supports reading Hudi tables using 
`daft.read_hudi()` function.
+
+```Python
+# Read Apache Hudi table into a Daft DataFrame.
+import daft
+
+df = daft.read_hudi("some-table-uri")
+df = df.where(df["foo"] > 5)
+df.show()
+```
+
+Check out the Daft docs for [Hudi 
integration](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/hudi.html).



Re: [PR] [DOCS] Add Daft read example [hudi]

2024-04-30 Thread via GitHub


xushiyan merged PR #11133:
URL: https://github.com/apache/hudi/pull/11133





[PR] [DOCS] Add Daft read example [hudi]

2024-04-30 Thread via GitHub


xushiyan opened a new pull request, #11133:
URL: https://github.com/apache/hudi/pull/11133

   (no comment)





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087861615

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23582)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4372) Enable matadata table by default for flink

2024-04-30 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4372:
-
Sprint: Sprint 2023-04-26

> Enable matadata table by default for flink
> --
>
> Key: HUDI-4372
> URL: https://issues.apache.org/jira/browse/HUDI-4372
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, metadata
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-4372) Enable matadata table by default for flink

2024-04-30 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4372:
-
Status: In Progress  (was: Reopened)

> Enable matadata table by default for flink
> --
>
> Key: HUDI-4372
> URL: https://issues.apache.org/jira/browse/HUDI-4372
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, metadata
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






(hudi) branch master updated (f99b181a04e -> a29fe277df8)

2024-04-30 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from f99b181a04e [HUDI-7588] Replace hadoop Configuration with 
StorageConfiguration in meta client (#11071)
 add a29fe277df8 [HUDI-7694] Unify bijection-avro dependency version 
(#11132)

No new revisions were added by this update.

Summary of changes:
 hudi-examples/hudi-examples-flink/pom.xml | 2 +-
 hudi-utilities/pom.xml| 2 +-
 packaging/hudi-integ-test-bundle/pom.xml  | 2 +-
 pom.xml   | 1 +
 4 files changed, 4 insertions(+), 3 deletions(-)



Re: [PR] [HUDI-7694] Unify bijection-avro dependency version [hudi]

2024-04-30 Thread via GitHub


yihua merged PR #11132:
URL: https://github.com/apache/hudi/pull/11132





Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585778411


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala:
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.model.WriteOperationType
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.testutils.HoodieTestUtils
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+import org.junit.jupiter.api.Tag
+
+import scala.collection.JavaConverters._
+
+@Tag("functional")
+class TestPartitionStatsIndexWithSql extends HoodieSparkSqlTestBase {
+
+  val sqlTempTable = "hudi_tbl"
+
+  test("Test partition stats index following insert, merge into, update and 
delete") {
+withTempDir { tmp =>
+  val tableName = generateTableName
+  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
+  // Create table with date type partition
+  spark.sql(
+s"""
+   | create table $tableName using hudi
+   | partitioned by (dt)
+   | tblproperties(
+   |primaryKey = 'id',
+   |preCombineField = 'ts',
+   |'hoodie.metadata.index.partition.stats.enable' = 'true'
+   | )
+   | location '$tablePath'
+   | AS
+   | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
+ """.stripMargin
+  )
+
+  assertResult(WriteOperationType.BULK_INSERT) {
+HoodieSparkSqlTestBase.getLastCommitMetadata(spark, 
tablePath).getOperationType
+  }
+  checkAnswer(s"select id, name, price, ts, cast(dt as string) from 
$tableName")(
+Seq(1, "a1", 10, 1000, "2021-05-06")
+  )
+
+  val partitionValue = "2021-05-06"
+
+  // Check the missing properties for spark sql
+  val metaClient = HoodieTableMetaClient.builder()
+.setBasePath(tablePath)
+.setConf(HoodieTestUtils.getDefaultStorageConf)
+.build()
+  val properties = metaClient.getTableConfig.getProps.asScala.toMap
+  
assertResult(true)(properties.contains(HoodieTableConfig.CREATE_SCHEMA.key))
+  assertResult("dt")(properties(HoodieTableConfig.PARTITION_FIELDS.key))
+  assertResult("ts")(properties(HoodieTableConfig.PRECOMBINE_FIELD.key))
+  assertResult(tableName)(metaClient.getTableConfig.getTableName)
+
+  // Test insert into
+  spark.sql(s"insert into $tableName values(2, 'a2', 10, 1000, 
cast('$partitionValue' as date))")
+  checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(

Review Comment:
   I have it in two separate tests - `testPartitionStatsWithPartitionBy` and 
`testPartitionStatsIndexFilePruning`






Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585777410


##
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##
@@ -67,6 +70,61 @@ public static BaseFileUtils getInstance(HoodieFileFormat 
fileFormat) {
 throw new UnsupportedOperationException(fileFormat.name() + " format not 
supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileColumnRanges List of column range statistics for each file in 
a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileColumnRanges) {
+if (fileColumnRanges.size() == 1) {
+  // Only one parquet file, we can just return that range.
+  return fileColumnRanges.get(0);
+}
+// There are multiple files. Compute min(file_mins) and max(file_maxs)
+return fileColumnRanges.stream()
+.sequential()
+.reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,

Review Comment:
   We follow the same approach as for colstats - 
https://github.com/apache/hudi/blob/f99b181a04e2ee549009d38deca7597219c4ce46/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java#L372
   If the new record is a delete, we ignore the previous one. If the previous 
record is a delete, we take the newer one; otherwise we just merge the stats. I 
don't think we do event-time-based merging for MDT.
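
   As a rough illustration of those merge semantics, here is a simplified 
sketch (the `ColumnRange` record and its fields are assumptions for this 
sketch, not the actual `HoodieColumnRangeMetadata`/`HoodieMetadataPayload` 
API): a newer delete supersedes the stale stats, a previous delete is replaced 
by the newer record, and otherwise the two ranges merge as min(mins)/max(maxes) 
with counts accumulating:

```java
public class ColumnStatsMergeSketch {
    // Minimal stand-in for Hudi's column range metadata; field names are
    // illustrative assumptions for this sketch, not the real API.
    record ColumnRange<T extends Comparable<T>>(String column, T min, T max,
                                                long valueCount, boolean isDeleted) {}

    static <T extends Comparable<T>> ColumnRange<T> merge(ColumnRange<T> previous,
                                                          ColumnRange<T> latest) {
        if (latest.isDeleted()) {
            return latest;  // newer record is a delete: drop the stale stats
        }
        if (previous.isDeleted()) {
            return latest;  // previous entry was a delete: take the new stats
        }
        // Otherwise merge: min of mins, max of maxes, counts accumulate.
        T min = previous.min().compareTo(latest.min()) <= 0 ? previous.min() : latest.min();
        T max = previous.max().compareTo(latest.max()) >= 0 ? previous.max() : latest.max();
        return new ColumnRange<>(latest.column(), min, max,
                previous.valueCount() + latest.valueCount(), false);
    }

    public static void main(String[] args) {
        ColumnRange<Integer> prev = new ColumnRange<>("price", 10, 50, 100, false);
        ColumnRange<Integer> next = new ColumnRange<>("price", 5, 40, 80, false);
        ColumnRange<Integer> merged = merge(prev, next);
        System.out.println(merged.min() + ".." + merged.max() + " count=" + merged.valueCount());
        // -> 5..50 count=180
    }
}
```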






[jira] [Created] (HUDI-7696) Consolidate convertFilesToPartitionStatsRecords and convertMetadataToPartitionStatsRecords

2024-04-30 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7696:
-

 Summary: Consolidate convertFilesToPartitionStatsRecords and 
convertMetadataToPartitionStatsRecords
 Key: HUDI-7696
 URL: https://issues.apache.org/jira/browse/HUDI-7696
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit
 Fix For: 1.0.0


https://github.com/apache/hudi/pull/10352#discussion_r1584149612





Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585774319


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1872,4 +1883,175 @@ public HoodieRecord next() {
   }
 };
   }
+
+  public static HoodieData<HoodieRecord> 
convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,
+ 
List<DirectoryInfo> partitionInfoList,
+ 
HoodieMetadataConfig metadataConfig,
+ 
HoodieTableMetaClient dataTableMetaClient) {
+final List<String> columnsToIndex = 
metadataConfig.getColumnsEnabledForColumnStatsIndex();
+if (columnsToIndex.isEmpty()) {
+  return engineContext.emptyHoodieData();
+}
+LOG.debug("Indexing following columns for partition stats index: {}", 
columnsToIndex);
+// Create records for MDT
+int parallelism = Math.max(Math.min(partitionInfoList.size(), 
metadataConfig.getPartitionStatsIndexParallelism()), 1);
+return engineContext.parallelize(partitionInfoList, 
parallelism).flatMap(partitionInfo -> {
+  final String partitionPath = partitionInfo.getRelativePath();
+  // Step 1: Collect Column Metadata for Each File (Your existing code 
does this)
+  List<List<HoodieColumnRangeMetadata<Comparable>>> fileColumnMetadata = 
partitionInfo.getFileNameToSizeMap().keySet().stream()
+  .map(fileName -> getFileStatsRangeMetadata(partitionPath, 
partitionPath + "/" + fileName, dataTableMetaClient, columnsToIndex, false))
+  .collect(toList());
+  // Step 2: Flatten and Group by Column Name
+  Map<String, List<HoodieColumnRangeMetadata<Comparable>>> 
columnMetadataMap = fileColumnMetadata.stream()
+  .flatMap(List::stream) // Flatten the list
+  
.collect(Collectors.groupingBy(HoodieColumnRangeMetadata::getColumnName, 
toList())); // Group by column name
+  // Step 3: Aggregate Column Ranges
+  Stream<HoodieColumnRangeMetadata<Comparable>> 
partitionStatsRangeMetadata = columnMetadataMap.entrySet().stream()
+  .map(entry -> 
BaseFileUtils.getColumnRangeInPartition(entry.getValue()));
+  return HoodieMetadataPayload.createPartitionStatsRecords(partitionPath, 
partitionStatsRangeMetadata.collect(toList()), false).iterator();
+});
+  }
+
+  private static List<HoodieColumnRangeMetadata<Comparable>> 
getFileStatsRangeMetadata(String partitionPath,
+   
String filePath,
+   
HoodieTableMetaClient datasetMetaClient,
+   
List columnsToIndex,
+   
boolean isDeleted) {
+String filePartitionPath = filePath.startsWith("/") ? 
filePath.substring(1) : filePath;
+String fileName = FSUtils.getFileName(filePath, partitionPath);
+if (isDeleted) {
+  return columnsToIndex.stream()
+  .map(entry -> HoodieColumnRangeMetadata.stub(fileName, entry))
+  .collect(Collectors.toList());
+}
+return readColumnRangeMetadataFrom(filePartitionPath, datasetMetaClient, 
columnsToIndex);
+  }
+
+  public static HoodieData<HoodieRecord> 
convertMetadataToPartitionStatsRecords(HoodieCommitMetadata commitMetadata,

Review Comment:
   Ack, HUDI-7696 to track.






Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585773581


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1872,4 +1883,175 @@ public HoodieRecord next() {
   }
 };
   }
+
+  public static HoodieData<HoodieRecord> 
convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,
+ 
List<DirectoryInfo> partitionInfoList,

Review Comment:
   Yes, this is just for initialization. We just log a warning and continue with 
the other file groups, which is what colstats also does. However, given that 
partition stats will be enabled by default, I think it makes sense to throw an 
exception. Will do it.






Re: [PR] [HUDI-7694] Unify bijection-avro dependency version [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11132:
URL: https://github.com/apache/hudi/pull/11132#issuecomment-2087801564

   
   ## CI report:
   
   * f5f72a318977302fc3828831c150f41690e2504c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23581)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087795216

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 695095976531b603d8d5712a8acc163eb1824f9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23579)
 
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23582)
 
   
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087789246

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 695095976531b603d8d5712a8acc163eb1824f9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23579)
 
   * 07c2396a64d505633ac103cf2bcd4c6dc992fb81 UNKNOWN
   
   





Re: [PR] [HUDI-7694] Unify bijection-avro dependency version [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11132:
URL: https://github.com/apache/hudi/pull/11132#issuecomment-2087783819

   
   ## CI report:
   
   * f5f72a318977302fc3828831c150f41690e2504c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23581)
 
   
   





[jira] [Updated] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7694:

Status: Patch Available  (was: In Progress)

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7695) Add docs on Spark 3.5 and Scala 2.13

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7695:

Status: In Progress  (was: Open)

> Add docs on Spark 3.5 and Scala 2.13
> 
>
> Key: HUDI-7695
> URL: https://issues.apache.org/jira/browse/HUDI-7695
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7695) Add docs on Spark 3.5 and Scala 2.13

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7695:

Fix Version/s: 0.15.0
   1.0.0

> Add docs on Spark 3.5 and Scala 2.13
> 
>
> Key: HUDI-7695
> URL: https://issues.apache.org/jira/browse/HUDI-7695
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Created] (HUDI-7695) Add docs on Spark 3.5 and Scala 2.13

2024-04-30 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7695:
---

 Summary: Add docs on Spark 3.5 and Scala 2.13
 Key: HUDI-7695
 URL: https://issues.apache.org/jira/browse/HUDI-7695
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Assigned] (HUDI-7695) Add docs on Spark 3.5 and Scala 2.13

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7695:
---

Assignee: Ethan Guo

> Add docs on Spark 3.5 and Scala 2.13
> 
>
> Key: HUDI-7695
> URL: https://issues.apache.org/jira/browse/HUDI-7695
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






Re: [PR] [HUDI-7694] Unify bijection-avro dependency version [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11132:
URL: https://github.com/apache/hudi/pull/11132#issuecomment-2087744917

   
   ## CI report:
   
   * f5f72a318977302fc3828831c150f41690e2504c UNKNOWN
   
   





Re: [PR] [HUDI-7587] Make bundle dependencies for storage abstraction in correct order [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11131:
URL: https://github.com/apache/hudi/pull/11131#issuecomment-2087738011

   
   ## CI report:
   
   * 70e6f707c00ef7c84047c445a5c3be8b8aae2c75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23580)
 
   
   





[jira] [Updated] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7694:
-
Labels: pull-request-available  (was: )

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[PR] [HUDI-7694] Unify bijection-avro dependency version [hudi]

2024-04-30 Thread via GitHub


yihua opened a new pull request, #11132:
URL: https://github.com/apache/hudi/pull/11132

   ### Change Logs
   
   This PR unifies the `bijection-avro` dependency version in the repo and 
upgrades the dependency version in `hudi-integ-test-bundle` (there is no reason 
to use a different version).  Also note that `bijection-avro:0.9.7` supports 
Scala 2.13, which makes Scala 2.13 support easier.
   
   ### Impact
   
   Dependency management improvement.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Assigned] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7694:
---

Assignee: Ethan Guo

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7694:

Fix Version/s: 0.15.0
   1.0.0

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7694:

Status: In Progress  (was: Open)

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Created] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7694:
---

 Summary: Unify bijection-avro dependency version
 Key: HUDI-7694
 URL: https://issues.apache.org/jira/browse/HUDI-7694
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7694) Unify bijection-avro dependency version

2024-04-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7694:

Story Points: 0.5

> Unify bijection-avro dependency version
> ---
>
> Key: HUDI-7694
> URL: https://issues.apache.org/jira/browse/HUDI-7694
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>






Re: [PR] [HUDI-7587] Make bundle dependencies for storage abstraction in correct order [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11131:
URL: https://github.com/apache/hudi/pull/11131#issuecomment-2087732209

   
   ## CI report:
   
   * 70e6f707c00ef7c84047c445a5c3be8b8aae2c75 UNKNOWN
   
   





[jira] [Updated] (HUDI-7587) Move hadoop-dependent reader and writer implementation to hudi-hadoop-common module

2024-04-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7587:
-
Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Move hadoop-dependent reader and writer implementation to hudi-hadoop-common 
> module
> ---
>
> Key: HUDI-7587
> URL: https://issues.apache.org/jira/browse/HUDI-7587
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[PR] [HUDI-7587] Make bundle dependencies for storage abstraction in correct order [hudi]

2024-04-30 Thread via GitHub


jonvex opened a new pull request, #11131:
URL: https://github.com/apache/hudi/pull/11131

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087675008

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * 695095976531b603d8d5712a8acc163eb1824f9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23579)
 
   
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087657586

   
   ## CI report:
   
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * a8997fbab4049f052cbd1fe216a8cb5fe375c5d1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23578)
 
   * 695095976531b603d8d5712a8acc163eb1824f9b UNKNOWN
   
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087518121

   
   ## CI report:
   
   * 2ea5169f6e25c154748401c49ffd7d3177c50660 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23577)
 
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   * a8997fbab4049f052cbd1fe216a8cb5fe375c5d1 UNKNOWN
   
   





Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087489196

   
   ## CI report:
   
   * 2ea5169f6e25c154748401c49ffd7d3177c50660 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23577)
 
   * edf2bf30a2ddbd48db9452f34b1ac716bd2ebe18 UNKNOWN
   
   





[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:26 PM:
---

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W while minimizing 
the amount of work redone on the writer side, we need to introduce a notion of 
"pointer data blocks" (name TBD) and a new "retain" command block in Hudi's 
log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in different file 
groups. 
{code:java}
pointer data block {
   [
{fg1, logfile X, ..},
{fg2, logfile Y, ..},
{fg3, logfile Z, ..}
   ]
}
{code}
In this approach, instead of redistributing the records from file groups 
f1, f2, f3 to f4, f5, we will:
 * log pointer data blocks to f4, f5, pointing back to the new log files W wrote 
to file groups f1, f2, f3 (X, Y, Z in the example above)
 * log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3 (with 
logfile Z) to indicate to the cleaner service that these log files are 
supposed to be retained for later, so they can be skipped
 * when f4, f5 are compacted, these log files will be reconciled into new file 
slices in f4, f5
 * when the file slices in f4 and f5 are cleaned, the pointers are followed and 
the retained log files in f1, f2, f3 can also be cleaned (note: this needs some 
reference counting, in case f4 is cleaned while f5 is not)
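The reference counting flagged in the note above can be sketched roughly as follows. This is a minimal in-memory illustration with hypothetical names, not actual Hudi APIs: a retained log file becomes cleanable only once every output file group whose pointer data block references it has itself been cleaned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count, per retained log file, how many pointer data
// blocks still reference it, and report files whose last reference is gone.
class RetainedLogRefCounter {
  // retained log file -> number of pointer data blocks still referencing it
  private final Map<String, Integer> refCounts = new HashMap<>();

  // Called when a pointer data block referencing these log files is logged
  // to an output file group (e.g. f4 or f5).
  public void addPointers(List<String> retainedLogFiles) {
    for (String logFile : retainedLogFiles) {
      refCounts.merge(logFile, 1, Integer::sum);
    }
  }

  // Called when a pointing file slice is cleaned; returns the log files
  // whose last reference just dropped, i.e. those now safe to delete.
  public List<String> releasePointers(List<String> retainedLogFiles) {
    List<String> cleanable = new ArrayList<>();
    for (String logFile : retainedLogFiles) {
      int remaining = refCounts.merge(logFile, -1, Integer::sum);
      if (remaining <= 0) {
        refCounts.remove(logFile);
        cleanable.add(logFile);
      }
    }
    return cleanable;
  }
}
```

A real implementation would need this bookkeeping to be crash-safe (e.g. reconstructed from the timeline and log blocks) rather than held in memory; the sketch only shows the counting logic.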

Note that there is a 1:N relationship between a pointer data block and log files, 
e.g. both f4's and f5's pointer data blocks will point to all three files 
X, Y, Z. 

A snapshot read of the file groups f4, f5 needs to carefully filter records to 
avoid exposing duplicate records to the reader. Such a read: 


 * merges the log files pointed to by pointer data blocks based on keys, i.e. it 
only deletes/updates records in the base file for keys that match records in the 
pointed log files.
 * needs to handle inserts with care: they must be distinguishable from updates, 
e.g. a record in the pointed log files for f4 can either be an insert or an 
update to a record that is now clustered into f5.
 * requires the storage format to also store a bitmap indicating which records in 
the pointed-to log files are inserts vs updates. Once such a list is available, 
the readers of f4/f5 can split the inserts amongst themselves using a hash mod 
of the insert key, e.g. f4 will get inserts 0, 2, 4, 6 ... in log files 
X, Y, Z, while f5 will get inserts 1, 3, 5, 7 in log files X, Y, Z
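The hash-mod split of inserts among the output file groups can be sketched as below (a hypothetical helper, not a Hudi API). Each reader keeps only the inserts whose key hashes to its own slot, so every insert from the pointed log files is exposed by exactly one of the output file groups (f4 = slot 0, f5 = slot 1 for n = 2).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: deterministic, disjoint assignment of insert keys
// to one of n output file groups using a hash mod of the key.
class InsertSplitter {
  public static List<String> insertsForGroup(List<String> insertKeys, int groupIndex, int numGroups) {
    List<String> mine = new ArrayList<>();
    for (String key : insertKeys) {
      // Math.floorMod guards against negative hashCode values
      if (Math.floorMod(key.hashCode(), numGroups) == groupIndex) {
        mine.add(key);
      }
    }
    return mine;
  }
}
```

Because the assignment depends only on the key and the (fixed) number of output file groups, every snapshot reader computes the same partitioning without coordination.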



h3. Pros: 
 * Truly non-blocking: C and W can go at their own pace and complete without much 
additional overhead, since the pointer blocks are pretty light to add. 
 * Works even for high throughput streaming scenarios where there is not much 
time for the writer to be reconciling (e.g. Flink)

h3. Cons:
 * More merge costs on the read/query side (although these can be very tolerable 
for the same cases approach 1 works well for)
 * More complexity and new storage format changes. 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

if we truly wish to achieve, independent operations of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in a different file 
groups. 


{code:java}
pointer data block {
   [
{fg1, logfileX, ..},
{fg2, logfileY, ..},
{fg3, logfileX, ..}
   ]
}
{code}

In this approach, instead of redistributing the records from 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer w writing to file groups f1, f2, f3, 
> concurrently while a clustering service C  reclusters them into  f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly the sort order achieved by clustering, in face of updates (e.g 
> updates change clustering field values, causing output clustering file groups 
> to be not fully sorted 

[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:01 PM:
---

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

if we truly wish to achieve, independent operations of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in a different file 
groups. 


{code:java}
pointer data block {
   [
{fg1, logfileX, ..},
{fg2, logfileY, ..},
{fg3, logfileX, ..}
   ]
}
{code}

In this approach, instead of redistributing the records from 

 

 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

if we truly wish to achieve, independent operations of C and W, without 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer w writing to file groups f1, f2, f3, 
> concurrently while a clustering service C  reclusters them into  f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly the sort order achieved by clustering, in face of updates (e.g 
> updates change clustering field values, causing output clustering file groups 
> to be not fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1045:
-
Description: 
h4. We need to allow a writer W writing to file groups f1, f2, f3, concurrently 
while a clustering service C reclusters them into f4, f5. 

Goals
 * Writes can be either updates, deletes or inserts. 
 * Either clustering C or the writer W can finish first
 * Both W and C need to be able to complete their actions without much redoing 
of work. 
 * The number of output file groups for C can be higher or lower than input 
file groups. 
 * Need to work across and be oblivious to whether the writers are operating in 
OCC or NBCC modes
 * Needs to interplay well with cleaning and compaction services.



h4. Non-goals 
 * Strictly the sort order achieved by clustering, in face of updates (e.g 
updates change clustering field values, causing output clustering file groups 
to be not fully sorted by those fields)

  was:
We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
while a clustering service C  reclusters them into  f4, f5. 
 * Writes can be either updates, deletes or inserts. 
 * Either clustering C or the writer W can finish first
 * Both W and C need to be able to complete their actions without much redoing 
of work. 
 * The number of output file groups for C can be higher or lower than input 
file groups. 
 * Need to work across and be oblivious to whether the writers are operating in 
OCC or NBCC modes
 * Needs to interplay well with cleaning and compaction services.


> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer w writing to file groups f1, f2, f3, 
> concurrently while a clustering service C  reclusters them into  f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly the sort order achieved by clustering, in face of updates (e.g 
> updates change clustering field values, causing output clustering file groups 
> to be not fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table-level distributed lock), we 
could have either W or C perform the following. 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
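The first step of both finalize routines above, identifying the conflicting file groups, boils down to a set intersection between the file groups W wrote to and the input file groups of C. A minimal sketch with hypothetical names (not actual Hudi APIs):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the conflicting file groups are exactly those that
// both the writer W touched and the clustering plan of C took as input.
class ConflictDetector {
  public static Set<String> conflictingFileGroups(Set<String> writtenByW, Set<String> clusteredInputsOfC) {
    Set<String> conflicts = new HashSet<>(writtenByW);
    conflicts.retainAll(clusteredInputsOfC);
    return conflicts;
  }
}
```

Only the records W wrote into this intersection need to be read back and redistributed; records in non-conflicting file groups are untouched.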
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries, W/C absorbs the cost. 

{*}Cons{*}:
 # Can be pretty wasteful with continuous writers or with high overlap between C 
and W, effectively forcing the entire write to be redone (same as a writer 
failing and retrying today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files with incoming records. 

 

 

 


was (Author: vc):
h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where 
 # Absorbs any read amplification. 

h3.  

Cons:
 # sort order may be disturbed from the re-distribtion of keys. 
 # Can be pretty wasteful, if 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries, W/C absorbs the cost. 

{*}Cons{*}:
 # Can be pretty wasteful with continuous writers or with high overlap between C 
and W, effectively forcing the entire write to be redone (same as a writer 
failing and retrying today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files with incoming records. 


was (Author: vc):
h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries, W/C absorbs tha cost. 

{*}Cons{*}:
 # Can be pretty wasteful in continuous writers or with high overlap between C 
and W, forcing the entire write to be redone effectively (same as writer 
failing and retrying like today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files, with incoming records. 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-30 Thread via GitHub


TarunMootala commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2087065842

   `.hoodie/` folder is 350 MB and it has 3435 files (this includes active and 
archival timelines)
   `.hoodie/archived/` is 327 MB and it has 695 files (only archival timelines)
   
   Attached driver logs
   
[log-events-viewer-result.csv](https://github.com/apache/hudi/files/15170629/log-events-viewer-result.csv)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2087010994

   
   ## CI report:
   
   * 2ea5169f6e25c154748401c49ffd7d3177c50660 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23577)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11130:
URL: https://github.com/apache/hudi/pull/11130#issuecomment-2086970922

   
   ## CI report:
   
   * 2ea5169f6e25c154748401c49ffd7d3177c50660 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-6296] Add Scala 2.13 support for Spark 3.5 integration [hudi]

2024-04-30 Thread via GitHub


yihua opened a new pull request, #11130:
URL: https://github.com/apache/hudi/pull/11130

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6296) Add Scala 2.13 build profile to support scala 2.13

2024-04-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6296:
-
Labels: pull-request-available  (was: )

> Add Scala 2.13 build profile to support scala 2.13
> --
>
> Key: HUDI-6296
> URL: https://issues.apache.org/jira/browse/HUDI-6296
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Aditya Goenka
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7146] Implement secondary index [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11129:
URL: https://github.com/apache/hudi/pull/11129#issuecomment-2086577705

   
   ## CI report:
   
   * 0274004b842a332f57c1104de44e4e262ff2942d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23576)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]

2024-04-30 Thread via GitHub


sgcisco commented on issue #8:
URL: https://github.com/apache/hudi/issues/8#issuecomment-2086534021

   @ad1happy2go thanks for your reply. We tried `compact num.delta commits` as 1 
in one of the tests; for the other runs, and in what we try to use now, it is 
the default value, which is 5.
   
   As another test attempt we tried to run a pipeline over several days but 
with lower ingestion rate 600Kb/s and the same Hudi and Spark configuration as 
above.
   
   The most time consuming stage is `Building workload profile` which takes 2.5 
- 12 min, with average around 7 min.
   
   ![Screenshot 2024-04-30 at 19 44 
00](https://github.com/apache/hudi/assets/168409126/ceb6353a-b90f-4abd-8111-5477338701d5)
   
   ![Screenshot 2024-04-30 at 20 37 
15](https://github.com/apache/hudi/assets/168409126/03b7fe99-7eba-4a24-b4b6-446a6b527c67)
   
   So in this case it is around 35-40 MB per minute (the current Structured 
Streaming mini-batch), and workers can go up to 35 GB and 32 cores. 
   Does that look like a sufficient resource config for Hudi to handle such load?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 6:32 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where 
 # Absorbs any read amplification. 

h3.  

Cons:
 # sort order may be disturbed from the re-distribtion of keys. 
 # Can be pretty wasteful, if 

 

 

 


was (Author: vc):
h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where 
 # Absorbs any read amplification. 

h3.  

Cons:
 # sort order may be disturbed from the re-distribtion of keys. 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 6:32 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where 
 # Absorbs any read amplification. 

h3.  

Cons:
 # sort order may be disturbed from the re-distribtion of keys. 

 

 

 


was (Author: vc):
h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 


{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # No storage format changes, 

h3. 

Cons:

 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 6:20 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

Within the finalize section (done within a table level distributed lock), we 
could either have W or C perform the following . 


{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # No storage format changes, 

h3. 

Cons:

 

 

 

 


was (Author: vc):
h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585289704


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala:
##
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.common.model.WriteOperationType
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.testutils.HoodieTestUtils
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+import org.junit.jupiter.api.Tag
+
+import scala.collection.JavaConverters._
+
+@Tag("functional")
+class TestPartitionStatsIndexWithSql extends HoodieSparkSqlTestBase {
+
+  val sqlTempTable = "hudi_tbl"
+
+  test("Test partition stats index following insert, merge into, update and 
delete") {
+withTempDir { tmp =>
+  val tableName = generateTableName
+  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
+  // Create table with date type partition
+  spark.sql(
+s"""
+   | create table $tableName using hudi
+   | partitioned by (dt)
+   | tblproperties(
+   |primaryKey = 'id',
+   |preCombineField = 'ts',
+   |'hoodie.metadata.index.partition.stats.enable' = 'true'
+   | )
+   | location '$tablePath'
+   | AS
+   | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, 
cast('2021-05-06' as date) as dt
+ """.stripMargin
+  )
+
+  assertResult(WriteOperationType.BULK_INSERT) {
+HoodieSparkSqlTestBase.getLastCommitMetadata(spark, 
tablePath).getOperationType
+  }
+  checkAnswer(s"select id, name, price, ts, cast(dt as string) from 
$tableName")(
+Seq(1, "a1", 10, 1000, "2021-05-06")
+  )
+
+  val partitionValue = "2021-05-06"
+
+  // Check the missing properties for spark sql
+  val metaClient = HoodieTableMetaClient.builder()
+.setBasePath(tablePath)
+.setConf(HoodieTestUtils.getDefaultStorageConf)
+.build()
+  val properties = metaClient.getTableConfig.getProps.asScala.toMap
+  
assertResult(true)(properties.contains(HoodieTableConfig.CREATE_SCHEMA.key))
+  assertResult("dt")(properties(HoodieTableConfig.PARTITION_FIELDS.key))
+  assertResult("ts")(properties(HoodieTableConfig.PRECOMBINE_FIELD.key))
+  assertResult(tableName)(metaClient.getTableConfig.getTableName)
+
+  // Test insert into
+  spark.sql(s"insert into $tableName values(2, 'a2', 10, 1000, 
cast('$partitionValue' as date))")
+  checkAnswer(s"select _hoodie_record_key, _hoodie_partition_path, id, 
name, price, ts, cast(dt as string) from $tableName order by id")(

Review Comment:
   I don't see any validation of partition pruning (e.g., in the logical plan) 
in the tests in this class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2086335812

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 9d1ac2a1bd9f2343174a0273437e7a240294eee4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23575)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Implement secondary index [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11129:
URL: https://github.com/apache/hudi/pull/11129#issuecomment-2086299511

   
   ## CI report:
   
   * 0274004b842a332f57c1104de44e4e262ff2942d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23576)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 5:56 PM:
---

h2.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 


was (Author: vc):
h3.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 5:56 PM:
---

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format


was (Author: vc):
[WIP] Approach 2 : Introduce pointer data blocks into storage format

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842466#comment-17842466
 ] 

Vinoth Chandar commented on HUDI-1045:
--

[WIP] Approach 2 : Introduce pointer data blocks into storage format

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842465#comment-17842465
 ] 

Vinoth Chandar commented on HUDI-1045:
--

h3.  [WIP] Approach 1 :  Redistribute records from the conflicting file groups 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841372#comment-17841372
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 5:54 PM:
---

At first it may seem trivial to have clustering fail all the time, ceding 
preference to the incoming writes. But, aside from wasting resources, 
clustering can finish before the writes, and we cannot atomically both roll 
back the clustering (note that restoring a completed action is 
considered/recommended as an offline maintenance operation) and finish the write.


was (Author: vc):
At first it may seem trivial to have clustering fail all the time, ceding 
preference to the incoming writes. But, aside from wasting resources, 
clustering can finish before writes and we cannot atomically both rollback 
clustering as well as finish the write..

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer w writing to file groups f1, f2, f3, concurrently 
> while a clustering service C  reclusters them into  f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7146] Implement secondary index [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #11129:
URL: https://github.com/apache/hudi/pull/11129#issuecomment-2086259091

   
   ## CI report:
   
   * 0274004b842a332f57c1104de44e4e262ff2942d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2086257066

   
   ## CI report:
   
   * 879e07c167692250636215e06e67b6c370496c03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23566)
 
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 9d1ac2a1bd9f2343174a0273437e7a240294eee4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23575)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585254463


##
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##
@@ -67,6 +70,61 @@ public static BaseFileUtils getInstance(HoodieFileFormat 
fileFormat) {
 throw new UnsupportedOperationException(fileFormat.name() + " format not 
supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileColumnRanges List of column range statistics for each file in 
a partition
+   */
+  public static > HoodieColumnRangeMetadata 
getColumnRangeInPartition(@Nonnull List> 
fileColumnRanges) {
+if (fileColumnRanges.size() == 1) {
+  // Only one parquet file, we can just return that range.
+  return fileColumnRanges.get(0);
+}
+// There are multiple files. Compute min(file_mins) and max(file_maxs)
+return fileColumnRanges.stream()
+.sequential()
+.reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static > HoodieColumnRangeMetadata 
mergeRanges(HoodieColumnRangeMetadata one,

Review Comment:
   Got it.  I'm trying to understand how col stats are merged between the base 
and log files, e.g., for custom payloads, since we may not be able to simply 
take the minimum of all minimum values or the maximum of all maximum values 
from the base and log files: there can be deletes, and there can be 
event-time-based merging.  We can discuss this in a separate thread.
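To make the concern concrete, here is a hedged sketch (hypothetical names, not Hudi code) of why a plain min(file_mins)/max(file_maxs) reduce over base and log files only yields a conservative superset once deletes are applied: the deleted record can carry the base file's minimum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration (not Hudi code): naive min/max merging of base-
// and log-file column ranges is only a conservative superset once deletes
// from the log file are applied.
public class RangeMergeSketch {

    // min(file_mins) / max(file_maxs): the plain reduce over file ranges
    static int[] naiveMerge(int[] base, int[] log) {
        return new int[] {Math.min(base[0], log[0]), Math.max(base[1], log[1])};
    }

    public static void main(String[] args) {
        // Base file holds {1, 5, 9}; the log file deletes the record with value 1.
        List<Integer> baseValues = new ArrayList<>(Arrays.asList(1, 5, 9));
        Set<Integer> deletedByLog = Collections.singleton(1);

        int[] naive = naiveMerge(new int[] {1, 9}, new int[] {1, 1});

        baseValues.removeAll(deletedByLog); // visible values after merge: {5, 9}
        int trueMin = Collections.min(baseValues);

        // naive range is [1, 9] but the true post-merge minimum is 5
        System.out.println(naive[0] + " vs " + trueMin); // prints "1 vs 5"
    }
}
```

A superset range is still safe for data skipping, since it can only produce false positives and never prune a matching file, but it is not the exact post-merge range; event-time-based merging raises a similar issue for the merged column values.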



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585250523


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1872,4 +1883,175 @@ public HoodieRecord next() {
   }
 };
   }
+
+  public static HoodieData 
convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,
+ 
List partitionInfoList,

Review Comment:
   I see. Is this for initializing the partition stats only?  Do we have a 
guard to throw an exception if the colstats or partition stats partition is 
initialized on the latest file slices with log files?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1585240628


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1872,4 +1883,175 @@ public HoodieRecord next() {
   }
 };
   }
+
+  public static HoodieData 
convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,
+ 
List partitionInfoList,
+ 
HoodieMetadataConfig metadataConfig,
+ 
HoodieTableMetaClient dataTableMetaClient) {
+final List columnsToIndex = 
metadataConfig.getColumnsEnabledForColumnStatsIndex();
+if (columnsToIndex.isEmpty()) {
+  return engineContext.emptyHoodieData();
+}
+LOG.debug("Indexing following columns for partition stats index: {}", 
columnsToIndex);
+// Create records for MDT
+int parallelism = Math.max(Math.min(partitionInfoList.size(), 
metadataConfig.getPartitionStatsIndexParallelism()), 1);
+return engineContext.parallelize(partitionInfoList, 
parallelism).flatMap(partitionInfo -> {
+  final String partitionPath = partitionInfo.getRelativePath();
+  // Step 1: Collect Column Metadata for Each File (Your existing code 
does this)
+  List>> fileColumnMetadata = 
partitionInfo.getFileNameToSizeMap().keySet().stream()
+  .map(fileName -> getFileStatsRangeMetadata(partitionPath, 
partitionPath + "/" + fileName, dataTableMetaClient, columnsToIndex, false))
+  .collect(toList());
+  // Step 2: Flatten and Group by Column Name
+  Map>> 
columnMetadataMap = fileColumnMetadata.stream()
+  .flatMap(List::stream) // Flatten the list
+  
.collect(Collectors.groupingBy(HoodieColumnRangeMetadata::getColumnName, 
toList())); // Group by column name
+  // Step 3: Aggregate Column Ranges
+  Stream> 
partitionStatsRangeMetadata = columnMetadataMap.entrySet().stream()
+  .map(entry -> 
BaseFileUtils.getColumnRangeInPartition(entry.getValue()));
+  return HoodieMetadataPayload.createPartitionStatsRecords(partitionPath, 
partitionStatsRangeMetadata.collect(toList()), false).iterator();
+});
+  }
+
+  private static List> 
getFileStatsRangeMetadata(String partitionPath,
+   
String filePath,
+   
HoodieTableMetaClient datasetMetaClient,
+   
List columnsToIndex,
+   
boolean isDeleted) {
+String filePartitionPath = filePath.startsWith("/") ? 
filePath.substring(1) : filePath;
+String fileName = FSUtils.getFileName(filePath, partitionPath);
+if (isDeleted) {
+  return columnsToIndex.stream()
+  .map(entry -> HoodieColumnRangeMetadata.stub(fileName, entry))
+  .collect(Collectors.toList());
+}
+return readColumnRangeMetadataFrom(filePartitionPath, datasetMetaClient, 
columnsToIndex);
+  }
+
+  public static HoodieData 
convertMetadataToPartitionStatsRecords(HoodieCommitMetadata commitMetadata,

Review Comment:
   Got it, makes sense.  We can revisit later why it needs two types of 
methods/code paths for generating partition stats records (track it in JIRA).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2086028226

   
   ## CI report:
   
   * 879e07c167692250636215e06e67b6c370496c03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23566)
 
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 9d1ac2a1bd9f2343174a0273437e7a240294eee4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7146] Implement secondary index [hudi]

2024-04-30 Thread via GitHub


codope opened a new pull request, #11129:
URL: https://github.com/apache/hudi/pull/11129

   ### Change Logs
   
   This PR is stacked on #11077. Main changes done here:
   
   - New index type added in `MetadataPartitionType`
   - Initialization of the new index in `HoodieBackedTableMetadataWriter`
   - Util methods to support index creation and update in 
`HoodieTableMetadataUtil`
   
   Another PR will follow this PR, wherein we will have the query-side 
integration and more e2e tests.
   
   ### Impact
   
   Builds secondary index
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2086001550

   
   ## CI report:
   
   * 879e07c167692250636215e06e67b6c370496c03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23566)
 
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7693) Allow Vectorized Reading for bootstrap in the new fg reader under some conditions

2024-04-30 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7693:
-

 Summary: Allow Vectorized Reading for bootstrap in the new fg 
reader under some conditions
 Key: HUDI-7693
 URL: https://issues.apache.org/jira/browse/HUDI-7693
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, spark-sql
Reporter: Jonathan Vexler


Vectorized reading can be used for bootstrap if we don't need to do any 
merging. Additionally, it can be used if no filters are pushed down.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-30 Thread via GitHub


hudi-bot commented on PR #10352:
URL: https://github.com/apache/hudi/pull/10352#issuecomment-2085972850

   
   ## CI report:
   
   * f63dbe172cf8dec2603c266396fb7d31d5cb7f60 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23574)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


