Re: [PR] [HUDI-7804] Improve Flink bucket index partitioner [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136656385

   
   ## CI report:
   
   * c65b89e9ece1ff465ab50a26c439193a08778a77 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24110)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136656307

   
   ## CI report:
   
   * da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11357:
URL: https://github.com/apache/hudi/pull/11357#issuecomment-2136578848

   
   ## CI report:
   
   * 088a3818554187e2819bb14af71758586541fdd9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24116)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136578767

   
   ## CI report:
   
   * 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
   * dd33f1d60747faad458a850e731b604cde17ff05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24114)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11352:
URL: https://github.com/apache/hudi/pull/11352#issuecomment-2136578671

   
   ## CI report:
   
   * 1f275a236d8721168a567e9db9fadd3c6d1af93f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24112)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11354:
URL: https://github.com/apache/hudi/pull/11354#issuecomment-2136578713

   
   ## CI report:
   
   * 77d59e3e1e8f26589ade18fc475c19f0848edf0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24113)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136578807

   
   ## CI report:
   
   * 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
   * c6ee78d4242a4c05c7c0c12376818b383c21c800 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24115)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136577547

   
   ## CI report:
   
   * 1b2ecc10deaab20be3ff90ccf2e4510a6d2e1d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23797)
   * b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24111)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11357:
URL: https://github.com/apache/hudi/pull/11357#issuecomment-2136570130

   
   ## CI report:
   
   * 088a3818554187e2819bb14af71758586541fdd9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136570059

   
   ## CI report:
   
   * 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
   * c6ee78d4242a4c05c7c0c12376818b383c21c800 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136569969

   
   ## CI report:
   
   * 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
   * dd33f1d60747faad458a850e731b604cde17ff05 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11354:
URL: https://github.com/apache/hudi/pull/11354#issuecomment-2136569884

   
   ## CI report:
   
   * 77d59e3e1e8f26589ade18fc475c19f0848edf0d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11352:
URL: https://github.com/apache/hudi/pull/11352#issuecomment-2136569833

   
   ## CI report:
   
   * 1f275a236d8721168a567e9db9fadd3c6d1af93f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136568814

   
   ## CI report:
   
   * 1b2ecc10deaab20be3ff90ccf2e4510a6d2e1d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23797)
   * b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan commented on code in PR #11352:
URL: https://github.com/apache/hudi/pull/11352#discussion_r1618219678


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java:
##
@@ -88,6 +88,9 @@ public static KeyGenerator createKeyGenerator(String keyGeneratorClass, TypedPro
 //Need to prevent overwriting the keygen for spark sql merge into because we need to extract
 //the recordkey from the meta cols if it exists. Sql keygen will use pkless keygen if needed.
 && !props.getBoolean(SPARK_SQL_MERGE_INTO_PREPPED_KEY, false);
+if (autoRecordKeyGen) {

Review Comment:
   HoodieSparkKeyGeneratorFactory.createKeyGenerator is the one that is called from the spark-sql layer, and from HoodieSparkSqlWriter as well.
   
   https://github.com/apache/hudi/assets/513218/4d4862f5-c107-4979-8b89-32ed939e3b9f
   
   https://github.com/apache/hudi/assets/513218/5d7f8f73-392d-4011-b662-86cfcf309e2e
   
   So, we should be good. At least these are the places where auto key gen will be wrapped.
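
To make the wrapping concrete, here is a minimal, self-contained sketch of the pattern this thread discusses. The names below (`KeyGenFactorySketch`, the value of `MERGE_INTO_PREPPED_KEY`) are illustrative placeholders rather than Hudi's exact APIs: the factory builds the configured key generator and wraps it only when auto record key generation is enabled and the write is not a prepped spark-sql MERGE INTO.

```java
import java.util.Properties;
import java.util.UUID;

final class KeyGenFactorySketch {
  // Placeholder config key; the real constant lives in Hudi's source.
  static final String MERGE_INTO_PREPPED_KEY = "spark.sql.merge.into.prepped";

  interface KeyGenerator {
    String getRecordKey(Object row);
  }

  static KeyGenerator create(KeyGenerator configured, Properties props, boolean autoKeyGen) {
    boolean prepped = Boolean.parseBoolean(props.getProperty(MERGE_INTO_PREPPED_KEY, "false"));
    if (autoKeyGen && !prepped) {
      // Prepped MERGE INTO extracts the record key from meta columns,
      // so only non-prepped writes get the auto key gen wrapper.
      return row -> UUID.randomUUID().toString();
    }
    return configured;
  }
}
```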
   






Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136556656

   
   ## CI report:
   
   * 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7805] FileSystemBasedLockProvider should delete the lock file automatically on lock conflict to avoid failing the next write [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136556626

   
   ## CI report:
   
   * 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24106)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]

2024-05-28 Thread via GitHub


nsivabalan commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136556906

   @codope: ready for review; fixed the test failure.
   





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7808:
-
Labels: pull-request-available  (was: )

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]

2024-05-28 Thread via GitHub


yihua opened a new pull request, #11357:
URL: https://github.com/apache/hudi/pull/11357

   ### Change Logs
   
   This PR upgrades io.acryl:datahub-client from 0.8.31 to 0.8.45 to fix 
security vulnerabilities.
   
   Vulnerabilities that will be fixed with an upgrade:
   
   |  | Issue | Score | Upgrade |
   |:-:|:-|:-|:-|
   | ![critical severity](https://res.cloudinary.com/snyk/image/upload/w_20,h_20/v1561977819/icon/c.png 'critical severity') | Incorrect Implementation of Authentication Algorithm [SNYK-JAVA-IOACRYL-3092129](https://snyk.io/vuln/SNYK-JAVA-IOACRYL-3092129) | **811** | io.acryl:datahub-client: `0.8.31` -> `0.8.45` `Proof of Concept` |
   
   ### Impact
   
   Security vulnerability fix.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-28 Thread via GitHub


codope commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1618200904


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime,
+List secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+if (secondaryIndexPartitionsToInit.isEmpty()) {
+  continue;
+}
+ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
   I see your point. Instead of failing, we should still continue to bootstrap 
other index types. I will change the logic a bit.
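
A rough sketch of that adjusted control flow, using hypothetical names (`pendingByType`, `initializePartition`); the real logic lives in HoodieBackedTableMetadataWriter:

```java
import java.util.List;
import java.util.Map;

final class IndexBootstrapSketch {
  // Instead of failing the whole bootstrap when more than one secondary
  // index is pending, initialize what is supported and keep going with
  // the other index types.
  static void bootstrap(Map<String, List<String>> pendingByType) {
    for (Map.Entry<String, List<String>> e : pendingByType.entrySet()) {
      List<String> partitions = e.getValue();
      if (partitions.isEmpty()) {
        continue; // nothing to do for this index type
      }
      if ("secondary_index".equals(e.getKey()) && partitions.size() > 1) {
        // Only one secondary index at a time is supported for now:
        // initialize the first and defer the rest rather than aborting.
        initializePartition(e.getKey(), partitions.get(0));
        continue;
      }
      partitions.forEach(p -> initializePartition(e.getKey(), p));
    }
  }

  static void initializePartition(String indexType, String partition) {
    System.out.printf("initializing %s partition %s%n", indexType, partition);
  }
}
```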






Re: [PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


yihua commented on code in PR #11352:
URL: https://github.com/apache/hudi/pull/11352#discussion_r1618200620


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java:
##
@@ -88,6 +88,9 @@ public static KeyGenerator createKeyGenerator(String keyGeneratorClass, TypedPro
 //Need to prevent overwriting the keygen for spark sql merge into because we need to extract
 //the recordkey from the meta cols if it exists. Sql keygen will use pkless keygen if needed.
 && !props.getBoolean(SPARK_SQL_MERGE_INTO_PREPPED_KEY, false);
+if (autoRecordKeyGen) {

Review Comment:
   So `HoodieAvroKeyGeneratorFactory#createAvroKeyGeneratorByType` should be okay?






Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]

2024-05-28 Thread via GitHub


lokeshj1703 commented on PR #10641:
URL: https://github.com/apache/hudi/pull/10641#issuecomment-2136528074

   @yihua can we merge this PR? The CI is green now.





Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136515317

   
   ## CI report:
   
   * e872d7ce8dfa041b36cfa08808865279c9323592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)
   * da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136515459

   
   ## CI report:
   
   * 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7804] Improve Flink bucket index partitioner [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136515383

   
   ## CI report:
   
   * 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099)
   * c65b89e9ece1ff465ab50a26c439193a08778a77 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24110)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7804] Improve Flink bucket index partitioner [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136508670

   
   ## CI report:
   
   * 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099)
   * c65b89e9ece1ff465ab50a26c439193a08778a77 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136508620

   
   ## CI report:
   
   * e872d7ce8dfa041b36cfa08808865279c9323592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)
   * da5fde85d00993e849321dcadf001fbcd524d1d0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Archive operation should only release the lock when it was acquired [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136508351

   
   ## CI report:
   
   * f1d9aeac0bb49519c41cfd03a2a78b7af0cf1f65 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23702)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7804] Improve Flink bucket index partitioner [hudi]

2024-05-28 Thread via GitHub


xicm commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136504486

   Thanks Danny, eric9204.





Re: [PR] [MINOR] Archive operation should only release the lock when it was acquired [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136501973

   
   ## CI report:
   
   * f1d9aeac0bb49519c41cfd03a2a78b7af0cf1f65 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23702)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Archive operation should only release the lock when it was acquired [hudi]

2024-05-28 Thread via GitHub


Ytimetravel commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136501065

   @danny0405 Sorry, I just saw it. To my knowledge, it is not related to my modifications.





Re: [PR] [MINOR] Archive operation should only release the lock when it was acquired [hudi]

2024-05-28 Thread via GitHub


Ytimetravel commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136488336

   @hudi-bot run azure





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136468653

   
   ## CI report:
   
   * 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136468636

   
   ## CI report:
   
   * 3c915d6faa29733255351f774c8332784bffd294 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7805] FileSystemBasedLockProvider should delete the lock file automatically on lock conflict to avoid failing the next write [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136468619

   
   ## CI report:
   
   * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
   * 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24106)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7805] FileSystemBasedLockProvider should delete the lock file automatically on lock conflict to avoid failing the next write [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136463394

   
   ## CI report:
   
   * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
   * 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136463503

   
   ## CI report:
   
   * 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136463426

   
   ## CI report:
   
   * 3c915d6faa29733255351f774c8332784bffd294 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7809:
-
Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gener
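
A minimal sketch of the idea named in the PR title, assuming Spark's org.apache.spark.util.SerializableConfiguration (public as a DeveloperApi in Spark 3.x) which serializes the wrapped Configuration explicitly for both Java and Kryo serde, so the Configuration is rebuilt on executors even without HoodieSparkKryoRegistrar; the job body below is illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.SerializableConfiguration;

public class SerializableConfSketch {
  public static void run(JavaSparkContext jsc) {
    Configuration hadoopConf = jsc.hadoopConfiguration();
    // Wrap instead of capturing the raw (non-serializable) Configuration.
    SerializableConfiguration serConf = new SerializableConfiguration(hadoopConf);
    jsc.parallelize(java.util.Arrays.asList(1, 2, 3))
       .foreach(i -> {
         // serConf.value() is reconstructed on the executor regardless of
         // whether Java or Kryo serialization shipped the closure.
         Configuration conf = serConf.value();
         System.out.println(conf.get("fs.defaultFS"));
       });
  }
}
{code}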

[PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo … [hudi]

2024-05-28 Thread via GitHub


yihua opened a new pull request, #11355:
URL: https://github.com/apache/hudi/pull/11355

   …serde
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7809:

Description: 
With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPa

Re: [PR] [HUDI-7805] FileSystemBasedLockProvider should delete the lock file automatically on lock conflict to avoid failing the next write [hudi]

2024-05-28 Thread via GitHub


xuzifu666 commented on code in PR #11347:
URL: https://github.com/apache/hudi/pull/11347#discussion_r1618120568


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true);

Review Comment:
   @danny0405 getCurrentOwnerLockInfo is not suitable because it is the lock file's content string, not the path string, so keep it as getLock:
   2192894 [Thread-54214] WARN  org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook
   2192907 [Thread-62779] WARN  org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook
   2192907 [Thread-62781] WARN  org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook
   [INFO] 
   [INFO] Results:
   [INFO] 
   [ERROR] Failures: 
   [ERROR]   TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected:  but was: 
   [ERROR]   TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected:  but was: 
   [ERROR]   TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected:  but was: 
   [ERROR]   TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected:  but was: 
   [ERROR] Errors: 
   [ERROR]   TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict » IO
   [ERROR]   TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution
   [ERROR]   TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution
   [ERROR]   TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution
   [INFO] 
   [ERROR] Tests run: 581, Failures: 4, Errors: 4, Skipped: 0
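
For readers skimming the thread, a simplified sketch of the behavior under discussion, with java.nio standing in for Hadoop's FileSystem API; per the comment above, getLock() returns the lock file path while getCurrentOwnerLockInfo() returns the file's content:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

final class FsLockCleanupSketch {
  interface LockProvider {
    boolean tryLock(long timeout, TimeUnit unit) throws InterruptedException;
  }

  // On a lock-acquisition timeout, a file-system based provider's stale
  // lock file is removed so the next write attempt is not blocked forever.
  static void lockOrCleanUp(LockProvider provider, Path lockFile, long timeoutMs) throws Exception {
    if (!provider.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      Files.deleteIfExists(lockFile); // best-effort cleanup of the lock file path
      throw new IllegalStateException("Unable to acquire lock, cleaned up lock file: " + lockFile);
    }
  }
}
```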
   






Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136415767

   
   ## CI report:
   
   * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue [hudi]

2024-05-28 Thread via GitHub


Ssv-21 commented on issue #11349:
URL: https://github.com/apache/hudi/issues/11349#issuecomment-2136389551

   Hi soumil, 
   Thank you very much for your response,
   From the approaches that you suggested above,
   
   **Approach 1) Using DataSource or DeltaStreamer by setting `_hoodie_is_deleted`:**
   I didn't try this one because I don't want to send my entire data again.
   
   **Approach 2) Using 'hoodie.datasource.write.operation': 'delete':**
   I tried this one but was unable to delete the data; you can find the details in 
my question (a sketch of this approach follows at the end of this message).
   
   But interestingly, these things caught my attention.
   ```
   os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk@11'
   HUDI_VERSION = '0.14.0'
   SPARK_VERSION = '3.4'
   
   SUBMIT_ARGS = f"--packages 
org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} 
pyspark-shell"
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   ```
   
   Do I need to set these things in my Glue job?
   Because I am not using any additional jars or packages in my Glue job, just 
the following job parameters:
   `--datalake-formats hudi`
   and 
   ```
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' 
   --conf 
'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
 
   --conf 
'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' 
   --conf 'spark.sql.hive.convertMetastoreParquet=false' 
   ```
   and I am using the native Python, Spark, and Hudi versions supported in Glue 4.0.
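
   For reference, a minimal sketch of approach 2 as a Spark write (shown in Java purely for illustration; the options map 1:1 to PySpark, and the table name, path, and key fields below are hypothetical):
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class HudiHardDeleteSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().appName("hudi-delete").getOrCreate();
   
       // Rows to hard-delete must carry the record key (and partition path) of the targets.
       Dataset<Row> toDelete = spark.read().format("hudi")
           .load("s3://my-bucket/hudi_table/")            // hypothetical table path
           .where("rider = 'rider-D'")
           .select("uuid", "partitionpath");              // hypothetical key/partition columns
   
       toDelete.write().format("hudi")
           .option("hoodie.table.name", "hudi_table")     // hypothetical table name
           .option("hoodie.datasource.write.recordkey.field", "uuid")
           .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
           .option("hoodie.datasource.write.operation", "delete")
           .mode(SaveMode.Append)
           .save("s3://my-bucket/hudi_table/");
     }
   }
   ```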


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136380318

   
   ## CI report:
   
   * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100)
 
   * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136372598

   
   ## CI report:
   
   * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100)
 
   * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency issues on the rarest of occasions. We should come up with proper guards to ensure we do not make such unintended archival.

 

The major gap we want to guard against is: if someone disabled the cleaner, archival should account for data consistency issues and bail out.

We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for. 

 

a. Keeping aside replace commits, let's dive into specifics for regular commits and delta commits.

Say the user configured clean commits to 4 and the archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. 

    Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit. 

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still never re-enabled. Even though archival would have stopped at t3 (while the savepoint was present), once the savepoint is removed, if archival is executed, it could archive commit t3. This means the file versions tracked at t3 are still not yet cleaned by the cleaner. 

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel, or incremental queries will run into trouble, as they are not supposed to poll for t3. 

At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at commit t3. Just that, for the interim period, some older file versions might still be exposed to readers. 

 

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean those file groups up fully. But are there chances this could go wrong? 

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint, and t4 is a replace commit which replaced file groups tracked in t3. 

The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So the earliest commit to retain will point to t6. Now say the savepoint for t3 is removed, but the cleaner stays disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups. 

 

In other words, to summarize the different scenarios: 

i. The replaced file group is never cleaned up. 
    - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
ii. The replaced file group is cleaned up. 
    - ECTR is > this.rc, and it is good to archive.
iii. The tricky one: ECTR moved ahead of this.rc, but due to a savepoint, full clean up did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now.

 

We have 3 options to solve this.

Option A: 

Let the savepoint deletion flow take care of cleaning up the files it is tracking. 

Cons:

Savepoint's responsibility does not include removing data files, so from a single-responsibility standpoint this may not be right. Also, this clean up might need to do what a clean planner would actually do, i.e. build the file system view, understand whether something was supposed to be cleaned up already, and only then clean up the qualifying files. E.g., if a file group has only one file slice, it should not be cleaned up, and scenarios like that. 

 

Option B:

Since archival is what might cause the data consistency issues, why not have archival do the clean up? 

We would need to account for concurrent cleans, failure and retry scenarios, etc. We might also need to build the file system view and then decide whether something needs to be cleaned up before archiving it. 

Cons:

Again, the single-responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files. 

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another metadata entry named "EarliestCommitToArchive". Strictly speaking, earliest commit to retain
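
A minimal sketch of the Option C boundary (illustrative only; the names are hypothetical, not Hudi's actual API):

{code:java}
// Hudi instant timestamps compare chronologically as strings.
public class ArchivalGuardSketch {
  /** Archival may only touch instants strictly before the returned boundary. */
  public static String archivalBoundary(String earliestCommitToRetain,
                                        String earliestCommitToArchive) {
    // Never cross the older of the two markers, so a savepoint-skipped
    // replace commit is not archived before the cleaner has processed it.
    return earliestCommitToArchive.compareTo(earliestCommitToRetain) < 0
        ? earliestCommitToArchive
        : earliestCommitToRetain;
  }
}
{code}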

[jira] [Updated] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7655:
--
Fix Version/s: 1.0.0

> Support configuration for clean to fail execution if there is at least one 
> file is marked as a failed delete
> 
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: clean, pull-request-available
> Fix For: 1.0.0
>
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed 
> as deleted (or non-existing) will be marked as a "failed delete". Although 
> these failed deletes will be added to `.clean` metadata, if incremental clean 
> is used then these files might not ever be picked up again by a future clean 
> plan, unless a "full-scan" clean ends up being scheduled. In addition to 
> leading to more files unnecessarily taking up storage space for longer, this 
> can lead to the following dataset consistency issue for COW datasets:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some completed instants later an incremental clean is scheduled. It moves 
> the "earliest commit to retain" to an time after instant time RC2, so it 
> targets f1 for deletion. But during execution of the plan, it fails to delete 
> f1.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition directly 
> from the DFS file system calls (without directly using MDT FILES partition) 
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in 
> active timeline. This is a data consistency issue, and will only be resolved 
> if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to 
> fail execution of a clean plan unless all files are confirmed as deleted (or 
> not existing in DFS already), "blocking" the clean. The next clean attempt 
> will re-execute this existing plan, since clean plans cannot be "rolled 
> back". 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With 0.14
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.schedule
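
A minimal sketch of the fix direction named in the title, wrapping the Hadoop configuration in Spark's SerializableConfiguration so Kryo round-trips don't null it out (the surrounding wiring is illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.util.SerializableConfiguration;

public class SerializableConfSketch {
  public static void main(String[] args) {
    Configuration hadoopConf = new Configuration();
    // Wrap before shipping to executors; a bare Configuration can come back
    // null after Kryo serde, causing an NPE like the one above.
    SerializableConfiguration serConf = new SerializableConfiguration(hadoopConf);
    Configuration unwrapped = serConf.value(); // executor-side access
    System.out.println(unwrapped != null);
  }
}
{code}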

[jira] [Assigned] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7809:
---

Assignee: Ethan Guo

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7809:
---

 Summary: Use Spark SerializableConfiguration to avoid NPE in Kryo 
serde
 Key: HUDI-7809
 URL: https://issues.apache.org/jira/browse/HUDI-7809
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Fix Version/s: 0.15.0
   1.0.0

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Labels: hoodie-storage  (was: )

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


xuzifu666 commented on code in PR #11347:
URL: https://github.com/apache/hudi/pull/11347#discussion_r1618052644


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if 
(!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), 
TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) 
getLockProvider()).getLock()), true);

Review Comment:
   > There is `getCurrentOwnerLockInfo` which can identify where the lock 
triggers from.
   
   Changed it to getCurrentOwnerLockInfo @danny0405 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


xuzifu666 commented on code in PR #11347:
URL: https://github.com/apache/hudi/pull/11347#discussion_r1618051269


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if 
(!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), 
TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) 
getLockProvider()).getLock()), true);

Review Comment:
   In our case it deletes the stale lock file successfully, and this approach 
works directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]

2024-05-28 Thread via GitHub


danny0405 commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136344712

   Thanks for the contribution, I have reviewed and applied a patch here: 
   
[7804.patch.zip](https://github.com/apache/hudi/files/15477417/7804.patch.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


KnightChess commented on code in PR #11347:
URL: https://github.com/apache/hudi/pull/11347#discussion_r1618035707


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if 
(!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), 
TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) 
getLockProvider()).getLock()), true);

Review Comment:
   @xuzifu666 it will cause data quality problems. How about setting a lock 
expiry time?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if 
(!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), 
TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) 
getLockProvider()).getLock()), true);

Review Comment:
   @danny0405 I think `getCurrentOwnerLockInfo` can only identify who triggered 
the lock, but if that trigger has failed, no one can delete the lock file. I 
think the better solution is to set an expiry time.
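
   A minimal sketch of that expiry idea (illustrative only; the TTL and helper are hypothetical, not Hudi's actual API):
   ```java
   import java.io.IOException;
   
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class LockExpirySketch {
     /** Deletes the lock file if it is older than ttlMs; returns true if the path is now free. */
     public static boolean expireStaleLock(FileSystem fs, Path lockFile, long ttlMs)
         throws IOException {
       if (!fs.exists(lockFile)) {
         return true; // nothing to expire
       }
       long ageMs = System.currentTimeMillis() - fs.getFileStatus(lockFile).getModificationTime();
       if (ageMs > ttlMs) {
         // Holder is presumed dead; best-effort delete so the next writer can proceed.
         return fs.delete(lockFile, false);
       }
       return false; // lock is still considered live
     }
   }
   ```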



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2136326212

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 00b4e2d253b857c9387964eacfbfa46949a21580 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24092)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136320922

   
   ## CI report:
   
   * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136320894

   
   ## CI report:
   
   * 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11265:
URL: https://github.com/apache/hudi/pull/11265#issuecomment-2136320701

   
   ## CI report:
   
   * 5699a637e30e3fafd056ab25d668640c5c8700c0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24101)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5505) Compaction NUM_COMMITS policy should only judge completed deltacommit

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5505.

Fix Version/s: 1.0.0
Reviewers: Danny Chen
   Resolution: Fixed

Fixed via master branch: 42243862f0271fda16e70afdbfde61b47792ff70

> Compaction NUM_COMMITS policy should only judge completed deltacommit
> -
>
> Key: HUDI-5505
> URL: https://issues.apache.org/jira/browse/HUDI-5505
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, table-service
>Reporter: HunterXHunter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2023-01-05-13-10-57-918.png
>
>
> `compaction.delta_commits =1`
>  
> {code:java}
> 20230105115229301.deltacommit
> 20230105115229301.deltacommit.inflight
> 20230105115229301.deltacommit.requested
> 20230105115253118.commit
> 20230105115253118.compaction.inflight
> 20230105115253118.compaction.requested
> 20230105115330994.deltacommit.inflight
> 20230105115330994.deltacommit.requested{code}
> The return result of `ScheduleCompactionActionExecutor.needCompact` is 
> `true`. This should not be expected.
>  
> And in the `OCC` or `lazy clean` mode, this will cause compaction to trigger 
> early.
> `compaction.delta_commits =3`
>  
> {code:java}
> 20230105125650541.deltacommit.inflight
> 20230105125650541.deltacommit.requested
> 20230105125715081.deltacommit
> 20230105125715081.deltacommit.inflight
> 20230105125715081.deltacommit.requested
> 20230105130018070.deltacommit.inflight
> 20230105130018070.deltacommit.requested {code}
>  
> And compaction will be trigger, this should not be expected.
> !image-2023-01-05-13-10-57-918.png|width=699,height=158!
>  
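
A minimal sketch of the intended NUM_COMMITS check (illustrative only; the actual fix lands in CompactionUtils, shown later in this digest):

{code:java}
public class NumCommitsPolicySketch {
  /** Only *completed* delta commits since the last compaction should count. */
  public static boolean needCompact(int completedDeltaCommitsSinceLastCompaction,
                                    int maxDeltaCommitsBeforeCompaction) {
    // Counting requested/inflight delta commits would trigger compaction
    // early under OCC or lazy-clean modes.
    return completedDeltaCommitsSinceLastCompaction >= maxDeltaCommitsBeforeCompaction;
  }
}
{code}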



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]

2024-05-28 Thread via GitHub


danny0405 merged PR #11251:
URL: https://github.com/apache/hudi/pull/11251


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-5505] Fix counting of delta commits since last compaction in ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)

2024-05-28 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 42243862f02 [HUDI-5505] Fix counting of delta commits since last 
compaction in ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)
42243862f02 is described below

commit 42243862f0271fda16e70afdbfde61b47792ff70
Author: Alexander Erofeev <81576532+a-erof...@users.noreply.github.com>
AuthorDate: Wed May 29 07:24:20 2024 +0700

[HUDI-5505] Fix counting of delta commits since last compaction in 
ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)
---
 .../compact/ScheduleCompactionActionExecutor.java  |  2 +-
 .../table/action/compact/TestHoodieCompactor.java  | 34 +-
 .../apache/hudi/common/util/CompactionUtils.java   | 15 ++
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
index 941d93fd350..5dc0f067a3d 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
@@ -129,7 +129,7 @@ public class ScheduleCompactionActionExecutor 
extends BaseActionExec
 
   private Option> getLatestDeltaCommitInfo() {
 Option> deltaCommitsInfo =
-
CompactionUtils.getDeltaCommitsSinceLatestCompaction(table.getActiveTimeline());
+
CompactionUtils.getCompletedDeltaCommitsSinceLatestCompaction(table.getActiveTimeline());
 if (deltaCommitsInfo.isPresent()) {
   return Option.of(Pair.of(
   deltaCommitsInfo.get().getLeft().countInstants(),
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
index deda77376af..78086a64359 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
@@ -89,8 +89,12 @@ public class TestHoodieCompactor extends 
HoodieSparkClientTestHarness {
   }
 
   public HoodieWriteConfig getConfig() {
+return getConfig(1);
+  }
+
+  public HoodieWriteConfig getConfig(int numCommitsBeforeCompaction) {
 return getConfigBuilder()
-
.withCompactionConfig(HoodieCompactionConfig.newBuilder().withMaxNumDeltaCommitsBeforeCompaction(1).build())
+
.withCompactionConfig(HoodieCompactionConfig.newBuilder().withMaxNumDeltaCommitsBeforeCompaction(numCommitsBeforeCompaction).build())
 .withMetricsConfig(getMetricsConfig())
 .build();
   }
@@ -181,6 +185,34 @@ public class TestHoodieCompactor extends 
HoodieSparkClientTestHarness {
 }
   }
 
+  @Test
+  public void testNeedCompactionCondition() throws Exception {
+HoodieWriteConfig config = getConfig(3);
+try (SparkRDDWriteClient writeClient = getHoodieWriteClient(config)) {
+  // insert 100 records.
+  String newCommitTime = "100";
+  writeClient.startCommitWithTime(newCommitTime);
+
+  // commit 1
+  List records = dataGen.generateInserts(newCommitTime, 100);
+  JavaRDD recordsRDD = jsc.parallelize(records, 1);
+  writeClient.insert(recordsRDD, newCommitTime).collect();
+
+  // commit 2
+  updateRecords(config, "101", records);
+
+  // commit 3 (inflight)
+  newCommitTime = "102";
+  writeClient.startCommitWithTime(newCommitTime);
+  metaClient.getActiveTimeline().transitionRequestedToInflight(new 
HoodieInstant(State.REQUESTED,
+  HoodieTimeline.DELTA_COMMIT_ACTION, newCommitTime), Option.empty());
+
+  // check that compaction will not be scheduled
+  String compactionTime = "107";
+  assertFalse(writeClient.scheduleCompactionAtInstant(compactionTime, 
Option.empty()));
+}
+  }
+
   @Test
   public void testWriteStatusContentsAfterCompaction() throws Exception {
 // insert 100 records
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
index 321fdd0fce5..5c2270db919 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
@@ -274,6 +274,21 @@ public class CompactionUtils {
 return 
metaClient.getActiveTimeline().filterPendingCompactionTimeline().getInstants();
   }
 
+  /**
+   * 

Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]

2024-05-28 Thread via GitHub


danny0405 commented on code in PR #11347:
URL: https://github.com/apache/hudi/pull/11347#discussion_r1618024145


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:
##
@@ -81,12 +85,16 @@ public void lock() {
 metrics.startLockApiTimerContext();
 if 
(!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), 
TimeUnit.MILLISECONDS)) {
   metrics.updateLockNotAcquiredMetric();
+  if (getLockProvider() instanceof FileSystemBasedLockProvider) {
+fs.delete(new Path(((FileSystemBasedLockProvider) 
getLockProvider()).getLock()), true);

Review Comment:
   There is `getCurrentOwnerLockInfo` which can identify where the lock 
triggers from.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]

2024-05-28 Thread via GitHub


danny0405 commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136302609

   > Perhaps a Function named partitionIndexFunc can be added inside 
BucketIndexPartitioner. When constructing BucketIndexPartitioner, the 
calculation logic used to initialize partitionIndexFunc is determined by the 
relationship between parallelism and bucketNum.
   
   Yeah, let's give it a try; at least the parallelism and bucket_num are 
constant values at runtime.
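
   As a rough illustration of that idea (a sketch only; names are hypothetical, not this PR's actual code):
   ```java
   import java.util.function.IntUnaryOperator;
   
   public class BucketIndexPartitionMappingSketch {
     /** Picks the bucket-to-subtask mapping once, since parallelism and bucketNum are constant. */
     public static IntUnaryOperator partitionIndexFunc(int parallelism, int bucketNum) {
       if (parallelism >= bucketNum) {
         // Each bucket gets a dedicated subtask.
         return bucketId -> bucketId;
       }
       // Buckets are spread evenly over the available subtasks.
       return bucketId -> bucketId % parallelism;
     }
   
     public static void main(String[] args) {
       IntUnaryOperator f = partitionIndexFunc(4, 16);
       System.out.println(f.applyAsInt(9)); // prints 1
     }
   }
   ```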


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7795] Fix loading of input splits from look up table reader (#11297)

2024-05-28 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2291e995feb [HUDI-7795] Fix loading of input splits from look up table 
reader (#11297)
2291e995feb is described below

commit 2291e995feb3027ac35305383518e4bd6677b9b1
Author: Danny Chan 
AuthorDate: Wed May 29 08:17:05 2024 +0800

[HUDI-7795] Fix loading of input splits from look up table reader (#11297)
---
 .../apache/hudi/table/lookup/HoodieLookupFunction.java |  2 +-
 .../hudi/table/lookup/HoodieLookupTableReader.java | 18 --
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
index fe32a439f72..a43bf1189fb 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
@@ -47,7 +47,7 @@ import java.util.List;
 import java.util.Map;
 
 /**
- * Lookup function for filesystem connector tables.
+ * Lookup function for Hoodie dimension table.
  *
  * Note: reference Flink FileSystemLookupFunction to avoid additional 
connector jar dependencies.
  */
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
index 0460cd42691..642a03ee65b 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
@@ -29,6 +29,9 @@ import org.jetbrains.annotations.Nullable;
 
 import java.io.IOException;
 import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
 
 /**
  * Hudi look up table reader.
@@ -41,6 +44,8 @@ public class HoodieLookupTableReader implements Serializable {
 
   private InputFormat inputFormat;
 
+  private List inputSplits;
+
   public HoodieLookupTableReader(SerializableSupplier> 
inputFormatSupplier, Configuration conf) {
 this.inputFormatSupplier = inputFormatSupplier;
 this.conf = conf;
@@ -49,15 +54,24 @@ public class HoodieLookupTableReader implements 
Serializable {
   public void open() throws IOException {
 this.inputFormat = inputFormatSupplier.get();
 inputFormat.configure(conf);
-InputSplit[] inputSplits = inputFormat.createInputSplits(1);
+this.inputSplits = 
Arrays.stream(inputFormat.createInputSplits(1)).collect(Collectors.toList());
 ((RichInputFormat) inputFormat).openInputFormat();
-inputFormat.open(inputSplits[0]);
+inputFormat.open(inputSplits.remove(0));
   }
 
   @Nullable
   public RowData read(RowData reuse) throws IOException {
 if (!inputFormat.reachedEnd()) {
   return (RowData) inputFormat.nextRecord(reuse);
+} else {
+  while (!inputSplits.isEmpty()) {
+// release the last itr first.
+inputFormat.close();
+inputFormat.open(inputSplits.remove(0));
+if (!inputFormat.reachedEnd()) {
+  return (RowData) inputFormat.nextRecord(reuse);
+}
+  }
 }
 return null;
   }
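
For context, a minimal standalone sketch of the split-draining pattern this fix introduces (illustrative only; not tied to the actual Flink types):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SplitDrainingReaderSketch {
  private final List<Iterator<String>> splits;
  private Iterator<String> current;

  public SplitDrainingReaderSketch(List<Iterator<String>> inputSplits) {
    // Assumes at least one split, mirroring the reader's open() above.
    this.splits = new ArrayList<>(inputSplits);
    this.current = this.splits.remove(0);
  }

  /** Returns the next record, advancing across later splits instead of stopping at the first. */
  public String read() {
    if (current.hasNext()) {
      return current.next();
    }
    while (!splits.isEmpty()) {
      current = splits.remove(0); // release previous, open next
      if (current.hasNext()) {
        return current.next();
      }
    }
    return null;
  }
}
```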



Re: [PR] [HUDI-7795] Fix loading of input splits from look up table reader [hudi]

2024-05-28 Thread via GitHub


danny0405 merged PR #11297:
URL: https://github.com/apache/hudi/pull/11297


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang reassigned HUDI-7806:
-

Assignee: Davis Zhang

> Skip fail on data-loss for first commit on Kafka Source
> ---
>
> Key: HUDI-7806
> URL: https://issues.apache.org/jira/browse/HUDI-7806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Davis Zhang
>Assignee: Davis Zhang
>Priority: Major
>
> When the ingestion attempts to start from the beginning of the topic, we 
> should not fail on data loss since topic retention can cause failures when 
> some data is removed before our ingestion is able to fully read the offsets.
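
A rough sketch of the intended behavior (illustrative only; not the actual Kafka source code):

{code:java}
public class FirstCommitOffsetSketch {
  /** Resolves the start offset, skipping the data-loss failure on the first commit. */
  public static long resolveStartOffset(boolean isFirstCommit,
                                        long requestedOffset,
                                        long earliestAvailableOffset) {
    if (requestedOffset >= earliestAvailableOffset) {
      return requestedOffset; // nothing was lost to retention
    }
    if (isFirstCommit) {
      // Starting from the beginning of the topic: clamp instead of failing,
      // since retention may remove data before ingestion reads it.
      return earliestAvailableOffset;
    }
    throw new IllegalStateException("Offsets [" + requestedOffset + ", "
        + earliestAvailableOffset + ") were lost to topic retention");
  }
}
{code}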



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7808:
---

 Summary: Security upgrade io.acryl:datahub-client from 0.8.31 to 
0.8.45
 Key: HUDI-7808
 URL: https://issues.apache.org/jira/browse/HUDI-7808
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Fix Version/s: 1.0.0

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7808:
---

Assignee: Ethan Guo

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136272540

   
   ## CI report:
   
   * e872d7ce8dfa041b36cfa08808865279c9323592 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11224:
URL: https://github.com/apache/hudi/pull/11224#issuecomment-2136272306

   
   ## CI report:
   
   * 0c01b0781e8c49da0f07a2379050c2be204cf373 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24094)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11251:
URL: https://github.com/apache/hudi/pull/11251#issuecomment-2136272369

   
   ## CI report:
   
   * b1dc1229697ca188c09318a32c90d868c88d70f1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24095)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7507] Adding timestamp ordering validation before creating requested timeli… [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11344:
URL: https://github.com/apache/hudi/pull/11344#issuecomment-2136272563

   
   ## CI report:
   
   * 0397e80a3f71a4c9180a08cdd03ad16d7f313661 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24098)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan closed pull request #11351: [HUDI-7807] Fixing spark-sql for pk less 
tables
URL: https://github.com/apache/hudi/pull/11351


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan closed pull request #11353: [HUDI-7807] Fixing spark-sql for pk less 
tables
URL: https://github.com/apache/hudi/pull/11353


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan opened a new pull request, #11354:
URL: https://github.com/apache/hudi/pull/11354

   ### Change Logs
   
   Fixing spark-sql for pk less tables
   
   ### Impact
   
   Fixing spark-sql for pk less tables
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan opened a new pull request, #11353:
URL: https://github.com/apache/hudi/pull/11353

   ### Change Logs
   
   Fixing spark-sql for pk less tables
   
   ### Impact
   
   Fixing spark-sql for pk less tables
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] avoid listing files for empty tables [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2136232508

   
   ## CI report:
   
   * c62bc211274fbe2b31dd8d07d7ede8ecae5f6d64 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan opened a new pull request, #11352:
URL: https://github.com/apache/hudi/pull/11352

   ### Change Logs
   
   Fixing spark-sql for pk less tables
   
   ### Impact
   
   Fixing spark-sql for pk less tables
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7807:
-
Labels: pull-request-available  (was: )

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Quick start fails when trying to UPDATE with spark-sql for a pk-less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.com
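The interesting detail in the trace is that the reflective loader hides the real 
failure: ReflectionUtils.loadClass wraps whatever the key generator constructor 
throws into a generic "Unable to instantiate class" error. A minimal sketch of 
that failure mode, assuming a SimpleKeyGenerator-style constructor that requires 
a record key field (the demo class, config key check, and exception handling are 
illustrative, not Hudi's actual code):

{code:java}
import java.util.Properties;

public class ReflectiveLoadSketch {
  // Hypothetical stand-in for a key generator that, like SimpleKeyGenerator,
  // expects a record key field to be present in the write config.
  public static class DemoKeyGenerator {
    public DemoKeyGenerator(Properties props) {
      if (props.getProperty("hoodie.datasource.write.recordkey.field") == null) {
        // For a pk-less table this property is absent, so construction fails.
        throw new IllegalArgumentException("record key field is not set");
      }
    }
  }

  // Mirrors the shape of the reflective loading seen in the stack trace: any
  // constructor failure is rethrown as a generic instantiation error, which is
  // why the log only shows "Unable to instantiate class ...".
  static Object loadClass(String clazzName, Properties props) {
    try {
      return Class.forName(clazzName)
          .getConstructor(Properties.class)
          .newInstance(props);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Unable to instantiate class " + clazzName, e);
    }
  }

  public static void main(String[] args) {
    // No record key configured, as for a pk-less table: the root cause ends up
    // buried in the wrapped exception's cause chain.
    loadClass(DemoKeyGenerator.class.getName(), new Properties());
  }
}
{code}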

[PR] [HUDI-7807] Fixing spark-sql for pk-less tables [hudi]

2024-05-28 Thread via GitHub


nsivabalan opened a new pull request, #11351:
URL: https://github.com/apache/hudi/pull/11351

   ### Change Logs
   
   Fixing spark-sql for pk-less tables.
   
   ### Impact
   
   Fixing spark-sql for pk-less tables
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7807) spark-sql updates for a pk-less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7807:
-

 Summary: spark-sql updates for a pk-less table fails w/ 
partitioned table 
 Key: HUDI-7807
 URL: https://issues.apache.org/jira/browse/HUDI-7807
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: sivabalan narayanan


Quick start fails when trying to UPDATE with spark-sql for a pk-less table. 

 
{code:java}
         > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare 
= 25.0 WHERE rider = 'rider-D']
org.apache.hudi.exception.HoodieException: Unable to instantiate class 
org.apache.hudi.keygen.SimpleKeyGenerator
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
at 
org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
at 
org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
at 
org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWr
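For reference, a rough reproduction sketch of the quick-start flow described 
above. The table name matches the issue; the schema, session configs, and local 
master are assumptions based on the Hudi Spark quick start, so the exact DDL may 
differ:

{code:java}
import org.apache.spark.sql.SparkSession;

public class PkLessUpdateRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-7807-repro")
        .master("local[2]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
        .getOrCreate();

    // A partitioned Hudi table declared without a primaryKey, i.e. pk-less.
    spark.sql("CREATE TABLE hudi_table4 (uuid STRING, rider STRING, fare DOUBLE, "
        + "city STRING) USING hudi PARTITIONED BY (city)");
    spark.sql("INSERT INTO hudi_table4 VALUES ('u1', 'rider-D', 10.0, 'san_francisco')");

    // The statement from the issue; before the fix it failed with
    // "Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator".
    spark.sql("UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'");

    spark.sql("SELECT uuid, rider, fare, city FROM hudi_table4").show();
    spark.stop();
  }
}
{code}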

[jira] [Updated] (HUDI-7807) spark-sql updates for a pk-less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7807:
--
Fix Version/s: 0.15.0
   1.0.0

> spark-sql updates for a pk-less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> Quick start fails when trying to UPDATE with spark-sql for a pk-less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.s

[jira] [Assigned] (HUDI-7807) spark-sql updates for a pk-less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7807:
-

Assignee: sivabalan narayanan

> spark-sql updates for a pk-less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Quick start fails when trying to UPDATE with spark-sql for a pk-less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
>   at 
> org.apache.spark.sql.execu

[jira] [Created] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)
Davis Zhang created HUDI-7806:
-

 Summary: Skip fail on data-loss for first commit on Kafka Source
 Key: HUDI-7806
 URL: https://issues.apache.org/jira/browse/HUDI-7806
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Davis Zhang


When ingestion starts from the beginning of the topic, we should not fail on 
data loss: topic retention can remove the earliest records before ingestion is 
able to read those offsets, so a gap seen on the first commit is expected 
rather than a real loss of previously checkpointed data.
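
A minimal sketch of the proposed check, under stated assumptions: the helper 
names (checkpointedOffset, earliestAvailable, isFirstCommit) are hypothetical, 
and the real KafkaOffsetGen logic may differ.

{code:java}
public class FirstCommitOffsetPolicy {
  /**
   * Resolves the starting offset for one partition. Sketch of the proposed
   * rule: a checkpointed offset older than the earliest retained offset is
   * only treated as fatal once at least one commit has already succeeded.
   */
  static long resolveStartOffset(long checkpointedOffset,
                                 long earliestAvailable,
                                 boolean isFirstCommit,
                                 boolean failOnDataLoss) {
    if (checkpointedOffset < earliestAvailable) {
      if (failOnDataLoss && !isFirstCommit) {
        throw new IllegalStateException("Data loss: offset " + checkpointedOffset
            + " was removed by retention; earliest available is " + earliestAvailable);
      }
      // First commit (or fail-on-data-loss disabled): retention may have
      // trimmed the head of the topic, so fall forward instead of failing.
      return earliestAvailable;
    }
    return checkpointedOffset;
  }

  public static void main(String[] args) {
    // Retention trimmed offsets below 120; the first commit proceeds from 120.
    System.out.println(resolveStartOffset(0L, 120L, true, true));
  }
}
{code}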



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [MINOR] Improve TestHoodieHadoopIOFactory (#11342)

2024-05-28 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new cb9daa6c227 [MINOR] Improve TestHoodieHadoopIOFactory (#11342)
cb9daa6c227 is described below

commit cb9daa6c227dc42415aee71db0499935db4e1599
Author: Y Ethan Guo 
AuthorDate: Tue May 28 15:07:02 2024 -0700

[MINOR] Improve TestHoodieHadoopIOFactory (#11342)
---
 .../hudi/io/hadoop/TestHoodieHadoopIOFactory.java  | 37 --
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java b/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
index 86341b3c963..7aaf811e737 100644
--- a/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
+++ b/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
@@ -25,11 +25,14 @@ import org.apache.hudi.common.util.OrcUtils;
 import org.apache.hudi.common.util.ParquetUtils;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
 import org.apache.hudi.io.storage.HoodieIOFactory;
+import org.apache.hudi.storage.HoodieStorage;
 import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
 
 import org.junit.jupiter.api.Test;
 
+import java.io.IOException;
+
 import static org.apache.hudi.common.testutils.HoodieTestUtils.getDefaultStorageConf;
 import static org.apache.hudi.storage.HoodieStorageUtils.DEFAULT_URI;
 import static org.junit.jupiter.api.Assertions.assertThrows;
@@ -41,21 +44,23 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
  */
 public class TestHoodieHadoopIOFactory {
   @Test
-  public void testGetFileFormatUtils() {
-    HoodieIOFactory ioFactory = new HoodieHadoopIOFactory(
-        new HoodieHadoopStorage(HadoopFSUtils.getFs(DEFAULT_URI, getDefaultStorageConf())));
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.parquet")) instanceof ParquetUtils);
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.orc")) instanceof OrcUtils);
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.hfile")) instanceof HFileUtils);
-    assertThrows(
-        UnsupportedOperationException.class,
-        () -> ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.log")));
-
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.PARQUET) instanceof ParquetUtils);
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.ORC) instanceof OrcUtils);
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.HFILE) instanceof HFileUtils);
-    assertThrows(
-        UnsupportedOperationException.class,
-        () -> ioFactory.getFileFormatUtils(HoodieFileFormat.HOODIE_LOG));
+  public void testGetFileFormatUtils() throws IOException {
+    try (HoodieStorage storage =
+             new HoodieHadoopStorage(HadoopFSUtils.getFs(DEFAULT_URI, getDefaultStorageConf()))) {
+      HoodieIOFactory ioFactory = new HoodieHadoopIOFactory(storage);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.parquet")) instanceof ParquetUtils);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.orc")) instanceof OrcUtils);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.hfile")) instanceof HFileUtils);
+      assertThrows(
+          UnsupportedOperationException.class,
+          () -> ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.log")));
+
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.PARQUET) instanceof ParquetUtils);
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.ORC) instanceof OrcUtils);
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.HFILE) instanceof HFileUtils);
+      assertThrows(
+          UnsupportedOperationException.class,
+          () -> ioFactory.getFileFormatUtils(HoodieFileFormat.HOODIE_LOG));
+    }
   }
 }
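
The substance of the change: the storage instance is now scoped with 
try-with-resources, so the underlying filesystem handle is closed even when an 
assertion fails mid-test. A generic, self-contained sketch of that pattern 
(names are illustrative):

{code:java}
public class CloseOnFailureSketch {
  // Illustrative resource standing in for a storage or filesystem handle.
  static final class Handle implements AutoCloseable {
    @Override
    public void close() {
      System.out.println("closed");
    }
  }

  public static void main(String[] args) {
    try {
      try (Handle h = new Handle()) {
        // Simulates an assertion failing inside the try block.
        throw new AssertionError("test failure");
      }
    } catch (AssertionError expected) {
      // close() already ran before this handler: prints "closed" first.
      System.out.println("caught: " + expected.getMessage());
    }
  }
}
{code}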



Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]

2024-05-28 Thread via GitHub


yihua commented on PR #11342:
URL: https://github.com/apache/hudi/pull/11342#issuecomment-2136184055

   Azure CI is green.
   (screenshot: https://github.com/apache/hudi/assets/2497195/55f61eb9-30cf-419a-954e-69bada67862c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]

2024-05-28 Thread via GitHub


yihua merged PR #11342:
URL: https://github.com/apache/hudi/pull/11342


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2136179574

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 05041565a628a2883ec642ecdfcddf0d66f4e7ce Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11342:
URL: https://github.com/apache/hudi/pull/11342#issuecomment-2136172348

   
   ## CI report:
   
   * ba8bc967911aef6100211a064e5f8c98a1c62da9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24096)
 
   * ba8bc96 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2136163343

   
   ## CI report:
   
   * 98d96be857a9e87e74a4dee99cb31b162ade5aa3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]

2024-05-28 Thread via GitHub


hudi-bot commented on PR #11265:
URL: https://github.com/apache/hudi/pull/11265#issuecomment-2136127128

   
   ## CI report:
   
   * 134dda251bbfac3cc092ec3f64b3cf3c8b17e2e1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24017)
 
   * 5699a637e30e3fafd056ab25d668640c5c8700c0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24101)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


