Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136656385

## CI report:

* c65b89e9ece1ff465ab50a26c439193a08778a77 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24110)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org. For queries about this service, please contact Infrastructure at: us...@infra.apache.org.
Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]
hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136656307

## CI report:

* da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
Re: [PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]
hudi-bot commented on PR #11357:
URL: https://github.com/apache/hudi/pull/11357#issuecomment-2136578848

## CI report:

* 088a3818554187e2819bb14af71758586541fdd9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24116)
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136578767

## CI report:

* 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
* dd33f1d60747faad458a850e731b604cde17ff05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24114)
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
hudi-bot commented on PR #11352:
URL: https://github.com/apache/hudi/pull/11352#issuecomment-2136578671

## CI report:

* 1f275a236d8721168a567e9db9fadd3c6d1af93f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24112)
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
hudi-bot commented on PR #11354:
URL: https://github.com/apache/hudi/pull/11354#issuecomment-2136578713

## CI report:

* 77d59e3e1e8f26589ade18fc475c19f0848edf0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24113)
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136578807

## CI report:

* 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
* c6ee78d4242a4c05c7c0c12376818b383c21c800 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24115)
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
hudi-bot commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136577547

## CI report:

* 1b2ecc10deaab20be3ff90ccf2e4510a6d2e1d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23797)
* b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24111)
Re: [PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]
hudi-bot commented on PR #11357:
URL: https://github.com/apache/hudi/pull/11357#issuecomment-2136570130

## CI report:

* 088a3818554187e2819bb14af71758586541fdd9 UNKNOWN
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136570059

## CI report:

* 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
* c6ee78d4242a4c05c7c0c12376818b383c21c800 UNKNOWN
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136569969

## CI report:

* 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
* dd33f1d60747faad458a850e731b604cde17ff05 UNKNOWN
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
hudi-bot commented on PR #11354:
URL: https://github.com/apache/hudi/pull/11354#issuecomment-2136569884

## CI report:

* 77d59e3e1e8f26589ade18fc475c19f0848edf0d UNKNOWN
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
hudi-bot commented on PR #11352:
URL: https://github.com/apache/hudi/pull/11352#issuecomment-2136569833

## CI report:

* 1f275a236d8721168a567e9db9fadd3c6d1af93f UNKNOWN
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
hudi-bot commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136568814

## CI report:

* 1b2ecc10deaab20be3ff90ccf2e4510a6d2e1d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23797)
* b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a UNKNOWN
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan commented on code in PR #11352:
URL: https://github.com/apache/hudi/pull/11352#discussion_r1618219678

File: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java

@@ -88,6 +88,9 @@ public static KeyGenerator createKeyGenerator(String keyGeneratorClass, TypedPro
        //Need to prevent overwriting the keygen for spark sql merge into because we need to extract
        //the recordkey from the meta cols if it exists. Sql keygen will use pkless keygen if needed.
        && !props.getBoolean(SPARK_SQL_MERGE_INTO_PREPPED_KEY, false);
+    if (autoRecordKeyGen) {

Review Comment:
`HoodieSparkKeyGeneratorFactory.createKeyGenerator` is the one called from the spark-sql layer as well as `HoodieSparkSqlWriter`.

https://github.com/apache/hudi/assets/513218/4d4862f5-c107-4979-8b89-32ed939e3b9f
https://github.com/apache/hudi/assets/513218/5d7f8f73-392d-4011-b662-86cfcf309e2e

So we should be good; at least these are the places where auto key gen will be wrapped.
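The guard being discussed can be sketched as follows. This is a minimal, self-contained illustration only: the `KeyGenerator` interface, the property names, and the synthetic-key wrapper below are simplified stand-ins for Hudi's actual `TypedProperties`, key generator classes, and `AutoRecordGenWrapperKeyGenerator`, not the real implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the auto-record-key-gen guard (stand-ins, not Hudi's real classes).
class KeyGenFactorySketch {
    static final String RECORD_KEY_FIELD = "hoodie.datasource.write.recordkey.field";
    static final String SPARK_SQL_MERGE_INTO_PREPPED_KEY = "spark.sql.merge.into.prepped";
    private static final AtomicLong AUTO_KEY_SEQ = new AtomicLong();

    interface KeyGenerator {
        String getRecordKey(Map<String, String> record);
    }

    static KeyGenerator createKeyGenerator(Map<String, String> props, KeyGenerator base) {
        // Auto key generation applies only when no record key field is configured and we are
        // not in the prepped MERGE INTO path, which must reuse keys from the meta columns.
        boolean autoRecordKeyGen = !props.containsKey(RECORD_KEY_FIELD)
            && !Boolean.parseBoolean(props.getOrDefault(SPARK_SQL_MERGE_INTO_PREPPED_KEY, "false"));
        if (autoRecordKeyGen) {
            // Wrap the base generator so every record gets a synthetic key.
            return record -> "auto_" + AUTO_KEY_SEQ.getAndIncrement();
        }
        return base;
    }
}
```

The point of routing both the spark-sql layer and `HoodieSparkSqlWriter` through one factory method is that the wrapping decision lives in exactly one place.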
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136556656

## CI report:

* 3c915d6faa29733255351f774c8332784bffd294 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136556626

## CI report:

* 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24106)
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
nsivabalan commented on PR #10668:
URL: https://github.com/apache/hudi/pull/10668#issuecomment-2136556906

@codope: ready for review. Fixed the test failure.
[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7808:
    Labels: pull-request-available  (was: )

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
>
>     Key: HUDI-7808
>     URL: https://issues.apache.org/jira/browse/HUDI-7808
>     Project: Apache Hudi
>     Issue Type: Improvement
>     Reporter: Ethan Guo
>     Assignee: Ethan Guo
>     Priority: Major
>     Labels: pull-request-available
>     Fix For: 1.0.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7808] Upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 [hudi]
yihua opened a new pull request, #11357:
URL: https://github.com/apache/hudi/pull/11357

### Change Logs

This PR upgrades io.acryl:datahub-client from 0.8.31 to 0.8.45 to fix security vulnerabilities.

Vulnerabilities that will be fixed with an upgrade:

| | Issue | Score | Upgrade |
|:-:|:-|:-|:-|
| ![critical severity](https://res.cloudinary.com/snyk/image/upload/w_20,h_20/v1561977819/icon/c.png 'critical severity') | Incorrect Implementation of Authentication Algorithm [SNYK-JAVA-IOACRYL-3092129](https://snyk.io/vuln/SNYK-JAVA-IOACRYL-3092129) (`Proof of Concept`) | **811** | io.acryl:datahub-client: `0.8.31` -> `0.8.45` |

### Impact

Security vulnerability fix.

### Risk level

none

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1618200904

File: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List
+    List<String> secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+    if (secondaryIndexPartitionsToInit.isEmpty()) {
+      continue;
+    }
+    ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
I see your point. Instead of failing, we should still continue to bootstrap other index types. I will change the logic a bit.
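The change the reviewer agrees to ("continue to bootstrap other index types instead of failing") could look roughly like the sketch below. The `bootstrapAll` helper and its `Consumer` callback are hypothetical simplifications for illustration, not Hudi's actual `HoodieBackedTableMetadataWriter` API: rather than a `checkState` that aborts the whole initialization, each pending index partition is attempted in turn and failures are collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: bootstrap each pending index partition independently, so one bad
// partition does not prevent the other index types from being initialized.
class SecondaryIndexBootstrapSketch {
    static List<String> bootstrapAll(List<String> partitionsToInit, Consumer<String> bootstrap) {
        List<String> failed = new ArrayList<>();
        for (String partition : partitionsToInit) {
            try {
                bootstrap.accept(partition);   // initialize one index partition
            } catch (RuntimeException e) {
                failed.add(partition);         // record the failure, keep going
            }
        }
        return failed;                          // caller can log or retry these
    }
}
```

Returning the failed partitions (instead of throwing on the first one) keeps the decision of whether a partial bootstrap is acceptable with the caller.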
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
yihua commented on code in PR #11352:
URL: https://github.com/apache/hudi/pull/11352#discussion_r1618200620

File: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java

@@ -88,6 +88,9 @@ public static KeyGenerator createKeyGenerator(String keyGeneratorClass, TypedPro
        //Need to prevent overwriting the keygen for spark sql merge into because we need to extract
        //the recordkey from the meta cols if it exists. Sql keygen will use pkless keygen if needed.
        && !props.getBoolean(SPARK_SQL_MERGE_INTO_PREPPED_KEY, false);
+    if (autoRecordKeyGen) {

Review Comment:
So `HoodieAvroKeyGeneratorFactory#createAvroKeyGeneratorByType` should be okay?
Re: [PR] [HUDI-7395] Fix computation for metrics in HoodieMetadataMetrics [hudi]
lokeshj1703 commented on PR #10641:
URL: https://github.com/apache/hudi/pull/10641#issuecomment-2136528074

@yihua can we merge this PR? The CI is green now.
Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]
hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136515317

## CI report:

* e872d7ce8dfa041b36cfa08808865279c9323592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)
* da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136515459

## CI report:

* 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136515383

## CI report:

* 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099)
* c65b89e9ece1ff465ab50a26c439193a08778a77 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24110)
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
hudi-bot commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136508670

## CI report:

* 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099)
* c65b89e9ece1ff465ab50a26c439193a08778a77 UNKNOWN
Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]
hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136508620

## CI report:

* e872d7ce8dfa041b36cfa08808865279c9323592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)
* da5fde85d00993e849321dcadf001fbcd524d1d0 UNKNOWN
Re: [PR] [MINOR] Archive operation only release lock when get [hudi]
hudi-bot commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136508351

## CI report:

* f1d9aeac0bb49519c41cfd03a2a78b7af0cf1f65 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23702)
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
xicm commented on PR #11346:
URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136504486

Thanks Danny, eric9204.
Re: [PR] [MINOR] Archive operation only release lock when get [hudi]
hudi-bot commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136501973

## CI report:

* f1d9aeac0bb49519c41cfd03a2a78b7af0cf1f65 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23702)
Re: [PR] [MINOR] Archive operation only release lock when get [hudi]
Ytimetravel commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136501065

@danny0405 Sorry, I just saw it. To my knowledge, it is not related to my modifications.
Re: [PR] [MINOR] Archive operation only release lock when get [hudi]
Ytimetravel commented on PR #11160:
URL: https://github.com/apache/hudi/pull/11160#issuecomment-2136488336

@hudi-bot run azure
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136468653

## CI report:

* 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24108)
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136468636

## CI report:

* 3c915d6faa29733255351f774c8332784bffd294 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24107)
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136468619

## CI report:

* e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
* 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24106)
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347:
URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136463394

## CI report:

* e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105)
* 9113b4b1c69edf45114a5a39aeb8ce5a9da77f29 UNKNOWN
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11356:
URL: https://github.com/apache/hudi/pull/11356#issuecomment-2136463503

## CI report:

* 8c4882f3dfd93ae917ce334b5c77b8cc5b466031 UNKNOWN
Re: [PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo serde [hudi]
hudi-bot commented on PR #11355:
URL: https://github.com/apache/hudi/pull/11355#issuecomment-2136463426

## CI report:

* 3c915d6faa29733255351f774c8332784bffd294 UNKNOWN
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7809: - Labels: hoodie-storage pull-request-available (was: hoodie-storage) > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > With Hudi 0.14.1, without > "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi > query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the > Kryo registrar, the Hudi read throws NPE due to HadoopStorageConfiguration. > {code:java} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:806) > at org.apache.spark.sql.Dataset.show(Dataset.scala:765) > at org.apache.spark.sql.Dataset.show(Dataset.scala:774) > ... 
47 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$Gener
[PR] [HUDI-7809] Use Spark SerializableConfiguration to avoid NPE in Kryo … [hudi]
yihua opened a new pull request, #11355: URL: https://github.com/apache/hudi/pull/11355 …serde ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
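The fix pattern this PR refers to, Spark's `SerializableConfiguration`, wraps a non-serializable Hadoop `Configuration` and rebuilds it during deserialization, so the wrapped field never arrives null on the other side. The sketch below illustrates that pattern outside Spark: it is a hypothetical Python stand-in using `pickle` in place of Kryo, and the `Config` class is a toy, not Hadoop's `Configuration`.

```python
import pickle

class Config:
    """Toy stand-in for Hadoop's Configuration, which is not serializable."""
    def __init__(self, props=None):
        self.props = dict(props or {})
    def __reduce__(self):
        # Direct serialization is forbidden, mimicking the real Configuration.
        raise TypeError("Configuration is not serializable")

class SerializableConfiguration:
    """Serialize only the plain properties and rebuild the Config on
    deserialization, so no field comes back null after serde (the NPE in
    HUDI-7809 stems from a configuration lost during Kryo serde)."""
    def __init__(self, conf):
        self.conf = conf
    def __getstate__(self):
        return {"props": self.conf.props}
    def __setstate__(self, state):
        self.conf = Config(state["props"])

wrapped = SerializableConfiguration(Config({"fs.defaultFS": "file:///"}))
restored = pickle.loads(pickle.dumps(wrapped))
print(restored.conf.props["fs.defaultFS"])  # file:///
```

Without the wrapper (or without registering `HoodieSparkKryoRegistrar`), serializing the configuration directly fails or silently yields a null field, which is what surfaces as the NPE in the stack trace above.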
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Description: With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registrar, the Hudi read throws NPE due to HadoopStorageConfiguration. {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPa
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Description: With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registrar, the {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.com
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
xuzifu666 commented on code in PR #11347: URL: https://github.com/apache/hudi/pull/11347#discussion_r1618120568 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: @danny0405 getCurrentOwnerLockInfo is not suitable because it is the file content string, not the path string, so keeping getLock: 2192894 [Thread-54214] WARN org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook 2192907 [Thread-62779] WARN org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook 2192907 [Thread-62781] WARN org.apache.hudi.common.table.log.HoodieLogFormatWriter [] - running logformatwriter hook [INFO] [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected: but was: [ERROR] TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected: but was: [ERROR] TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected: but was: [ERROR] TestHoodieClientMultiWriter.testHoodieClientBasicMultiWriter:374 expected: but was: [ERROR] Errors: [ERROR] TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict » IO [ERROR] TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution [ERROR] TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution [ERROR] TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict:570 » Execution [INFO] [ERROR] Tests run: 581, Failures: 4, Errors: 4, Skipped: 0
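The behavior debated in this review thread, deleting the stale lock file after a failed acquisition so the next writer does not fail too, can be sketched with a toy file-based lock. This is an illustrative Python model only; `FileSystemLockProvider` and `lock_with_cleanup` here are hypothetical names, not Hudi's Java classes.

```python
import os
import tempfile

class FileSystemLockProvider:
    """Toy file-based lock: the lock file existing means the lock is held."""
    def __init__(self, lock_path):
        self.lock_path = lock_path

    def try_lock(self):
        try:
            # O_EXCL makes creation atomic: fails if the file already exists.
            fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            return False

    def unlock(self):
        if os.path.exists(self.lock_path):
            os.remove(self.lock_path)

def lock_with_cleanup(provider):
    """On a failed acquisition, delete the (possibly stale) lock file so the
    *next* writer does not also fail -- the idea proposed in the PR."""
    if not provider.try_lock():
        provider.unlock()  # drop the stale lock file left by a dead writer
        return False
    return True

path = os.path.join(tempfile.mkdtemp(), "table.lock")
open(path, "w").close()  # simulate a leftover lock from a crashed writer
first = lock_with_cleanup(FileSystemLockProvider(path))   # fails, cleans up
second = lock_with_cleanup(FileSystemLockProvider(path))  # retry succeeds
print(first, second)  # False True
```

The trade-off the reviewers discuss applies here too: deleting on conflict assumes the holder is dead, so a real implementation must distinguish a stale lock from one held by a live, slow writer.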
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347: URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136415767 ## CI report: * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue [hudi]
Ssv-21 commented on issue #11349: URL: https://github.com/apache/hudi/issues/11349#issuecomment-2136389551 Hi Soumil, Thank you very much for your response. From the approaches that you suggested above: **Approach 1) Using DataSource or DeltaStreamer by setting _hoodie_is_deleted:** I didn't try this one because I don't want to send my entire data again. **Approach 2) Using 'hoodie.datasource.write.operation': 'delete':** I tried this one but was unable to delete the data; you can find the details in my question. But interestingly, these things caught my attention. ``` os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk@11' HUDI_VERSION = '0.14.0' SPARK_VERSION = '3.4' SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell" os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS os.environ['PYSPARK_PYTHON'] = sys.executable ``` Do I need to set these things in my Glue job? I am not using any additional jars or packages in my Glue job, just the following ones in the job parameters: `-- datalake-formats hudi` and ``` --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.sql.hive.convertMetastoreParquet=false' ``` and I am using the native Python, Spark, and Hudi versions supported in Glue 4.0.
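For reference, approach 2 is usually wired up in PySpark as below. The option keys are standard Hudi writer configs; the table name, key fields, and the commented-out write call are placeholders for this sketch, not values from the issue.

```python
# Hypothetical helper: assembles the Hudi writer options for a hard delete.
# The DataFrame passed to the write should contain (at least) the record keys
# of the rows to delete.
def hudi_delete_options(table_name, record_key_field, precombine_field):
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # The operation that turns this write into a hard delete:
        "hoodie.datasource.write.operation": "delete",
    }

opts = hudi_delete_options("my_table", "id", "ts")
# In a Spark job (Glue included), the call would look like:
# df_to_delete.write.format("hudi").options(**opts).mode("append").save(base_path)
print(opts["hoodie.datasource.write.operation"])  # delete
```

Note that deletes are issued with `mode("append")`, since they are upserts of tombstones rather than a table overwrite.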
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347: URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136380318 ## CI report: * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100) * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24105) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347: URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136372598 ## CI report: * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100) * e69b17536dbecfd425c4867d40d7bd6b04bfafd6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7779: -- Description: Archiving commits from the active timeline could lead to data consistency issues on the rarest of occasions. We should come up with proper guards to ensure we do not make such unintended archivals. The major gap we wanted to guard against is: if someone disabled the cleaner, archival should account for data consistency issues and ensure it bails out. We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for. a. Keeping aside replace commits, let's dive into the specifics for regular commits and delta commits. Say a user configured clean commits to 4 and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded until the earliest commit to retain based on the latest clean commits. Corner case to consider: a savepoint was added to, say, t3 and later removed, and still the cleaner was never re-enabled. Even though archival would have been stopped at t3 (while the savepoint is present), once the savepoint is removed, if archival is executed, it could archive commit t3. That means file versions tracked at t3 are still not yet cleaned by the cleaner. Reasoning: we are good here wrt data consistency. Until the cleaner runs next time, these older file versions might be exposed to the end-user. But a time travel query is not intended for already cleaned up commits, and hence this is not an issue. None of snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. 
Just that for the interim period, some older file versions might still be exposed to readers. b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean them up fully. But are there chances that this could go wrong? Corner case to consider: let's add onto the above scenario, where t3 has a savepoint, and t4 is a replace commit which replaced file groups tracked in t3. The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to retain will be pointing to t6. And say the savepoint for t3 is removed, but the cleaner was disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), it might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups. In other words, if we were to summarize the different scenarios: i. The replaced file group is never cleaned up: ECTR (earliest commit to retain) is less than this.rc, and we are good. ii. The replaced file group is cleaned up: ECTR is > this.rc, and it is good to archive. iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full clean up did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now. We have 3 options to solve this. Option A: Let the savepoint deletion flow take care of cleaning up the files it is tracking. Cons: the savepoint's responsibility is not removing any data files, so from a single-responsibility standpoint, this may not be right. Also, this clean up might need to do what a clean planner might actually be doing, i.e. 
build the file system view, check whether a file is supposed to be cleaned up already, and then clean up only the files that are supposed to be cleaned up. For example, if a file group has only one file slice, it should not be cleaned up, and there are other scenarios like this. Option B: Since archival is the one which might cause data consistency issues, why not have archival do the clean up? We need to account for concurrent cleans, failure and retry scenarios, etc. Also, we might need to build the file system view and then take a call on whether something needs to be cleaned up before archiving it. Cons: again, the single-responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files. Option C: Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another metadata field named "EarliestCommitToArchive". Strictly speaking, earliest commit to retain
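The guard this discussion is converging on can be illustrated with a toy timeline model: archival advances only up to the minimum of the cleaner's boundary and any replace commit whose replaced files are not yet fully cleaned. This is not Hudi code; instants are plain strings compared lexically, and the function name is hypothetical.

```python
# Illustrative only: how far may archival proceed, guarded both by the
# cleaner's earliest-commit-to-retain and by replace commits whose replaced
# file groups were skipped (e.g. due to a savepoint) and never cleaned.
def earliest_commit_to_archive(timeline, earliest_commit_to_retain,
                               uncleaned_replace_commits):
    boundary = min([earliest_commit_to_retain] + list(uncleaned_replace_commits))
    # Everything strictly before the boundary is safe to archive.
    return [t for t in timeline if t < boundary]

timeline = ["t1", "t2", "t3", "t4.rc", "t5", "t6"]
# The cleaner progressed to t6, but t4.rc's replaced files (savepointed t3)
# were never cleaned, so archival must stop before t4.rc:
print(earliest_commit_to_archive(timeline, "t6", ["t4.rc"]))  # ['t1', 't2', 't3']
```

With no pending replace-commit cleanup, the boundary degenerates to the existing guard (the earliest commit to retain), so this only tightens archival in the corner case described above.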
[jira] [Updated] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete
[ https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7655: -- Fix Version/s: 1.0.0 > Support configuration for clean to fail execution if there is at least one > file is marked as a failed delete > > > Key: HUDI-7655 > URL: https://issues.apache.org/jira/browse/HUDI-7655 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Krishen Bhan >Assignee: sivabalan narayanan >Priority: Minor > Labels: clean, pull-request-available > Fix For: 1.0.0 > > > When a HUDI clean plan is executed, any targeted file that was not confirmed > as deleted (or non-existing) will be marked as a "failed delete". Although > these failed deletes will be added to `.clean` metadata, if incremental clean > is used then these files might not ever be picked up again by a future clean > plan, unless a "full-scan" clean ends up being scheduled. In addition to > leading to more files unnecessarily taking up storage space for longer, this > can lead to the following dataset consistency issue for COW datasets: > # Insert at C1 creates file group f1 in partition > # Replacecommit at RC2 creates file group f2 in partition, and replaces f1 > # Any reader of partition that calls HUDI API (with or without using MDT) > will recognize that f1 should be ignored, as it has been replaced. This is > since RC2 instant file is in active timeline > # Some completed instants later, an incremental clean is scheduled. It moves > the "earliest commit to retain" to a time after instant time RC2, so it > targets f1 for deletion. But during execution of the plan, it fails to delete > f1. > # An archive job eventually is triggered, and archives C1 and RC2. 
Note that > f1 is still in partition > At this point, any job/query that reads the aforementioned partition directly > through DFS file system calls (without directly using MDT FILES partition) > will consider both f1 and f2 as valid file groups, since RC2 is no longer in > active timeline. This is a data consistency issue, and will only be resolved > if a "full-scan" clean is triggered and deletes f1. > This specific scenario can be avoided if the user can configure HUDI clean to > fail execution of a clean plan unless all files are confirmed as deleted (or > not existing in DFS already), "blocking" the clean. The next clean attempt > will re-execute this existing plan, since clean plans cannot be "rolled > back". -- This message was sent by Atlassian Jira (v8.20.10#820010)
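The proposed configuration would make the tail end of clean execution behave roughly like this sketch. All names here (`DeleteResult`, `finishClean`, the flag) are illustrative, not actual Hudi config keys or APIs:

```java
import java.util.List;

// Sketch of the proposed "blocking" clean: after executing a clean plan, if
// any targeted file could not be confirmed as deleted (and is not already
// absent), fail the clean instead of committing it. Since clean plans cannot
// be rolled back, the next clean attempt re-executes the same pending plan.
// Names are illustrative, not actual Hudi APIs.
class BlockingCleanSketch {

  static final class DeleteResult {
    final String path;
    final boolean confirmedDeleted; // deleted now, or already absent in DFS
    DeleteResult(String path, boolean confirmedDeleted) {
      this.path = path;
      this.confirmedDeleted = confirmedDeleted;
    }
  }

  static void finishClean(List<DeleteResult> results, boolean failOnFailedDelete) {
    long failed = results.stream().filter(r -> !r.confirmedDeleted).count();
    if (failed > 0 && failOnFailedDelete) {
      // Failing here prevents archival from ever running ahead of a file
      // (like f1 above) that is still physically present.
      throw new IllegalStateException(failed + " file(s) failed to delete; blocking clean");
    }
  }
}
```

This directly closes the gap in the scenario above: C1/RC2 cannot be archived until the clean that targets f1 actually succeeds.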
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809:

Description:
With 0.14
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.schedule
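The NPE comes from a Hadoop `Configuration` field that is not `java.io.Serializable` and comes back null after serde. Spark's `SerializableConfiguration` avoids this by holding the config in a transient field and writing its contents explicitly. The wrapper pattern looks roughly like this sketch, which uses `java.util.Properties` as a self-contained stand-in for the Hadoop class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch of the SerializableConfiguration pattern: keep the non-serializable
// object in a transient field and serialize its entries explicitly, so a
// deserialized copy never comes back with a null field (the NPE in the stack
// trace above). Properties stands in for Hadoop's Configuration here.
class SerializableConfSketch implements Serializable {
  private transient Properties conf;

  SerializableConfSketch(Properties conf) {
    this.conf = conf;
  }

  Properties value() {
    return conf;
  }

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    conf.store(buf, null);               // write the entries ourselves
    byte[] bytes = buf.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    conf = new Properties();
    conf.load(new ByteArrayInputStream(bytes)); // rebuild instead of leaving the field null
  }
}
```

In Spark the real class is `org.apache.spark.util.SerializableConfiguration`; wrapping the Hadoop config in it before shipping it into a task closure is the fix this ticket applies.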
[jira] [Assigned] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7809: --- Assignee: Ethan Guo > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major >
[jira] [Created] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
Ethan Guo created HUDI-7809: --- Summary: Use Spark SerializableConfiguration to avoid NPE in Kryo serde Key: HUDI-7809 URL: https://issues.apache.org/jira/browse/HUDI-7809 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Fix Version/s: 0.15.0 1.0.0 > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Labels: hoodie-storage (was: ) > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > >
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
xuzifu666 commented on code in PR #11347: URL: https://github.com/apache/hudi/pull/11347#discussion_r1618052644 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: > There is `getCurrentOwnerLockInfo` which can identify where the lock triggers from. Had changed to getCurrentOwnerLockInfo @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
xuzifu666 commented on code in PR #11347: URL: https://github.com/apache/hudi/pull/11347#discussion_r1618051269 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: In our case all lock files can be deleted successfully, so this approach works directly.
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
danny0405 commented on PR #11346: URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136344712 Thanks for the contribution, I have reviewed and applied a patch here: [7804.patch.zip](https://github.com/apache/hudi/files/15477417/7804.patch.zip)
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
KnightChess commented on code in PR #11347: URL: https://github.com/apache/hudi/pull/11347#discussion_r1618035707 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: @xuzifu666 it will cause data quality problems. How about setting a lock expire time? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: @danny0405 I think `getCurrentOwnerLockInfo` can only identify the lock trigger, but if the triggering job has failed, no one can delete the lock file. I think the better solution is to set an expire time.
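The expire-time alternative suggested in this review would look roughly like the following sketch: rather than deleting the lock file on every failed `tryLock` (which can break a live writer's lock), a lock is only treated as stale once it outlives a configured TTL. Names and the TTL value are illustrative, not actual Hudi configuration:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the lock-expiry idea: a lock file may only be force-deleted
// before a retry when it is older than its TTL. A live writer's lock is
// never deleted, which avoids the data quality problem noted above.
// Names are illustrative, not actual Hudi APIs or config keys.
class LockExpirySketch {

  static boolean isExpired(Instant lockCreatedAt, Instant now, Duration ttl) {
    return Duration.between(lockCreatedAt, now).compareTo(ttl) > 0;
  }
}
```

The lock-file creation time would come from the file's metadata (or a timestamp written into the lock file), so any contender can make the same stale-lock decision deterministically.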
Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]
hudi-bot commented on PR #11154: URL: https://github.com/apache/hudi/pull/11154#issuecomment-2136326212 ## CI report: * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN * 00b4e2d253b857c9387964eacfbfa46949a21580 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24092) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
hudi-bot commented on PR #11347: URL: https://github.com/apache/hudi/pull/11347#issuecomment-2136320922 ## CI report: * 9b21ef4f96ed2f7d488b21fd325822968f1e74dd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24100) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
hudi-bot commented on PR #11346: URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136320894 ## CI report: * 6ee4a5e6a866350f6a6116a37f09ec99feed2ab5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24099) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]
hudi-bot commented on PR #11265: URL: https://github.com/apache/hudi/pull/11265#issuecomment-2136320701 ## CI report: * 5699a637e30e3fafd056ab25d668640c5c8700c0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24101) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Closed] (HUDI-5505) Compaction NUM_COMMITS policy should only judge completed deltacommit
[ https://issues.apache.org/jira/browse/HUDI-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5505. Fix Version/s: 1.0.0 Reviewers: Danny Chen Resolution: Fixed Fixed via master branch: 42243862f0271fda16e70afdbfde61b47792ff70 > Compaction NUM_COMMITS policy should only judge completed deltacommit > - > > Key: HUDI-5505 > URL: https://issues.apache.org/jira/browse/HUDI-5505 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, table-service >Reporter: HunterXHunter >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: image-2023-01-05-13-10-57-918.png > > > `compaction.delta_commits =1` > > {code:java} > 20230105115229301.deltacommit > 20230105115229301.deltacommit.inflight > 20230105115229301.deltacommit.requested > 20230105115253118.commit > 20230105115253118.compaction.inflight > 20230105115253118.compaction.requested > 20230105115330994.deltacommit.inflight > 20230105115330994.deltacommit.requested{code} > The return result of `ScheduleCompactionActionExecutor.needCompact ` is > `true`, > This should not be expected. > > And in the `Occ` or `lazy clean` mode, this will cause compaction to trigger > early. > `compaction.delta_commits =3` > > {code:java} > 20230105125650541.deltacommit.inflight > 20230105125650541.deltacommit.requested > 20230105125715081.deltacommit > 20230105125715081.deltacommit.inflight > 20230105125715081.deltacommit.requested > 20230105130018070.deltacommit.inflight > 20230105130018070.deltacommit.requested {code} > > And compaction will be triggered; this should not be expected. > !image-2023-01-05-13-10-57-918.png|width=699,height=158! >
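The fix counts only completed deltacommits since the last compaction toward the NUM_COMMITS trigger. A tiny stand-in model of that filtering (not Hudi's actual `HoodieInstant`/timeline API):

```java
import java.util.List;

// Sketch of the fixed NUM_COMMITS check: only *completed* deltacommits since
// the last compaction count toward compaction.delta_commits, so requested or
// inflight deltacommits (e.g. under OCC or lazy failed-writes cleaning) no
// longer trigger compaction early. A stand-in model, not Hudi's API.
class NumCommitsPolicySketch {

  enum State { REQUESTED, INFLIGHT, COMPLETED }

  static final class TimelineInstant {
    final String action;
    final State state;
    TimelineInstant(String action, State state) {
      this.action = action;
      this.state = state;
    }
  }

  static boolean needCompact(List<TimelineInstant> sinceLastCompaction, int deltaCommitsThreshold) {
    long completed = sinceLastCompaction.stream()
        .filter(i -> i.action.equals("deltacommit") && i.state == State.COMPLETED)
        .count();
    return completed >= deltaCommitsThreshold;
  }
}
```

With `compaction.delta_commits = 3` and the second timeline above (one completed, two inflight deltacommits), this now correctly declines to schedule compaction.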
Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]
danny0405 merged PR #11251: URL: https://github.com/apache/hudi/pull/11251
(hudi) branch master updated: [HUDI-5505] Fix counting of delta commits since last compaction in ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 42243862f02 [HUDI-5505] Fix counting of delta commits since last compaction in ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)
42243862f02 is described below

commit 42243862f0271fda16e70afdbfde61b47792ff70
Author: Alexander Erofeev <81576532+a-erof...@users.noreply.github.com>
AuthorDate: Wed May 29 07:24:20 2024 +0700

    [HUDI-5505] Fix counting of delta commits since last compaction in ScheduleCompactionActionExecutor.getLatestDeltaCommitInfo (#11251)
---
 .../compact/ScheduleCompactionActionExecutor.java |  2 +-
 .../table/action/compact/TestHoodieCompactor.java | 34 +-
 .../apache/hudi/common/util/CompactionUtils.java  | 15 ++
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
index 941d93fd350..5dc0f067a3d 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
@@ -129,7 +129,7 @@ public class ScheduleCompactionActionExecutor extends BaseActionExec
   private Option> getLatestDeltaCommitInfo() {
     Option> deltaCommitsInfo =
-        CompactionUtils.getDeltaCommitsSinceLatestCompaction(table.getActiveTimeline());
+        CompactionUtils.getCompletedDeltaCommitsSinceLatestCompaction(table.getActiveTimeline());
     if (deltaCommitsInfo.isPresent()) {
       return Option.of(Pair.of(
           deltaCommitsInfo.get().getLeft().countInstants(),
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
index deda77376af..78086a64359 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
@@ -89,8 +89,12 @@ public class TestHoodieCompactor extends HoodieSparkClientTestHarness {
   }

   public HoodieWriteConfig getConfig() {
+    return getConfig(1);
+  }
+
+  public HoodieWriteConfig getConfig(int numCommitsBeforeCompaction) {
     return getConfigBuilder()
-        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withMaxNumDeltaCommitsBeforeCompaction(1).build())
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withMaxNumDeltaCommitsBeforeCompaction(numCommitsBeforeCompaction).build())
         .withMetricsConfig(getMetricsConfig())
         .build();
   }
@@ -181,6 +185,34 @@ public class TestHoodieCompactor extends HoodieSparkClientTestHarness {
     }
   }

+  @Test
+  public void testNeedCompactionCondition() throws Exception {
+    HoodieWriteConfig config = getConfig(3);
+    try (SparkRDDWriteClient writeClient = getHoodieWriteClient(config)) {
+      // insert 100 records.
+      String newCommitTime = "100";
+      writeClient.startCommitWithTime(newCommitTime);
+
+      // commit 1
+      List records = dataGen.generateInserts(newCommitTime, 100);
+      JavaRDD recordsRDD = jsc.parallelize(records, 1);
+      writeClient.insert(recordsRDD, newCommitTime).collect();
+
+      // commit 2
+      updateRecords(config, "101", records);
+
+      // commit 3 (inflight)
+      newCommitTime = "102";
+      writeClient.startCommitWithTime(newCommitTime);
+      metaClient.getActiveTimeline().transitionRequestedToInflight(new HoodieInstant(State.REQUESTED,
+          HoodieTimeline.DELTA_COMMIT_ACTION, newCommitTime), Option.empty());
+
+      // check that compaction will not be scheduled
+      String compactionTime = "107";
+      assertFalse(writeClient.scheduleCompactionAtInstant(compactionTime, Option.empty()));
+    }
+  }
+
   @Test
   public void testWriteStatusContentsAfterCompaction() throws Exception {
     // insert 100 records
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
index 321fdd0fce5..5c2270db919 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
@@ -274,6 +274,21 @@ public class CompactionUtils {
     return metaClient.getActiveTimeline().filterPendingCompactionTimeline().getInstants();
   }

+  /**
+   *
Re: [PR] [HUDI-7805] FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed [hudi]
danny0405 commented on code in PR #11347: URL: https://github.com/apache/hudi/pull/11347#discussion_r1618024145 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -81,12 +85,16 @@ public void lock() { metrics.startLockApiTimerContext(); if (!getLockProvider().tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS)) { metrics.updateLockNotAcquiredMetric(); + if (getLockProvider() instanceof FileSystemBasedLockProvider) { +fs.delete(new Path(((FileSystemBasedLockProvider) getLockProvider()).getLock()), true); Review Comment: There is `getCurrentOwnerLockInfo` which can identify where the lock triggers from.
Re: [PR] [HUDI-7804] improve flink bucket index paritioner [hudi]
danny0405 commented on PR #11346: URL: https://github.com/apache/hudi/pull/11346#issuecomment-2136302609 > Perhaps a Function named partitionIndexFunc can be added inside BucketIndexPartitioner. When constructing BucketIndexPartitioner, the calculation logic used to initialize partitionIndexFunc is determined according to the relationship between parallelism and bucketNum. Yeah, let's give it a try; at least the parallelism and bucket_num are constant values at runtime.
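The idea discussed above can be sketched as follows. This is purely illustrative of picking the mapping function once at construction time; the class, field names, and the particular mapping strategies are assumptions, not the actual Flink `BucketIndexPartitioner` code or the patch attached to this PR:

```java
import java.util.function.IntUnaryOperator;

// Sketch of the suggestion: since parallelism and bucketNum are constants at
// runtime, the bucket-to-subtask mapping can be chosen once when the
// partitioner is constructed instead of being re-decided for every record.
// Names and strategies are illustrative, not the actual Hudi/Flink code.
class BucketIndexPartitionerSketch {

  final IntUnaryOperator partitionIndexFunc;

  BucketIndexPartitionerSketch(int parallelism, int bucketNum) {
    this.partitionIndexFunc = (bucketNum % parallelism == 0)
        // Buckets divide evenly: contiguous ranges of buckets per subtask.
        ? bucketId -> bucketId / (bucketNum / parallelism)
        // Otherwise fall back to folding buckets onto subtasks by modulo.
        : bucketId -> bucketId % parallelism;
  }

  int partition(int bucketId) {
    return partitionIndexFunc.applyAsInt(bucketId);
  }
}
```

Because the branch is resolved in the constructor, the per-record path is a single function call with no conditional.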
(hudi) branch master updated: [HUDI-7795] Fix loading of input splits from look up table reader (#11297)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 2291e995feb [HUDI-7795] Fix loading of input splits from look up table reader (#11297)
2291e995feb is described below

commit 2291e995feb3027ac35305383518e4bd6677b9b1
Author: Danny Chan
AuthorDate: Wed May 29 08:17:05 2024 +0800

    [HUDI-7795] Fix loading of input splits from look up table reader (#11297)
---
 .../apache/hudi/table/lookup/HoodieLookupFunction.java |  2 +-
 .../hudi/table/lookup/HoodieLookupTableReader.java     | 18 --
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
index fe32a439f72..a43bf1189fb 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupFunction.java
@@ -47,7 +47,7 @@ import java.util.List;
 import java.util.Map;

 /**
- * Lookup function for filesystem connector tables.
+ * Lookup function for Hoodie dimension table.
  *
  * Note: reference Flink FileSystemLookupFunction to avoid additional connector jar dependencies.
  */
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
index 0460cd42691..642a03ee65b 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/lookup/HoodieLookupTableReader.java
@@ -29,6 +29,9 @@ import org.jetbrains.annotations.Nullable;

 import java.io.IOException;
 import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;

 /**
  * Hudi look up table reader.
@@ -41,6 +44,8 @@ public class HoodieLookupTableReader implements Serializable {

   private InputFormat inputFormat;

+  private List inputSplits;
+
   public HoodieLookupTableReader(SerializableSupplier> inputFormatSupplier, Configuration conf) {
     this.inputFormatSupplier = inputFormatSupplier;
     this.conf = conf;
@@ -49,15 +54,24 @@ public class HoodieLookupTableReader implements Serializable {
   public void open() throws IOException {
     this.inputFormat = inputFormatSupplier.get();
     inputFormat.configure(conf);
-    InputSplit[] inputSplits = inputFormat.createInputSplits(1);
+    this.inputSplits = Arrays.stream(inputFormat.createInputSplits(1)).collect(Collectors.toList());
     ((RichInputFormat) inputFormat).openInputFormat();
-    inputFormat.open(inputSplits[0]);
+    inputFormat.open(inputSplits.remove(0));
   }

   @Nullable
   public RowData read(RowData reuse) throws IOException {
     if (!inputFormat.reachedEnd()) {
       return (RowData) inputFormat.nextRecord(reuse);
+    } else {
+      while (!inputSplits.isEmpty()) {
+        // release the last itr first.
+        inputFormat.close();
+        inputFormat.open(inputSplits.remove(0));
+        if (!inputFormat.reachedEnd()) {
+          return (RowData) inputFormat.nextRecord(reuse);
+        }
+      }
     }
     return null;
   }
Re: [PR] [HUDI-7795] Fix loading of input splits from look up table reader [hudi]
danny0405 merged PR #11297: URL: https://github.com/apache/hudi/pull/11297
[jira] [Assigned] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source
[ https://issues.apache.org/jira/browse/HUDI-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Zhang reassigned HUDI-7806: - Assignee: Davis Zhang > Skip fail on data-loss for first commit on Kafka Source > --- > > Key: HUDI-7806 > URL: https://issues.apache.org/jira/browse/HUDI-7806 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Davis Zhang >Assignee: Davis Zhang >Priority: Major > > When the ingestion attempts to start from the beginning of the topic, we > should not fail on data loss since topic retention can cause failures when > some data is removed before our ingestion is able to fully read the offsets.
[jira] [Created] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
Ethan Guo created HUDI-7808: --- Summary: Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 Key: HUDI-7808 URL: https://issues.apache.org/jira/browse/HUDI-7808 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7808: Fix Version/s: 1.0.0 > Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 > -- > > Key: HUDI-7808 > URL: https://issues.apache.org/jira/browse/HUDI-7808 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
[jira] [Assigned] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-7808:
-------------------------------
    Assignee: Ethan Guo

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --------------------------------------------------------------
>
>                 Key: HUDI-7808
>                 URL: https://issues.apache.org/jira/browse/HUDI-7808
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 1.0.0
>
Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]
hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2136272540

   ## CI report:

   * e872d7ce8dfa041b36cfa08808865279c9323592 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24097)

   Bot commands

   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]
hudi-bot commented on PR #11224:
URL: https://github.com/apache/hudi/pull/11224#issuecomment-2136272306

   ## CI report:

   * 0c01b0781e8c49da0f07a2379050c2be204cf373 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24094)

   Bot commands

   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]
hudi-bot commented on PR #11251:
URL: https://github.com/apache/hudi/pull/11251#issuecomment-2136272369

   ## CI report:

   * b1dc1229697ca188c09318a32c90d868c88d70f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24095)

   Bot commands

   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7507] Adding timestamp ordering validation before creating requested timeli… [hudi]
hudi-bot commented on PR #11344:
URL: https://github.com/apache/hudi/pull/11344#issuecomment-2136272563

   ## CI report:

   * 0397e80a3f71a4c9180a08cdd03ad16d7f313661 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24098)

   Bot commands

   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan closed pull request #11351: [HUDI-7807] Fixing spark-sql for pk less tables
URL: https://github.com/apache/hudi/pull/11351
Re: [PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan closed pull request #11353: [HUDI-7807] Fixing spark-sql for pk less tables
URL: https://github.com/apache/hudi/pull/11353
[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan opened a new pull request, #11354:
URL: https://github.com/apache/hudi/pull/11354

   ### Change Logs

   Fixing spark-sql for pk less tables

   ### Impact

   Fixing spark-sql for pk less tables

   ### Risk level (write none, low medium or high below)

   low

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan opened a new pull request, #11353:
URL: https://github.com/apache/hudi/pull/11353

   ### Change Logs

   Fixing spark-sql for pk less tables

   ### Impact

   Fixing spark-sql for pk less tables

   ### Risk level (write none, low medium or high below)

   low

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
Re: [PR] [MINOR] avoid listing files for empty tables [hudi]
hudi-bot commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2136232508

   ## CI report:

   * c62bc211274fbe2b31dd8d07d7ede8ecae5f6d64 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24093)

   Bot commands

   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan opened a new pull request, #11352:
URL: https://github.com/apache/hudi/pull/11352

   ### Change Logs

   Fixing spark-sql for pk less tables

   ### Impact

   Fixing spark-sql for pk less tables

   ### Risk level (write none, low medium or high below)

   low

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7807:
---------------------------------
    Labels: pull-request-available  (was: )

> spark-sql updates for a pk less table fails w/ partitioned table
> ----------------------------------------------------------------
>
>                 Key: HUDI-7807
>                 URL: https://issues.apache.org/jira/browse/HUDI-7807
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table.
>
> {code:java}
> > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at org.apache.spark.sql.execution.QueryExecution.com
[PR] [HUDI-7807] Fixing spark-sql for pk less tables [hudi]
nsivabalan opened a new pull request, #11351:
URL: https://github.com/apache/hudi/pull/11351

   ### Change Logs

   Fixing spark-sql for pk less tables.

   ### Impact

   Fixing spark-sql for pk less tables

   ### Risk level (write none, low medium or high below)

   low

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
[jira] [Created] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
sivabalan narayanan created HUDI-7807:
-----------------------------------------

             Summary: spark-sql updates for a pk less table fails w/ partitioned table
                 Key: HUDI-7807
                 URL: https://issues.apache.org/jira/browse/HUDI-7807
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark-sql
            Reporter: sivabalan narayanan


quick start fails when trying to UPDATE with spark-sql for a pk less table.

{code:java}
> UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D']
org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
  at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
  at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
  at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
  at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
  at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
  at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
  at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
  at org.apache.spark.sql.DataFrameWr
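The trace above fails while reflectively instantiating a key generator that presumes a record-key field exists. A fix along the lines the PRs describe would guard that choice on whether the table actually declares a primary key. The sketch below is purely illustrative — the class, method, and return values are hypothetical and do not reflect Hudi's actual key-generator factory.

```java
// Hypothetical sketch: choose a key-generation strategy based on whether a
// record-key field is configured, instead of unconditionally instantiating a
// generator that requires one. Names are illustrative, not Hudi's API.
public class KeyGenSelector {
  public static String selectKeyGenerator(String recordKeyField) {
    if (recordKeyField == null || recordKeyField.trim().isEmpty()) {
      // pk-less table: avoid a generator that would throw without a key field
      return "keyless";
    }
    return "simple";
  }
}
```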
[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7807:
--------------------------------------
    Fix Version/s: 0.15.0
                   1.0.0

> spark-sql updates for a pk less table fails w/ partitioned table
> ----------------------------------------------------------------
>
>                 Key: HUDI-7807
>                 URL: https://issues.apache.org/jira/browse/HUDI-7807
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.15.0, 1.0.0
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table.
>
> {code:java}
> > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.s
[jira] [Assigned] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7807:
-----------------------------------------
    Assignee: sivabalan narayanan

> spark-sql updates for a pk less table fails w/ partitioned table
> ----------------------------------------------------------------
>
>                 Key: HUDI-7807
>                 URL: https://issues.apache.org/jira/browse/HUDI-7807
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table.
>
> {code:java}
> > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
>   at org.apache.spark.sql.execu
[jira] [Created] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source
Davis Zhang created HUDI-7806:
---------------------------------

             Summary: Skip fail on data-loss for first commit on Kafka Source
                 Key: HUDI-7806
                 URL: https://issues.apache.org/jira/browse/HUDI-7806
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Davis Zhang


When the ingestion attempts to start from the beginning of the topic, we should not fail on data loss since topic retention can cause failures when some data is removed before our ingestion is able to fully read the offsets.
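The reasoning in HUDI-7806 can be sketched as a small predicate: on the first commit there is no prior checkpoint, so a start offset that has already been aged out by topic retention is expected rather than data loss; on later commits, retained offsets ahead of the checkpoint do indicate loss. This is an illustrative sketch only — the class and method names below are hypothetical, not Hudi's Kafka source API.

```java
// Hypothetical sketch of the fail-on-data-loss decision described in HUDI-7806.
// Names are illustrative; the real logic lives in Hudi's Kafka offset handling.
public class DataLossCheck {
  /**
   * @param hasPriorCheckpoint false on the very first commit of the ingestion
   * @param checkpointOffset   offset recorded by the last commit (ignored when none)
   * @param earliestAvailable  earliest offset still retained by the topic
   * @return whether the reader should fail on data loss
   */
  public static boolean shouldFailOnDataLoss(boolean hasPriorCheckpoint,
                                             long checkpointOffset,
                                             long earliestAvailable) {
    if (!hasPriorCheckpoint) {
      // First commit: start from whatever the topic still retains.
      return false;
    }
    // Subsequent commits: anything between the checkpoint and the earliest
    // retained offset was dropped by retention before it could be read.
    return checkpointOffset < earliestAvailable;
  }
}
```

Under this sketch, only the checkpointed case with a gap triggers a failure; the first commit never does.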
(hudi) branch master updated: [MINOR] Improve TestHoodieHadoopIOFactory (#11342)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new cb9daa6c227  [MINOR] Improve TestHoodieHadoopIOFactory (#11342)
cb9daa6c227 is described below

commit cb9daa6c227dc42415aee71db0499935db4e1599
Author: Y Ethan Guo
AuthorDate: Tue May 28 15:07:02 2024 -0700

    [MINOR] Improve TestHoodieHadoopIOFactory (#11342)
---
 .../hudi/io/hadoop/TestHoodieHadoopIOFactory.java  | 37 +++++++++++++----------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java b/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
index 86341b3c963..7aaf811e737 100644
--- a/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
+++ b/hudi-hadoop-common/src/test/java/org/apache/hudi/io/hadoop/TestHoodieHadoopIOFactory.java
@@ -25,11 +25,14 @@ import org.apache.hudi.common.util.OrcUtils;
 import org.apache.hudi.common.util.ParquetUtils;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
 import org.apache.hudi.io.storage.HoodieIOFactory;
+import org.apache.hudi.storage.HoodieStorage;
 import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.storage.hadoop.HoodieHadoopStorage;
 
 import org.junit.jupiter.api.Test;
 
+import java.io.IOException;
+
 import static org.apache.hudi.common.testutils.HoodieTestUtils.getDefaultStorageConf;
 import static org.apache.hudi.storage.HoodieStorageUtils.DEFAULT_URI;
 import static org.junit.jupiter.api.Assertions.assertThrows;
@@ -41,21 +44,23 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
  */
 public class TestHoodieHadoopIOFactory {
   @Test
-  public void testGetFileFormatUtils() {
-    HoodieIOFactory ioFactory = new HoodieHadoopIOFactory(
-        new HoodieHadoopStorage(HadoopFSUtils.getFs(DEFAULT_URI, getDefaultStorageConf())));
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.parquet")) instanceof ParquetUtils);
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.orc")) instanceof OrcUtils);
-    assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.hfile")) instanceof HFileUtils);
-    assertThrows(
-        UnsupportedOperationException.class,
-        () -> ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.log")));
-
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.PARQUET) instanceof ParquetUtils);
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.ORC) instanceof OrcUtils);
-    assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.HFILE) instanceof HFileUtils);
-    assertThrows(
-        UnsupportedOperationException.class,
-        () -> ioFactory.getFileFormatUtils(HoodieFileFormat.HOODIE_LOG));
+  public void testGetFileFormatUtils() throws IOException {
+    try (HoodieStorage storage =
+        new HoodieHadoopStorage(HadoopFSUtils.getFs(DEFAULT_URI, getDefaultStorageConf()))) {
+      HoodieIOFactory ioFactory = new HoodieHadoopIOFactory(storage);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.parquet")) instanceof ParquetUtils);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.orc")) instanceof OrcUtils);
+      assertTrue(ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.hfile")) instanceof HFileUtils);
+      assertThrows(
+          UnsupportedOperationException.class,
+          () -> ioFactory.getFileFormatUtils(new StoragePath("file:///a/b.log")));
+
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.PARQUET) instanceof ParquetUtils);
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.ORC) instanceof OrcUtils);
+      assertTrue(ioFactory.getFileFormatUtils(HoodieFileFormat.HFILE) instanceof HFileUtils);
+      assertThrows(
+          UnsupportedOperationException.class,
+          () -> ioFactory.getFileFormatUtils(HoodieFileFormat.HOODIE_LOG));
+    }
   }
 }
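The commit's core change is wrapping the HoodieStorage handle in try-with-resources so it is closed even when an assertion throws. A minimal standalone sketch of that pattern (FakeStorage is a hypothetical stand-in here, not a Hudi class; HoodieStorage is likewise AutoCloseable):

```java
// Sketch of the try-with-resources pattern applied by the commit.
public class Main {
    // Hypothetical resource that records whether close() was called.
    static class FakeStorage implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        FakeStorage observed;
        // A resource declared in the try header is closed automatically when
        // the block exits, whether normally or via an exception.
        try (FakeStorage storage = new FakeStorage()) {
            observed = storage;
            // ... use the resource, as the test uses HoodieStorage ...
        }
        System.out.println("closed=" + observed.closed);
    }
}
```

Running this prints `closed=true`, confirming the automatic close that the original test (which never closed the storage it constructed) did not get.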
Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]
yihua commented on PR #11342:
URL: https://github.com/apache/hudi/pull/11342#issuecomment-2136184055

   Azure CI is green.
   (screenshot: https://github.com/apache/hudi/assets/2497195/55f61eb9-30cf-419a-954e-69bada67862c)

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]
yihua merged PR #11342:
URL: https://github.com/apache/hudi/pull/11342
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2136179574

   ## CI report:

   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 05041565a628a2883ec642ecdfcddf0d66f4e7ce Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24090)

   Bot commands
   @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Improve TestHoodieHadoopIOFactory [hudi]
hudi-bot commented on PR #11342:
URL: https://github.com/apache/hudi/pull/11342#issuecomment-2136172348

   ## CI report:

   * ba8bc967911aef6100211a064e5f8c98a1c62da9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24096)
   * ba8bc96 UNKNOWN
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2136163343

   ## CI report:

   * 98d96be857a9e87e74a4dee99cb31b162ade5aa3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24091)
Re: [PR] [HUDI-7774] Add Avro Logical type support for Merciful Java convertor [hudi]
hudi-bot commented on PR #11265:
URL: https://github.com/apache/hudi/pull/11265#issuecomment-2136127128

   ## CI report:

   * 134dda251bbfac3cc092ec3f64b3cf3c8b17e2e1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24017)
   * 5699a637e30e3fafd056ab25d668640c5c8700c0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24101)