[jira] [Updated] (HUDI-7415) OLAP query need support read data from origin table by default
[ https://issues.apache.org/jira/browse/HUDI-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xy updated HUDI-7415:
---------------------
    Description:
OLAP queries should be able to read data from the origin table by default. For example, when querying from an OLAP engine such as StarRocks or Presto, we can only read data from the ro/rt sub-tables and get an empty result from the origin table, which is not suitable. Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)

was:
OLAP queries should be able to read data from the origin table by default. For example, when querying from an OLAP engine such as StarRocks or Presto, we can only read data from the ro/rt sub-tables and get an empty result from the origin table, which is not suitable. Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)

> OLAP query need support read data from origin table by default
> --------------------------------------------------------------
>
>                 Key: HUDI-7415
>                 URL: https://issues.apache.org/jira/browse/HUDI-7415
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: xy
>            Assignee: xy
>            Priority: Major
>
> OLAP queries should be able to read data from the origin table by default. For example, when querying from an OLAP engine such as StarRocks or Presto, we can only read data from the ro/rt sub-tables and get an empty result from the origin table, which is not suitable. Querying a MOR table with StarRocks:
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
[jira] [Created] (HUDI-7415) OLAP query need support read data from origin table by default
xy created HUDI-7415:
---------------------

             Summary: OLAP query need support read data from origin table by default
                 Key: HUDI-7415
                 URL: https://issues.apache.org/jira/browse/HUDI-7415
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: xy
            Assignee: xy

OLAP queries should be able to read data from the origin table by default. For example, when querying from an OLAP engine such as StarRocks or Presto, we can only read data from the ro/rt sub-tables and get an empty result from the origin table, which is not suitable. Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
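For background on why the `_ro` and `_rt` sub-tables of a MOR table can return the same row while behaving differently after updates: the read-optimized view reads only compacted base files, while the real-time (snapshot) view merges log records on top of them. The following is a minimal standalone sketch of those semantics only; it is not Hudi's implementation, and the record shapes are invented for illustration.

```python
# Hedged sketch (not Hudi code): models a MOR table's two read views.
# Base files and log files are represented as dicts keyed by record key.

def read_optimized(base_files):
    # "_ro" view: return records from base (columnar) files only;
    # updates that are still in log files are invisible here.
    merged = {}
    for f in base_files:
        merged.update(f)
    return merged

def real_time(base_files, log_files):
    # "_rt" view: start from the base files, then replay log records
    # in order so that later updates/inserts win.
    records = read_optimized(base_files)
    for log in log_files:
        records.update(log)
    return records

base = [{"key1": {"val": "a"}}]
logs = [{"key1": {"val": "a2"}, "key2": {"val": "b"}}]
print(read_optimized(base))   # base-file state only
print(real_time(base, logs))  # base state with log updates applied
```

The issue above asks that a query against the origin table name resolve to one of these views by default instead of returning an empty set.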
Re: [PR] [MINOR] Remove hive bugs [hudi]
hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947849437

   ## CI report:

   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22474)

   Bot commands
   @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]
hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947849397

   ## CI report:

   * a8a65546c774415a5953a50f75442d9a9b558067 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22471)
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947849223

   ## CI report:

   * 804d73922a136f6fed0fdcc559bfb697bda4942e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22473)
Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]
ad1happy2go commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1947804747

   Had a working session with @maheshguptags. We were able to consistently reproduce this with a composite key in his setup, although I couldn't reproduce it in my setup, so the issue is intermittent.

   @yihua Can you please check .hoodie (attached), as you requested?
   [hoodie.zip](https://github.com/apache/hudi/files/14307039/hoodie.zip)
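For readers following the composite-key discussion above: Hudi's `ComplexKeyGenerator` builds a record key from several fields, joined as `field:value` pairs. A minimal sketch of that key format (an approximation for illustration, not Hudi's actual key generator; the record fields below are invented):

```python
# Hedged sketch: approximates how a composite record key is rendered from
# multiple key fields, e.g. "id:1,region:de". Not Hudi's implementation.
def composite_record_key(record, key_fields):
    # Join each configured key field as "name:value", comma-separated.
    return ",".join(f"{f}:{record[f]}" for f in key_fields)

rec = {"id": 1, "region": "de", "ts": 20230522}
print(composite_record_key(rec, ["id", "region"]))  # id:1,region:de
```

Issues like the one above that only reproduce with composite keys often come down to how such concatenated keys are parsed or hashed downstream, which is why the key shape matters when reproducing.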
Re: [PR] [MINOR] Remove hive bugs [hudi]
hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947768029

   ## CI report:

   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22474)
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947767855

   ## CI report:

   * 50f21651c6c21b8a72c43247503b4d900d06a11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22418)
   * 804d73922a136f6fed0fdcc559bfb697bda4942e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22473)
Re: [PR] [MINOR] Remove hive bugs [hudi]
hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947762759

   ## CI report:

   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 UNKNOWN
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947762599

   ## CI report:

   * 50f21651c6c21b8a72c43247503b4d900d06a11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22418)
   * 804d73922a136f6fed0fdcc559bfb697bda4942e UNKNOWN
(hudi) branch master updated: [MINOR] Clarify config descriptions (#10681)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 9da1f2b15e2  [MINOR] Clarify config descriptions (#10681)
9da1f2b15e2 is described below

commit 9da1f2b15e2bf873a7d3db56dbc0183479c38c4c
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Feb 15 20:39:30 2024 -0800

    [MINOR] Clarify config descriptions (#10681)

    This aligns with the doc change here: https://github.com/apache/hudi/pull/10680
---
 .../src/main/scala/org/apache/hudi/DataSourceOptions.scala | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index 99080629e17..47a7c61a60f 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -500,7 +500,9 @@ object DataSourceWriteOptions {
     .defaultValue("false")
     .markAdvanced()
     .withDocumentation("If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. " +
-      "This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.");
+      " **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` instead " +
+      "for a simplified duplicate handling experience. The new config will be incorporated into all other writing flows and this config will be fully deprecated " +
+      "in future releases.");

   val PARTITIONS_TO_DELETE: ConfigProperty[String] = ConfigProperty
     .key("hoodie.datasource.write.partitions.to.delete")
@@ -597,7 +599,7 @@ object DataSourceWriteOptions {
     .withValidValues(NONE_INSERT_DUP_POLICY, DROP_INSERT_DUP_POLICY, FAIL_INSERT_DUP_POLICY)
     .markAdvanced()
     .sinceVersion("0.14.0")
-    .withDocumentation("When operation type is set to \"insert\", users can optionally enforce a dedup policy. This policy will be employed " +
+    .withDocumentation("**Note** This is only applicable to Spark SQL writing. When operation type is set to \"insert\", users can optionally enforce a dedup policy. This policy will be employed " +
       " when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose " +
       " \"drop\", on which matching records from incoming will be dropped and the rest will be ingested. Third option is \"fail\" which will " +
       "fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy " +
Re: [PR] [MINOR][DOCS] Clarify config descriptions [hudi]
nsivabalan merged PR #10681:
URL: https://github.com/apache/hudi/pull/10681
Re: [PR] [MINOR][DOCS] Clarify config descriptions [hudi]
nsivabalan commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947745524

   https://github.com/apache/hudi/assets/513218/9190f75d-c679-47b4-beca-626cd3818499
[PR] MINOR_Remove_hive_bugs [hudi]
linliu-code opened a new pull request, #10684:
URL: https://github.com/apache/hudi/pull/10684

   ### Change Logs

   Try to remove hive.

   ### Impact

   None.

   ### Risk level (write none, low medium or high below)

   None.

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
Re: [PR] initial commit to update doris docs [hudi]
nfarah86 commented on PR #10683:
URL: https://github.com/apache/hudi/pull/10683#issuecomment-1947735093

   This PR is not ready yet; waiting for Doris to confirm the details.
[PR] initial commit to update doris docs [hudi]
nfarah86 opened a new pull request, #10683:
URL: https://github.com/apache/hudi/pull/10683

   ### Change Logs

   _Describe context and summary for this change. Highlight if any code was copied._

   Update the Doris doc with compatibility.

   ### Impact

   _Describe any public API or user-facing feature change or any performance impact._

   ### Risk level (write none, low medium or high below)

   None.

   _If medium or high, explain what verification was done to mitigate the risks._

   Low.

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   Update the Doris doc.

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]
hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947722192

   ## CI report:

   * a8a65546c774415a5953a50f75442d9a9b558067 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22471)
Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]
hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947717851

   ## CI report:

   * a8a65546c774415a5953a50f75442d9a9b558067 UNKNOWN
[PR] [MINOR] Cleanup FileSystemViewManager code [hudi]
voonhous opened a new pull request, #10682:
URL: https://github.com/apache/hudi/pull/10682

   ### Change Logs

   Cleaning up `FileSystemViewManager#createViewManager`-related code that passes around a Hadoop Configuration that is never used.

   Added a docstring to indicate that the `#init` function in `HoodieTableMetaClient` is used for tests.

   ### Impact

   None

   ### Risk level (write none, low medium or high below)

   None

   ### Documentation Update

   None

   ### Contributor's checklist

   - [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [X] Change Logs and Impact were stated clearly
   - [X] Adequate tests were added if applicable
   - [ ] CI passed
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947650248

   ## CI report:

   * cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22467)
Re: [PR] [MINOR] Clarify config descriptions [hudi]
hudi-bot commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947645303

   ## CI report:

   * 5edfc37412400c5d01c154b122e7fd41491b7a86 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22470)
Re: [PR] [MINOR] Clarify config descriptions [hudi]
hudi-bot commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947640295

   ## CI report:

   * 5edfc37412400c5d01c154b122e7fd41491b7a86 UNKNOWN
[PR] [MINOR] Clarify config descriptions [hudi]
bhasudha opened a new pull request, #10681:
URL: https://github.com/apache/hudi/pull/10681

   This aligns with the doc change here: [10680](https://github.com/apache/hudi/pull/10680)

   ### Change Logs

   _Describe context and summary for this change. Highlight if any code was copied._

   ### Impact

   _Describe any public API or user-facing feature change or any performance impact._

   ### Risk level (write none, low medium or high below)

   _If medium or high, explain what verification was done to mitigate the risks._

   ### Documentation Update

   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_

   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

   ### Contributor's checklist

   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
(hudi) branch asf-site updated: [DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 7f125b6f310  [DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)
7f125b6f310 is described below

commit 7f125b6f3107fba9070f7e2c20fc58fbef564392
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Feb 15 17:05:39 2024 -0800

    [DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)
---
 website/docs/configurations.md                     | 104 ++---
 website/releases/release-0.14.0.md                 |   8 +-
 .../version-0.14.0/configurations.md               |   4 +-
 .../version-0.14.1/configurations.md               |   4 +-
 4 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 01ef8401954..18c3581e305 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -127,59 +127,59 @@ Options useful for writing tables via `write.format.option(...)`

 [**Advanced Configs**](#Write-Options-advanced-configs)

-| Config Name | Default | Description [...]
-| | | [...]
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.`Config Param: HIVE_TABLE_SERDE_PROPERTIES` [...]
-| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.`Config Param: HIVE_TABLE_PROPERTIES` [...]
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode`Config Param: OVERWRITE_MODE``Since Version: 0.14.0` [...]
-| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *`Config Param: PARTITIONS_TO_DELETE` [...]
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.`Config Param: TABLE_NAME` [...]
-| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true
Re: [PR] [DOCS] Clarify release notes on duplicate handling in Spark SQL and r… [hudi]
bhasudha merged PR #10680:
URL: https://github.com/apache/hudi/pull/10680
(hudi) branch master updated: [HUDI-7381] Fix flaky test introduced in PR 10619 (#10674)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new fe488bc1b64  [HUDI-7381] Fix flaky test introduced in PR 10619 (#10674)
fe488bc1b64 is described below

commit fe488bc1b649f1a9f90fcc178923ee12be3ce90f
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Thu Feb 15 16:40:56 2024 -0800

    [HUDI-7381] Fix flaky test introduced in PR 10619 (#10674)

    Co-authored-by: rmahindra123
---
 .../table/action/compact/TestHoodieCompactor.java | 21 +++++++++++----------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
index 313f14ce989..4ad19bfbfc4 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
@@ -195,19 +195,18 @@ public class TestHoodieCompactor extends HoodieSparkClientTestHarness {
     String newCommitTime = "100";
     writeClient.startCommitWithTime(newCommitTime);

-    List records = dataGen.generateInserts(newCommitTime, 100);
+    List records = dataGen.generateInserts(newCommitTime, 1000);
     JavaRDD recordsRDD = jsc.parallelize(records, 1);
     writeClient.insert(recordsRDD, newCommitTime).collect();

-    // Update all the 100 records
-    newCommitTime = "101";
-    updateRecords(config, newCommitTime, records);
-
-    assertLogFilesNumEqualsTo(config, 1);
-
-    String compactionInstantTime = "102";
-    HoodieData result = compact(writeClient, compactionInstantTime);
-
+    // Update all the 1000 records across 5 commits to generate sufficient log files.
+    int i = 1;
+    for (; i < 5; i++) {
+      newCommitTime = String.format("10%s", i);
+      updateRecords(config, newCommitTime, records);
+      assertLogFilesNumEqualsTo(config, i);
+    }
+    HoodieData result = compact(writeClient, String.format("10%s", i));
     verifyCompaction(result);

     // Verify compaction.requested, compaction.completed metrics counts.
@@ -243,7 +242,6 @@ public class TestHoodieCompactor extends HoodieSparkClientTestHarness {
     assertLogFilesNumEqualsTo(config, 1);

     HoodieData result = compact(writeClient, "10" + (i + 1));
-
     verifyCompaction(result);

     // Verify compaction.requested, compaction.completed metrics counts.
@@ -304,7 +302,6 @@ public class TestHoodieCompactor extends HoodieSparkClientTestHarness {
     for (String partitionPath : dataGen.getPartitionPaths()) {
       assertTrue(writeStatuses.stream().anyMatch(writeStatus -> writeStatus.getStat().getPartitionPath().contentEquals(partitionPath)));
     }
-
     writeStatuses.forEach(writeStatus -> {
       final HoodieWriteStat.RuntimeStats stats = writeStatus.getStat().getRuntimeStats();
       assertNotNull(stats);
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
yihua merged PR #10674:
URL: https://github.com/apache/hudi/pull/10674
(hudi) branch master updated: [MINOR] Fix zookeeper session expiration bug (#10671)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 7058d12e748 [MINOR] Fix zookeeper session expiration bug (#10671)
7058d12e748 is described below

commit 7058d12e74832dc420975269b698782add5e4fff
Author: Lin Liu <141371752+linliu-c...@users.noreply.github.com>
AuthorDate: Thu Feb 15 16:38:29 2024 -0800

    [MINOR] Fix zookeeper session expiration bug (#10671)
---
 .../TestDFSHoodieTestSuiteWriterAdapter.java       |  2 +-
 .../integ/testsuite/TestFileDeltaInputWriter.java  |  2 +-
 .../testsuite/job/TestHoodieTestSuiteJob.java      |  3 +-
 .../reader/TestDFSAvroDeltaInputReader.java        |  2 +-
 .../reader/TestDFSHoodieDatasetInputReader.java    |  3 +-
 .../callback/TestKafkaCallbackProvider.java        | 17 +++--
 .../deltastreamer/HoodieDeltaStreamerTestBase.java | 13 +++
 .../deltastreamer/TestHoodieDeltaStreamer.java     |  4 +--
 ...TestHoodieDeltaStreamerSchemaEvolutionBase.java |  1 -
 .../schema/TestFilebasedSchemaProvider.java        |  2 +-
 .../utilities/sources/BaseTestKafkaSource.java     | 14
 .../utilities/sources/TestAvroKafkaSource.java     | 17 +
 .../utilities/sources/TestSqlFileBasedSource.java  | 40 ++
 .../hudi/utilities/sources/TestSqlSource.java      |  2 +-
 .../debezium/TestAbstractDebeziumSource.java       | 18 --
 .../sources/helpers/TestKafkaOffsetGen.java        | 14
 .../utilities/testutils/UtilitiesTestBase.java     | 11 +-
 .../AbstractCloudObjectsSourceTestBase.java        |  2 +-
 .../transform/TestSqlFileBasedTransformer.java     | 36 ++-
 19 files changed, 129 insertions(+), 74 deletions(-)

diff --git a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
index 70430328553..f2ec458bf2d 100644
--- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
+++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
@@ -69,7 +69,7 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends UtilitiesTestBase {
   }

   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
     UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }

diff --git a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
index 4f99292b3fd..d8e54984367 100644
--- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
+++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
@@ -63,7 +63,7 @@ public class TestFileDeltaInputWriter extends UtilitiesTestBase {
   }

   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
     UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }

diff --git a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
index 087ffb8e400..9a4a2eee619 100644
--- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
+++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
@@ -49,6 +49,7 @@
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.params.provider.Arguments;
 import org.junit.jupiter.params.provider.MethodSource;

+import java.io.IOException;
 import java.util.UUID;
 import java.util.stream.Stream;
@@ -134,7 +135,7 @@ public class TestHoodieTestSuiteJob extends UtilitiesTestBase {
   }

   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
     UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }

diff --git a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
index 089a9d9fb55..8f93a82865a 100644
--- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
+++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
@@ -48,7 +48,7 @@ public class TestDFSAvroDeltaInputReader extends UtilitiesTestBase {
   }

   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
     UtilitiesTestBase.cleanUpUtilitiesTestServices();
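The change repeated across these test classes is widening `cleanupClass()` to declare `throws IOException`, so the shared teardown helper can propagate failures instead of swallowing them. A minimal, self-contained sketch of that compile-time contract (class names here are illustrative stand-ins, not Hudi's actual classes):

```java
import java.io.IOException;

// Illustrative only: once a shared helper declares a checked exception,
// each caller must declare (or catch) it too -- mirroring the
// `cleanupClass() throws IOException` edits in the diff above.
public class TeardownSketch {
    static int closedServices = 0;

    // Stand-in for UtilitiesTestBase.cleanUpUtilitiesTestServices();
    // a real implementation closes filesystems/servers and may fail.
    static void cleanUpServices() throws IOException {
        closedServices++;
    }

    // Would not compile without `throws IOException` (or a try/catch),
    // which is exactly what the commit adds to each @AfterAll method.
    public static void cleanupClass() throws IOException {
        cleanUpServices();
    }

    public static void main(String[] args) throws IOException {
        cleanupClass();
        System.out.println(closedServices); // prints 1
    }
}
```

The alternative would have been to catch and log inside each `cleanupClass()`, but letting JUnit see the exception makes teardown failures visible in test reports.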
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
vinothchandar merged PR #10671:
URL: https://github.com/apache/hudi/pull/10671

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947539677

## CI report:

* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
* cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22467)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]
yihua commented on code in PR #10591:
URL: https://github.com/apache/hudi/pull/10591#discussion_r1491811838

## hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java:

@@ -432,9 +432,9 @@ private static String getTmpSerializerFile() {
     return TMP_DIR + UUID.randomUUID().toString() + ".ser";
   }

-  private T deSerializeOperationResult(String inputP, FileSystem fs) throws Exception {
-    Path inputPath = new Path(inputP);
-    InputStream inputStream = fs.open(inputPath);
+  private T deSerializeOperationResult(HoodieLocation inputLocation,

Review Comment:
   I think the renaming makes sense. I've addressed the renaming in #10672.
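For readers following the review above: `deSerializeOperationResult` reads a Java-serialized compaction result back from storage, and the PR only changes *where* it reads from (a `HoodieLocation` instead of a Hadoop `Path`). A stdlib-only sketch of the underlying serialize/deserialize round trip, with hypothetical names and plain `java.io` in place of Hudi's storage API:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.UUID;

// Sketch of the temp-file serializer pattern CompactionCommand follows.
// Names are illustrative; Hudi's real helper opens its streams through
// the storage abstraction under review, not java.io directly.
public class OperationResultSerDe {
    // Analogous to getTmpSerializerFile(): a unique .ser scratch file.
    static File getTmpSerializerFile() throws IOException {
        return File.createTempFile("operation-result-" + UUID.randomUUID(), ".ser");
    }

    static <T extends Serializable> File serializeOperationResult(T result) throws IOException {
        File out = getTmpSerializerFile();
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(out))) {
            oos.writeObject(result);
        }
        return out;
    }

    @SuppressWarnings("unchecked")
    static <T> T deSerializeOperationResult(File input) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(input))) {
            return (T) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = serializeOperationResult("COMPACTION_SUCCEEDED");
        String roundTripped = deSerializeOperationResult(f);
        System.out.println(roundTripped); // prints COMPACTION_SUCCEEDED
    }
}
```

Because only the stream-opening line touches the filesystem API, swapping `Path`/`FileSystem` for `HoodieLocation` leaves the round-trip logic untouched, which is why the review is purely about naming.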
[PR] [DOCS] Clarify release notes on duplicate handling in Spark SQL and r… [hudi]
bhasudha opened a new pull request, #10680:
URL: https://github.com/apache/hudi/pull/10680

…elevant configs

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947532554

## CI report:

* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
* cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd UNKNOWN
(hudi) branch master updated: [MINOR] Rename test class to TestHadoopStorageConfiguration (#10670)
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f5b9d071a62 [MINOR] Rename test class to TestHadoopStorageConfiguration (#10670)
f5b9d071a62 is described below

commit f5b9d071a6221a40ab54803c48a79d0b58d45f10
Author: Y Ethan Guo
AuthorDate: Thu Feb 15 15:27:38 2024 -0800

    [MINOR] Rename test class to TestHadoopStorageConfiguration (#10670)
---
 ...oopStorageConfiguration.java => TestHadoopStorageConfiguration.java} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java b/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
similarity index 92%
rename from hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java
rename to hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
index 5225c599fb4..79658ccc441 100644
--- a/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java
+++ b/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
@@ -29,7 +29,7 @@ import java.util.Map;

 /**
  * Tests {@link HadoopStorageConfiguration}.
  */
-public class TestStorageConfigurationHadoopStorageConfiguration extends BaseTestStorageConfiguration {
+public class TestHadoopStorageConfiguration extends BaseTestStorageConfiguration {

   @Override
   protected StorageConfiguration getStorageConfiguration(Configuration conf) {
     return new HadoopStorageConfiguration(conf);
Re: [PR] [MINOR] Rename test class to TestHadoopStorageConfiguration [hudi]
jonvex merged PR #10670: URL: https://github.com/apache/hudi/pull/10670
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
yihua merged PR #10673: URL: https://github.com/apache/hudi/pull/10673
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
yihua commented on code in PR #10673:
URL: https://github.com/apache/hudi/pull/10673#discussion_r1491788224

## hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java:

@@ -238,7 +239,7 @@ private static HFileReader createReader(String hFilePath, FileSystem fileSystem)
     LOG.info("Opening HFile for reading :" + hFilePath);
     Path path = new Path(hFilePath);
     long fileSize = fileSystem.getFileStatus(path).getLen();
-    FSDataInputStream stream = fileSystem.open(path);
+    SeekableDataInputStream stream = new HadoopSeekableDataInputStream(fileSystem.open(path));

Review Comment:
   This will be replaced by the new storage API call which returns `SeekableDataInputStream` directly. Hadoop is going to be fully removed here in the future.
[jira] [Comment Edited] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817816#comment-17817816 ]

nadine edited comment on HUDI-7414 at 2/15/24 11:25 PM:
--------------------------------------------------------

DOCS - removed the sync base path config reference here: [https://github.com/apache/hudi/pull/10679/files]

was (Author: JIRAUSER298226):
removed the sync base path reference here: https://github.com/apache/hudi/pull/10679/files

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> -------------------------------------------------------------------
>
>                 Key: HUDI-7414
>                 URL: https://issues.apache.org/jira/browse/HUDI-7414
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: nadine
>            Assignee: nadine
>            Priority: Minor
>
> There was a jira issue filed where sarfaraz wanted to know more about
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set:
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it's not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}}
> config is superfluous: it is being set, but not used anywhere.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
(hudi) branch master updated: [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader (#10673)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 80f9f1ef36c [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader (#10673)
80f9f1ef36c is described below

commit 80f9f1ef36c0e7953a13ee4b433a6afc623ad4cc
Author: Y Ethan Guo
AuthorDate: Thu Feb 15 15:26:02 2024 -0800

    [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader (#10673)
---
 .../bootstrap/index/HFileBootstrapIndex.java       |  5 ++-
 .../io/storage/HoodieNativeAvroHFileReader.java    | 11 +++--
 .../TestInLineFileSystemWithHFileReader.java       |  8 ++--
 .../hudi/io/ByteArraySeekableDataInputStream.java  | 47 ++
 .../org/apache/hudi/io/hfile/HFileBlockReader.java |  6 +--
 .../org/apache/hudi/io/hfile/HFileReaderImpl.java  |  8 ++--
 .../org/apache/hudi/io/hfile/TestHFileReader.java  | 38 +
 7 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
index 989b0ad1e6d..7a6de5fe994 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
@@ -33,6 +33,8 @@ import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
+import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.hfile.HFileReader;
 import org.apache.hudi.io.hfile.HFileReaderImpl;
 import org.apache.hudi.io.hfile.Key;
@@ -41,7 +43,6 @@ import org.apache.hudi.io.storage.HoodieHFileUtils;
 import org.apache.hudi.io.util.IOUtils;

 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.CellComparatorImpl;
@@ -238,7 +239,7 @@ public class HFileBootstrapIndex extends BootstrapIndex {
     LOG.info("Opening HFile for reading :" + hFilePath);
     Path path = new Path(hFilePath);
     long fileSize = fileSystem.getFileStatus(path).getLen();
-    FSDataInputStream stream = fileSystem.open(path);
+    SeekableDataInputStream stream = new HadoopSeekableDataInputStream(fileSystem.open(path));
     return new HFileReaderImpl(stream, fileSize);
   }

diff --git a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
index cc3833996b9..e760b33b9e2 100644
--- a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
+++ b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
@@ -28,9 +28,13 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.ClosableIterator;
 import org.apache.hudi.common.util.collection.CloseableMappingIterator;
 import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.common.util.io.ByteBufferBackedInputStream;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
+import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
+import org.apache.hudi.io.ByteArraySeekableDataInputStream;
+import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.hfile.HFileReader;
 import org.apache.hudi.io.hfile.HFileReaderImpl;
 import org.apache.hudi.io.hfile.KeyValue;
@@ -41,7 +45,6 @@ import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.slf4j.Logger;
@@ -256,15 +259,15 @@ public class HoodieNativeAvroHFileReader extends HoodieAvroHFileReaderImplBase {
   }

   private HFileReader newHFileReader() throws IOException {
-    FSDataInputStream inputStream;
+    SeekableDataInputStream inputStream;
     long fileSize;
     if (path.isPresent()) {
       FileSystem fs = HadoopFSUtils.getFs(path.get(), conf);
       fileSize = fs.getFileStatus(path.get()).getLen();
-      inputStream = fs.open(path.get());
+      inputStream = new HadoopSeekableDataInputStream(fs.open(path.get()));
     } else {
       fileSize = bytesContent.get().length;
-      inputStream = new FSDataInputStream(new Seekab
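The commit above threads `SeekableDataInputStream` through the native HFile reader so the same reader can sit on either a Hadoop stream or an in-memory byte array (`ByteArraySeekableDataInputStream`). A self-contained sketch of the byte-array-backed case — simplified, with illustrative names rather than Hudi's actual API:

```java
import java.io.EOFException;
import java.io.IOException;

// Simplified stand-in for a byte-array-backed seekable input stream,
// loosely modeled on what ByteArraySeekableDataInputStream provides.
// Not Hudi's real class; method names are illustrative.
public class ByteArraySeekableStream {
    private final byte[] data;
    private int pos = 0;

    public ByteArraySeekableStream(byte[] data) {
        this.data = data;
    }

    // Random access is the whole point: block readers jump to offsets
    // recorded in the HFile index rather than scanning sequentially.
    public void seek(long position) throws IOException {
        if (position < 0 || position > data.length) {
            throw new IOException("Seek out of bounds: " + position);
        }
        pos = (int) position;
    }

    public long getPos() {
        return pos;
    }

    public int read() {
        return pos < data.length ? (data[pos++] & 0xFF) : -1;
    }

    public int readInt() throws IOException {
        if (pos + 4 > data.length) {
            throw new EOFException("Not enough bytes for an int at " + pos);
        }
        // Big-endian, matching java.io.DataInput semantics.
        return ((data[pos++] & 0xFF) << 24) | ((data[pos++] & 0xFF) << 16)
             | ((data[pos++] & 0xFF) << 8) | (data[pos++] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {0, 0, 0, 42, 7};
        ByteArraySeekableStream in = new ByteArraySeekableStream(bytes);
        System.out.println(in.readInt()); // prints 42
        in.seek(4);
        System.out.println(in.read());    // prints 7
    }
}
```

With an interface like this in place, the Hadoop-backed and in-memory cases become interchangeable at the reader's constructor, which is what lets the commit drop the `FSDataInputStream` wrapper in the bytes-content branch.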
[jira] [Commented] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817816#comment-17817816 ]

nadine commented on HUDI-7414:
------------------------------

removed the sync base path reference here: https://github.com/apache/hudi/pull/10679/files

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> -------------------------------------------------------------------
>
>                 Key: HUDI-7414
>                 URL: https://issues.apache.org/jira/browse/HUDI-7414
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: nadine
>            Assignee: nadine
>            Priority: Minor
>
> There was a jira issue filed where sarfaraz wanted to know more about
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set:
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it's not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}}
> config is superfluous: it is being set, but not used anywhere.
[jira] [Created] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
nadine created HUDI-7414:
-------------------------

             Summary: Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
                 Key: HUDI-7414
                 URL: https://issues.apache.org/jira/browse/HUDI-7414
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: nadine
            Assignee: nadine

There was a jira issue filed where sarfaraz wanted to know more about `hoodie.gcp.bigquery.sync.base_path`.

In the BigQuerySyncConfig file, there is a config property set:
[https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]

But it's not used anywhere else in the BigQuery code base.

However, I see
[https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} config is superfluous: it is being set, but not used anywhere.
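To make the report concrete: the key below can be set by users because it is declared in `BigQuerySyncConfig`, but nothing in the sync path reads it back, which is why it is being removed from the docs. A hypothetical sync configuration fragment — the surrounding keys are common BigQuery sync settings and all values are made up for illustration:

```
# Illustrative values only
hoodie.gcp.bigquery.sync.project_id=my-gcp-project
hoodie.gcp.bigquery.sync.dataset_name=my_dataset
# Declared in BigQuerySyncConfig but, per this issue, never read by the sync tool:
hoodie.gcp.bigquery.sync.base_path=gs://my-bucket/my_table
```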
[jira] [Commented] (HUDI-7289) Fix parameters for Big Query Sync
[ https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817810#comment-17817810 ]

nadine commented on HUDI-7289:
------------------------------

updated the hoodie.gcp.bigquery.sync.require_partition_filter config: [https://github.com/apache/hudi/pull/10679/files]

The {{hoodie.gcp.bigquery.sync.base_path}} config is superfluous: outside of being declared, it is not used. I removed the reference in the gcp doc and will update the code base to remove the reference.

> Fix parameters for Big Query Sync
> ---------------------------------
>
>                 Key: HUDI-7289
>                 URL: https://issues.apache.org/jira/browse/HUDI-7289
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: docs
>            Reporter: Aditya Goenka
>            Assignee: nadine
>            Priority: Minor
>             Fix For: 1.1.0
>
> Revisit the Big Query Sync configs: [https://hudi.apache.org/docs/gcp_bigquery/]
>
> From a user:
> Info about the {{hoodie.gcp.bigquery.sync.require_partition_filter}} config is
> missing from [here|https://hudi.apache.org/docs/gcp_bigquery], which is part
> of Hudi 0.14.1.
> Additionally, info about {{hoodie.gcp.bigquery.sync.base_path}} is not very
> clear; even the example is not understandable.
[PR] DOCS-updated gcp config doc [hudi]
nfarah86 opened a new pull request, #10679:
URL: https://github.com/apache/hudi/pull/10679

### Change Logs

updated `hoodie.gcp.bigquery.sync.require_partition_filter` and removed `hoodie.gcp.bigquery.sync.base_path`

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

updated https://hudi.apache.org/docs/next/gcp_bigquery

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

https://github.com/apache/hudi/assets/5392555/37aa9a50-b265-4771-9126-ac1d971e98e8

@xushiyan please review
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947489426

## CI report:

* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947474263

## CI report:

* d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
Re: [PR] [MINOR] Add parallel listing of existing partitions [hudi]
VitoMakarevich commented on PR #10460:
URL: https://github.com/apache/hudi/pull/10460#issuecomment-1947454649

@yihua @nsivabalan is there any chance you'll be able to take a look at it? It's a significant improvement and makes sync much faster. We've been running it in production for a month already and there are no issues.
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947416216

## CI report:

* 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
* d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947416115

## CI report:

* 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
* a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
* 1ec69433726191ec77a14b37719709d21e35059a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22464)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947407475

## CI report:

* 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
* d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
* 2e4bca19f3eac08c6377b96e90704a2bda95ea05 UNKNOWN
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947407347

## CI report:

* 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
* a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
* 1ec69433726191ec77a14b37719709d21e35059a UNKNOWN
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947397878

## CI report:

* 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
* d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1947397836

## CI report:

* 457f187f06803c99276135cc8e175df2b14386ba Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22461)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947338993

## CI report:

* 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
* d2145400b8cfe2d9b8f173b45ae25a817c8c5504 UNKNOWN
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671: URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947329593 ## CI report: * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947321137 ## CI report: * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947161411 ## CI report: * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459) * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
Re: [I] [SUPPORT] Setting hoodie.datasource.insert.dup.policy to drop still upserts the record in 0.14 [hudi]
keerthiskating commented on issue #10650: URL: https://github.com/apache/hudi/issues/10650#issuecomment-1947116699 @ad1happy2go I do not have the bandwidth to contribute.
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947005128 ## CI report: * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459) * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 UNKNOWN
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674: URL: https://github.com/apache/hudi/pull/10674#issuecomment-1947004908 ## CI report: * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455) * 457f187f06803c99276135cc8e175df2b14386ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22461)
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671: URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947004739 ## CI report: * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453) * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946979094 ## CI report: * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674: URL: https://github.com/apache/hudi/pull/10674#issuecomment-1946978954 ## CI report: * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455) * 457f187f06803c99276135cc8e175df2b14386ba UNKNOWN
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671: URL: https://github.com/apache/hudi/pull/10671#issuecomment-1946978665 ## CI report: * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453) * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae UNKNOWN
[I] [SUPPORT] Can't read a table with timestamp based partition key generator [hudi]
ofinchuk-bloomberg opened a new issue, #10678: URL: https://github.com/apache/hudi/issues/10678 Can't read a table that was created using TimestampBasedKeyGenerator or CustomKeyGenerator for a timestamp partition. The issue is that `ts` remains Long type while `_hoodie_partition_path` is formed as a String, so a simple read operation doesn't work and throws an exception.

**To Reproduce** Steps to reproduce the behavior:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SprkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .appName("SparkByExample")
      .getOrCreate()

    import spark.implicits._
    spark.createDataset(List(
        ("id1", "name1", System.currentTimeMillis()),
        ("id2", "name2", System.currentTimeMillis() + 1)))
      .toDF("id", "name", "ts")
      .write
      .format("hudi")
      .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
      .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombined.field", "name")
      .option("hoodie.table.name", "hudi_cow2")
      .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
      .option("hoodie.keygen.timebased.output.dateformat", "MMdd-HH")
      .mode(SaveMode.Overwrite)
      .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")

    spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
      .show()

    spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
      .show()
  }
}
```

When reading the parquet files directly, I see the following data:

```
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name   |id |name |ts           |date            |
|20240214184652987  |20240214184652987...|id1               |20240214-18           |9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
|20240214184652987  |20240214184652987...|id2               |20240214-18           |9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
```

**Expected behavior** The table should be read successfully into a Spark dataframe.

**Environment Description** I use Spark 3.3.3 and hudi-spark3.3-bundle_2.12:0.14.1 in a local environment * Running on Docker? (yes/no): no

**Stacktrace**
```
Exception in thread "main" java.lang.RuntimeException: Failed to cast value '20240214-18' to 'LongType' for partition column 'ts'
  at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
  at org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
  at org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
  at org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
  at org.
```
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
rmahindra123 commented on code in PR #10674: URL: https://github.com/apache/hudi/pull/10674#discussion_r1491428474 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java: ## @@ -195,19 +195,18 @@ public void testWriteStatusContentsAfterCompaction() throws Exception { String newCommitTime = "100"; writeClient.startCommitWithTime(newCommitTime); - List records = dataGen.generateInserts(newCommitTime, 100); + List records = dataGen.generateInserts(newCommitTime, 1000); JavaRDD recordsRDD = jsc.parallelize(records, 1); writeClient.insert(recordsRDD, newCommitTime).collect(); // Update all the 100 records Review Comment: The idea is to ensure that the scan times are in milliseconds as opposed to microsecs, so increasing the load.
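The reasoning behind the reviewer's change — a workload that finishes in microseconds truncates to zero on a millisecond-granularity timer, which makes any `scan time > 0` assertion flaky — can be sketched in plain Python. The `timed_ms` helper below is illustrative, not the test's actual timing code.

```python
import time

def timed_ms(fn) -> int:
    # Millisecond-granularity wall-clock timing, truncated to an integer the
    # way a long-valued duration counter would be.
    start = time.time()
    fn()
    return int((time.time() - start) * 1000)

# A near-instant workload usually truncates to 0 ms (below timer resolution),
# so an assertion on a positive scan time is flaky; a heavier workload
# reliably registers a nonzero duration.
small = timed_ms(lambda: None)
heavy = timed_ms(lambda: time.sleep(0.05))
print(small, heavy)
assert heavy > 0
```

Increasing the record count from 100 to 1000 serves the same purpose: it pushes the compaction scan comfortably into the millisecond range.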
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946814107 ## CI report: * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]
hudi-bot commented on PR #10676: URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946813984 ## CI report: * d5f38b26cede75b6d07367cb661f0fd20256e3e0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22457)
[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users
[ https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7413: - Labels: pull-request-available (was: ) > Make Issues with schema easier to understand for users > -- > > Key: HUDI-7413 > URL: https://issues.apache.org/jira/browse/HUDI-7413 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Provide exceptions that classify issues with schema. Additionally, provide > users with a clear explanation of what is wrong. -- This message was sent by Atlassian Jira (v8.20.10#820010)
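The improvement HUDI-7413 describes — classifying schema problems and explaining each one in plain language instead of surfacing a single opaque error — can be sketched as follows. This is an illustrative Python sketch of the idea only, not the implementation in the linked PR (which is in the Hudi Java/Scala codebase); the function name and the dict-based schema model are hypothetical.

```python
# Hypothetical sketch: compare a table schema against an incoming write schema
# and produce one clear, classified message per problem. Schemas are modeled
# here as simple {field_name: type_name} dicts for illustration.
def classify_schema_issues(table_schema: dict, incoming_schema: dict) -> list:
    issues = []
    for name, expected in table_schema.items():
        if name not in incoming_schema:
            issues.append(f"field '{name}' ({expected}) is missing from the incoming data")
        elif incoming_schema[name] != expected:
            issues.append(
                f"type mismatch for field '{name}': table expects {expected}, "
                f"incoming data has {incoming_schema[name]}"
            )
    for name, actual in incoming_schema.items():
        if name not in table_schema:
            issues.append(f"incoming field '{name}' ({actual}) does not exist in the table schema")
    return issues

issues = classify_schema_issues(
    {"id": "string", "ts": "long"},
    {"id": "string", "ts": "string", "extra": "int"},
)
for msg in issues:
    print(msg)  # one readable message per mismatch
```

The point of the ticket is exactly this kind of output: the user learns which field is wrong and how, rather than having to decode a generic serialization failure.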
Re: [PR] [HUDI-7413] make schema errors better [hudi]
hudi-bot commented on PR #10677: URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946787792 ## CI report: * 1b911520a42e187c1e4bb33345c630a1866bd375 UNKNOWN
[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users
[ https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7413: -- Status: Patch Available (was: In Progress) > Make Issues with schema easier to understand for users > -- > > Key: HUDI-7413 > URL: https://issues.apache.org/jira/browse/HUDI-7413 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > > Provide exceptions that classify issues with schema. Additionally, provide > users with a clear explanation of what is wrong. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users
[ https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7413: -- Status: In Progress (was: Open) > Make Issues with schema easier to understand for users > -- > > Key: HUDI-7413 > URL: https://issues.apache.org/jira/browse/HUDI-7413 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > > Provide exceptions that classify issues with schema. Additionally, provide > users with a clear explanation of what is wrong. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)
[ https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817743#comment-17817743 ] Haitham Eltaweel edited comment on HUDI-7412 at 2/15/24 5:36 PM: - Update: the same error (OOM) also occurs when writing the DF using parquet format. Find a snapshot from Spark UI below: !image-2024-02-15-11-35-19-156.png! was (Author: JIRAUSER301642): Update: the same OOM error also occurs when writing using parquet format. Find a snapshot from Spark UI below: !image-2024-02-15-11-35-19-156.png! > OOM error after upgrade to hudi 0.13 when writing big record (stream or batch > job) > -- > > Key: HUDI-7412 > URL: https://issues.apache.org/jira/browse/HUDI-7412 > Project: Apache Hudi > Issue Type: Bug > Components: spark > Environment: Amazon EMR version emr-6.11.1 > Spark version 3.3.2 > Hive version 3.1.3 > Hadoop version 3.3.3 > hudi version 0.13 >Reporter: Haitham Eltaweel >Priority: Major > Attachments: image-2024-02-15-11-35-19-156.png > > > After upgrading from hudi 0.11 to hudi 0.13. Big records (larger than 200MB) > can not be written to the destination location due to OOM error even after > increasing Spark resources memory. > Find the error details: java.lang.OutOfMemoryError: Java heap space. > The error never happened when running same jab using hudi 0.11. > Find below the use case details: > Read one json file which has one record of 900MB from S3 source location, > transform the DF then write the output DF to S3 target location. When using > upsert hudi operation, the error happens at Tagging job ([mapToPair at > HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0]) > and when using insert hudi operation, the error happens at Building workload > profile job. The error happens whether I run the job as Spark structured > streaming job or batch job. > Find the batch job code snippet shared below. 
I obfuscated some values. > from pyspark.sql import functions as f > from pyspark.sql import SparkSession > from pyspark.sql.types import * > > def main(): > > hudi_options = { > 'hoodie.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.table.type': 'MERGE_ON_READ', > 'hoodie.datasource.write.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.keygenerator.class': > 'org.apache.hudi.keygen.CustomKeyGenerator', > 'hoodie.datasource.write.recordkey.field': 'id', > 'hoodie.datasource.write.precombine.field': 'ts', > 'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE', > 'hoodie.embed.timeline.server': False, > 'hoodie.index.type': 'SIMPLE', > 'hoodie.parquet.compression.codec': 'snappy', > 'hoodie.clean.async': True, > 'hoodie.parquet.max.file.size': 125829120, > 'hoodie.parquet.small.file.limit': 104857600, > 'hoodie.parquet.block.size': 125829120, > 'hoodie.metadata.enable': True, > 'hoodie.metadata.validate': True, > "hoodie.datasource.write.hive_style_partitioning": True, > 'hoodie.datasource.hive_sync.support_timestamp': True, > "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x", > 'hoodie.datasource.hive_sync.username': 'xxx', > 'hoodie.datasource.hive_sync.password': 'xxx', > "hoodie.datasource.hive_sync.database": "xxx", > "hoodie.datasource.hive_sync.table": "hudi_streaming_reco", > "hoodie.datasource.hive_sync.partition_fields": "insert_hr", > "hoodie.datasource.hive_sync.enable": True, > 'hoodie.datasource.hive_sync.partition_extractor_class': > 'org.apache.hudi.hive.MultiPartKeysValueExtractor' > } > > spark=SparkSession.builder.getOrCreate() > > inputPath = "s3://xxx/" > > transfomredDF = ( > spark > .read > .text(inputPath, wholetext=True) > .select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.get_json_object(f.col("value"), "$._id").alias("id"), > f.get_json_object(f.col("value"), > "$.metadata.createdDateTime").alias("ts"), > 
f.input_file_name().alias("input_file_name")) > ) > > > > s3_output_path = "s3://xxx/" > transfomredDF \ > .write.format("hudi") \ > .options(**hudi_options) \ > .option('hoodie.datasource.write.operation', 'upsert') \ > .save(s3_output_path,mode='append') > > if __name__ == "__main__": > main() > > Find the spark su
[jira] [Created] (HUDI-7413) Make Issues with schema easier to understand for users
Jonathan Vexler created HUDI-7413: - Summary: Make Issues with schema easier to understand for users Key: HUDI-7413 URL: https://issues.apache.org/jira/browse/HUDI-7413 Project: Apache Hudi Issue Type: Improvement Reporter: Jonathan Vexler Assignee: Jonathan Vexler Provide exceptions that classify issues with schema. Additionally, provide users with a clear explanation of what is wrong. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)
[ https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817743#comment-17817743 ] Haitham Eltaweel commented on HUDI-7412: Update: the same OOM error also occurs when writing using parquet format. Find a snapshot from Spark UI below: !image-2024-02-15-11-35-19-156.png! > OOM error after upgrade to hudi 0.13 when writing big record (stream or batch > job) > -- > > Key: HUDI-7412 > URL: https://issues.apache.org/jira/browse/HUDI-7412 > Project: Apache Hudi > Issue Type: Bug > Components: spark > Environment: Amazon EMR version emr-6.11.1 > Spark version 3.3.2 > Hive version 3.1.3 > Hadoop version 3.3.3 > hudi version 0.13 >Reporter: Haitham Eltaweel >Priority: Major > Attachments: image-2024-02-15-11-35-19-156.png > > > After upgrading from hudi 0.11 to hudi 0.13. Big records (larger than 200MB) > can not be written to the destination location due to OOM error even after > increasing Spark resources memory. > Find the error details: java.lang.OutOfMemoryError: Java heap space. > The error never happened when running same jab using hudi 0.11. > Find below the use case details: > Read one json file which has one record of 900MB from S3 source location, > transform the DF then write the output DF to S3 target location. When using > upsert hudi operation, the error happens at Tagging job ([mapToPair at > HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0]) > and when using insert hudi operation, the error happens at Building workload > profile job. The error happens whether I run the job as Spark structured > streaming job or batch job. > Find the batch job code snippet shared below. I obfuscated some values. 
> from pyspark.sql import functions as f > from pyspark.sql import SparkSession > from pyspark.sql.types import * > > def main(): > > hudi_options = { > 'hoodie.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.table.type': 'MERGE_ON_READ', > 'hoodie.datasource.write.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.keygenerator.class': > 'org.apache.hudi.keygen.CustomKeyGenerator', > 'hoodie.datasource.write.recordkey.field': 'id', > 'hoodie.datasource.write.precombine.field': 'ts', > 'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE', > 'hoodie.embed.timeline.server': False, > 'hoodie.index.type': 'SIMPLE', > 'hoodie.parquet.compression.codec': 'snappy', > 'hoodie.clean.async': True, > 'hoodie.parquet.max.file.size': 125829120, > 'hoodie.parquet.small.file.limit': 104857600, > 'hoodie.parquet.block.size': 125829120, > 'hoodie.metadata.enable': True, > 'hoodie.metadata.validate': True, > "hoodie.datasource.write.hive_style_partitioning": True, > 'hoodie.datasource.hive_sync.support_timestamp': True, > "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x", > 'hoodie.datasource.hive_sync.username': 'xxx', > 'hoodie.datasource.hive_sync.password': 'xxx', > "hoodie.datasource.hive_sync.database": "xxx", > "hoodie.datasource.hive_sync.table": "hudi_streaming_reco", > "hoodie.datasource.hive_sync.partition_fields": "insert_hr", > "hoodie.datasource.hive_sync.enable": True, > 'hoodie.datasource.hive_sync.partition_extractor_class': > 'org.apache.hudi.hive.MultiPartKeysValueExtractor' > } > > spark=SparkSession.builder.getOrCreate() > > inputPath = "s3://xxx/" > > transfomredDF = ( > spark > .read > .text(inputPath, wholetext=True) > .select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.get_json_object(f.col("value"), "$._id").alias("id"), > f.get_json_object(f.col("value"), > "$.metadata.createdDateTime").alias("ts"), > 
f.input_file_name().alias("input_file_name")) > ) > > > > s3_output_path = "s3://xxx/" > transfomredDF \ > .write.format("hudi") \ > .options(**hudi_options) \ > .option('hoodie.datasource.write.operation', 'upsert') \ > .save(s3_output_path,mode='append') > > if __name__ == "__main__": > main() > > Find the spark submit command used : > spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf > spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension > --conf spark.serializer=org.apache.spark.serializer.KryoS
[jira] [Updated] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)
[ https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated HUDI-7412: --- Attachment: image-2024-02-15-11-35-19-156.png > OOM error after upgrade to hudi 0.13 when writing big record (stream or batch > job) > -- > > Key: HUDI-7412 > URL: https://issues.apache.org/jira/browse/HUDI-7412 > Project: Apache Hudi > Issue Type: Bug > Components: spark > Environment: Amazon EMR version emr-6.11.1 > Spark version 3.3.2 > Hive version 3.1.3 > Hadoop version 3.3.3 > hudi version 0.13 >Reporter: Haitham Eltaweel >Priority: Major > Attachments: image-2024-02-15-11-35-19-156.png > > > After upgrading from hudi 0.11 to hudi 0.13. Big records (larger than 200MB) > can not be written to the destination location due to OOM error even after > increasing Spark resources memory. > Find the error details: java.lang.OutOfMemoryError: Java heap space. > The error never happened when running same jab using hudi 0.11. > Find below the use case details: > Read one json file which has one record of 900MB from S3 source location, > transform the DF then write the output DF to S3 target location. When using > upsert hudi operation, the error happens at Tagging job ([mapToPair at > HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0]) > and when using insert hudi operation, the error happens at Building workload > profile job. The error happens whether I run the job as Spark structured > streaming job or batch job. > Find the batch job code snippet shared below. I obfuscated some values. 
> from pyspark.sql import functions as f > from pyspark.sql import SparkSession > from pyspark.sql.types import * > > def main(): > > hudi_options = { > 'hoodie.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.table.type': 'MERGE_ON_READ', > 'hoodie.datasource.write.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.keygenerator.class': > 'org.apache.hudi.keygen.CustomKeyGenerator', > 'hoodie.datasource.write.recordkey.field': 'id', > 'hoodie.datasource.write.precombine.field': 'ts', > 'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE', > 'hoodie.embed.timeline.server': False, > 'hoodie.index.type': 'SIMPLE', > 'hoodie.parquet.compression.codec': 'snappy', > 'hoodie.clean.async': True, > 'hoodie.parquet.max.file.size': 125829120, > 'hoodie.parquet.small.file.limit': 104857600, > 'hoodie.parquet.block.size': 125829120, > 'hoodie.metadata.enable': True, > 'hoodie.metadata.validate': True, > "hoodie.datasource.write.hive_style_partitioning": True, > 'hoodie.datasource.hive_sync.support_timestamp': True, > "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x", > 'hoodie.datasource.hive_sync.username': 'xxx', > 'hoodie.datasource.hive_sync.password': 'xxx', > "hoodie.datasource.hive_sync.database": "xxx", > "hoodie.datasource.hive_sync.table": "hudi_streaming_reco", > "hoodie.datasource.hive_sync.partition_fields": "insert_hr", > "hoodie.datasource.hive_sync.enable": True, > 'hoodie.datasource.hive_sync.partition_extractor_class': > 'org.apache.hudi.hive.MultiPartKeysValueExtractor' > } > > spark=SparkSession.builder.getOrCreate() > > inputPath = "s3://xxx/" > > transfomredDF = ( > spark > .read > .text(inputPath, wholetext=True) > .select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.get_json_object(f.col("value"), "$._id").alias("id"), > f.get_json_object(f.col("value"), > "$.metadata.createdDateTime").alias("ts"), > 
f.input_file_name().alias("input_file_name")) > ) > > > > s3_output_path = "s3://xxx/" > transfomredDF \ > .write.format("hudi") \ > .options(**hudi_options) \ > .option('hoodie.datasource.write.operation', 'upsert') \ > .save(s3_output_path,mode='append') > > if __name__ == "__main__": > main() > > Find the spark submit command used : > spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf > spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf > spark.kryoserializer.buffer.max=512 --num-executors 5 --executor-cores 3 > --executor-memory 10g --driver-memory 30g --name big_file_bat
[PR] make schema errors better [hudi]
jonvex opened a new pull request, #10677: URL: https://github.com/apache/hudi/pull/10677 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big file (stream or batch job)
Haitham Eltaweel created HUDI-7412: -- Summary: OOM error after upgrade to hudi 0.13 when writing big file (stream or batch job) Key: HUDI-7412 URL: https://issues.apache.org/jira/browse/HUDI-7412 Project: Apache Hudi Issue Type: Bug Components: spark Environment: Amazon EMR version emr-6.11.1 Spark version 3.3.2 Hive version 3.1.3 Hadoop version 3.3.3 hudi version 0.13 Reporter: Haitham Eltaweel After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) cannot be written to the destination location due to an OOM error, even after increasing Spark memory. Error details: java.lang.OutOfMemoryError: Java heap space. The error never happened when running the same job using hudi 0.11. Use case details: read one JSON file containing a single 900MB record from the S3 source location, transform the DataFrame, then write the output DataFrame to the S3 target location. When using the upsert hudi operation, the error happens in the Tagging job ([mapToPair at HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0]); when using the insert hudi operation, it happens in the Building workload profile job. The error happens whether the job runs as a Spark structured streaming job or a batch job. The batch job code snippet is shared below, with some values obfuscated.
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import *


def main():

    hudi_options = {
        'hoodie.table.name': 'hudi_streaming_reco',
        'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
        'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
        'hoodie.embed.timeline.server': False,
        'hoodie.index.type': 'SIMPLE',
        'hoodie.parquet.compression.codec': 'snappy',
        'hoodie.clean.async': True,
        'hoodie.parquet.max.file.size': 125829120,
        'hoodie.parquet.small.file.limit': 104857600,
        'hoodie.parquet.block.size': 125829120,
        'hoodie.metadata.enable': True,
        'hoodie.metadata.validate': True,
        "hoodie.datasource.write.hive_style_partitioning": True,
        'hoodie.datasource.hive_sync.support_timestamp': True,
        "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
        'hoodie.datasource.hive_sync.username': 'xxx',
        'hoodie.datasource.hive_sync.password': 'xxx',
        "hoodie.datasource.hive_sync.database": "xxx",
        "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
        "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
        "hoodie.datasource.hive_sync.enable": True,
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
    }

    spark = SparkSession.builder.getOrCreate()

    inputPath = "s3://xxx/"

    transfomredDF = (
        spark
        .read
        .text(inputPath, wholetext=True)
        .select(f.date_format(f.current_timestamp(), 'MMddHH').astype('string').alias('insert_hr'),
                f.col("value").alias("raw_data"),
                f.get_json_object(f.col("value"), "$._id").alias("id"),
                f.get_json_object(f.col("value"), "$.metadata.createdDateTime").alias("ts"),
                f.input_file_name().alias("input_file_name"))
    )

    s3_output_path = "s3://xxx/"

    transfomredDF \
        .write.format("hudi") \
        .options(**hudi_options) \
        .option('hoodie.datasource.write.operation', 'upsert') \
        .save(s3_output_path, mode='append')


if __name__ == "__main__":
    main()

Find the spark submit command used:

spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=512 --num-executors 5 --executor-cores 3 --executor-memory 10g --driver-memory 30g --name big_file_batch --queue casualty --deploy-mode cluster big_record_test.py

-- This message was sent by Atlassian Jira (v8.20.10#820010)
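The file-sizing values in `hudi_options` are raw byte counts, which makes them hard to read at a glance. A quick sketch of the arithmetic behind them (the `mib` helper name is mine, not from the report):

```python
def mib(n):
    """Convert mebibytes to bytes; Hudi sizing configs expect raw byte counts."""
    return n * 1024 * 1024

# The reporter's values decoded: 120 MiB max parquet file/block size,
# 100 MiB small-file limit.
hudi_sizing = {
    'hoodie.parquet.max.file.size': mib(120),    # 125829120
    'hoodie.parquet.small.file.limit': mib(100), # 104857600
    'hoodie.parquet.block.size': mib(120),       # 125829120
}

print(hudi_sizing['hoodie.parquet.max.file.size'])  # 125829120
```

Worth noting that a single 900MB record is far larger than any of these file-size targets, and during tagging each record must be materialized on the heap, which is consistent with the reported `java.lang.OutOfMemoryError`.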
[jira] [Updated] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)
[ https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haitham Eltaweel updated HUDI-7412: --- Summary: OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job) (was: OOM error after upgrade to hudi 0.13 when writing big file (stream or batch job)) > OOM error after upgrade to hudi 0.13 when writing big record (stream or batch > job) > -- > > Key: HUDI-7412 > URL: https://issues.apache.org/jira/browse/HUDI-7412 > Project: Apache Hudi > Issue Type: Bug > Components: spark > Environment: Amazon EMR version emr-6.11.1 > Spark version 3.3.2 > Hive version 3.1.3 > Hadoop version 3.3.3 > hudi version 0.13 > Reporter: Haitham Eltaweel > Priority: Major > > After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) cannot be written to the destination location due to an OOM error, even after increasing Spark memory. > Error details: java.lang.OutOfMemoryError: Java heap space. > The error never happened when running the same job using hudi 0.11. > Use case details: read one JSON file containing a single 900MB record from the S3 source location, transform the DataFrame, then write the output DataFrame to the S3 target location. When using the upsert hudi operation, the error happens in the Tagging job ([mapToPair at HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0]); when using the insert hudi operation, it happens in the Building workload profile job. > The error happens whether the job runs as a Spark structured streaming job or a batch job. > The batch job code snippet is shared below, with some values obfuscated.
> from pyspark.sql import functions as f > from pyspark.sql import SparkSession > from pyspark.sql.types import * > > def main(): > > hudi_options = { > 'hoodie.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.table.type': 'MERGE_ON_READ', > 'hoodie.datasource.write.table.name': 'hudi_streaming_reco', > 'hoodie.datasource.write.keygenerator.class': > 'org.apache.hudi.keygen.CustomKeyGenerator', > 'hoodie.datasource.write.recordkey.field': 'id', > 'hoodie.datasource.write.precombine.field': 'ts', > 'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE', > 'hoodie.embed.timeline.server': False, > 'hoodie.index.type': 'SIMPLE', > 'hoodie.parquet.compression.codec': 'snappy', > 'hoodie.clean.async': True, > 'hoodie.parquet.max.file.size': 125829120, > 'hoodie.parquet.small.file.limit': 104857600, > 'hoodie.parquet.block.size': 125829120, > 'hoodie.metadata.enable': True, > 'hoodie.metadata.validate': True, > "hoodie.datasource.write.hive_style_partitioning": True, > 'hoodie.datasource.hive_sync.support_timestamp': True, > "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x", > 'hoodie.datasource.hive_sync.username': 'xxx', > 'hoodie.datasource.hive_sync.password': 'xxx', > "hoodie.datasource.hive_sync.database": "xxx", > "hoodie.datasource.hive_sync.table": "hudi_streaming_reco", > "hoodie.datasource.hive_sync.partition_fields": "insert_hr", > "hoodie.datasource.hive_sync.enable": True, > 'hoodie.datasource.hive_sync.partition_extractor_class': > 'org.apache.hudi.hive.MultiPartKeysValueExtractor' > } > > spark=SparkSession.builder.getOrCreate() > > inputPath = "s3://xxx/" > > transfomredDF = ( > spark > .read > .text(inputPath, wholetext=True) > .select(f.date_format(f.current_timestamp(), > 'MMddHH').astype('string').alias('insert_hr'), > f.col("value").alias("raw_data"), > f.get_json_object(f.col("value"), "$._id").alias("id"), > f.get_json_object(f.col("value"), > "$.metadata.createdDateTime").alias("ts"), > 
f.input_file_name().alias("input_file_name")) > ) > > > > s3_output_path = "s3://xxx/" > transfomredDF \ > .write.format("hudi") \ > .options(**hudi_options) \ > .option('hoodie.datasource.write.operation', 'upsert') \ > .save(s3_output_path,mode='append') > > if __name__ == "__main__": > main() > > Find the spark submit command used : > spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf > spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf > spark.kryoserializer.buffer.max=512 --num-executors 5 --exe
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
linliu-code commented on code in PR #10674: URL: https://github.com/apache/hudi/pull/10674#discussion_r1491251501 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java: ## @@ -195,19 +195,18 @@ public void testWriteStatusContentsAfterCompaction() throws Exception { String newCommitTime = "100"; writeClient.startCommitWithTime(newCommitTime); - List records = dataGen.generateInserts(newCommitTime, 100); + List records = dataGen.generateInserts(newCommitTime, 1000); JavaRDD recordsRDD = jsc.parallelize(records, 1); writeClient.insert(recordsRDD, newCommitTime).collect(); // Update all the 100 records Review Comment: 100 -> 1000? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
jonvex commented on code in PR #10673: URL: https://github.com/apache/hudi/pull/10673#discussion_r1491247516 ## hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java: ## @@ -238,7 +239,7 @@ private static HFileReader createReader(String hFilePath, FileSystem fileSystem) LOG.info("Opening HFile for reading :" + hFilePath); Path path = new Path(hFilePath); long fileSize = fileSystem.getFileStatus(path).getLen(); - FSDataInputStream stream = fileSystem.open(path); + SeekableDataInputStream stream = new HadoopSeekableDataInputStream(fileSystem.open(path)); Review Comment: Is this going to be HadoopSeekableDataInputStream going forward? Or is hadoop going to be fully removed from here at some point? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]
hudi-bot commented on PR #10676: URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946401504 ## CI report: * d5f38b26cede75b6d07367cb661f0fd20256e3e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22457) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Athena does not support s3a partition scheme anymore leading to missing data [hudi]
codope closed issue #10595: [SUPPORT] Athena does not support s3a partition scheme anymore leading to missing data URL: https://github.com/apache/hudi/issues/10595 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7362] Fix hudi partition base path scheme to s3 (#10596)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 6e6d66a7097 [HUDI-7362] Fix hudi partition base path scheme to s3 (#10596) 6e6d66a7097 is described below commit 6e6d66a70973a78d4f155bf13860c65565402930 Author: Nicolas Paris AuthorDate: Thu Feb 15 16:55:27 2024 +0100 [HUDI-7362] Fix hudi partition base path scheme to s3 (#10596) --- .../main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java index b814e353583..1e19b44a499 100644 --- a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java +++ b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java @@ -199,7 +199,7 @@ public class AWSGlueCatalogSyncClient extends HoodieSyncClient { Table table = getTable(awsGlue, databaseName, tableName); StorageDescriptor sd = table.storageDescriptor(); List partitionInputs = partitionsToAdd.stream().map(partition -> { -String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString(); +String fullPartitionPath = FSUtils.getPartitionPath(s3aToS3(getBasePath()), partition).toString(); List partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition); StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath)); return PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build(); @@ -242,7 +242,7 @@ public class AWSGlueCatalogSyncClient extends HoodieSyncClient { Table table = getTable(awsGlue, databaseName, tableName); StorageDescriptor sd = table.storageDescriptor(); List updatePartitionEntries = 
changedPartitions.stream().map(partition -> { -String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString(); +String fullPartitionPath = FSUtils.getPartitionPath(s3aToS3(getBasePath()), partition).toString(); List partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition); StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath)); PartitionInput partitionInput = PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build();
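The commit's fix routes the base path through `s3aToS3` before building each Glue partition location, so the catalog stores `s3://` locations that Athena accepts. A minimal Python sketch of the same scheme rewrite (the function name mirrors the Java helper; this is an illustration, not the Hudi code):

```python
def s3a_to_s3(path: str) -> str:
    """Rewrite an s3a:// URI to the s3:// scheme expected by AWS Glue/Athena.

    Only the scheme prefix is touched; the bucket, key, and any later
    occurrences of 's3a' inside the path are left alone.
    """
    if path.startswith("s3a://"):
        return "s3://" + path[len("s3a://"):]
    return path

print(s3a_to_s3("s3a://bucket/tables/hudi_tbl"))  # s3://bucket/tables/hudi_tbl
print(s3a_to_s3("s3://bucket/tables/hudi_tbl"))   # already s3://, unchanged
```

Applying the rewrite at partition-registration time, as the diff does in both `addPartitionsToTable` and the changed-partition path, keeps the Hudi table's own base path untouched while normalizing only what lands in the catalog.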
Re: [PR] [HUDI-7362] Fix hudi partition base path scheme to s3 [hudi]
codope merged PR #10596: URL: https://github.com/apache/hudi/pull/10596 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]
hudi-bot commented on PR #10676: URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946386360 ## CI report: * d5f38b26cede75b6d07367cb661f0fd20256e3e0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7411) Meta sync does not consider clean commits while syncing partitions
[ https://issues.apache.org/jira/browse/HUDI-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7411: - Labels: pull-request-available (was: ) > Meta sync does not consider clean commits while syncing partitions > -- > > Key: HUDI-7411 > URL: https://issues.apache.org/jira/browse/HUDI-7411 > Project: Apache Hudi > Issue Type: Task > Components: meta-sync > Reporter: Sagar Sumit > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.14.2 > > The cleaner can delete partitions, but meta sync fails to drop those partitions from the catalog. This can cause queries from engines that depend on the catalog to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]
codope opened a new pull request, #10676: URL: https://github.com/apache/hudi/pull/10676 ### Change Logs The cleaner can delete partitions, but meta sync fails to drop those partitions from the catalog. This can cause queries from engines that depend on the catalog to fail. TODO: I have only tested locally. I am going to add a test. ### Impact Catalog will reflect correct partition metadata, considering cleaner commits. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
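The essence of this fix, partitions deleted by the cleaner must also be dropped from the catalog during meta sync, can be illustrated with a small set-difference sketch (the function and sample partitions are hypothetical, not Hudi's actual sync code):

```python
def partitions_to_drop(catalog_partitions, table_partitions):
    """Partitions still registered in the catalog but no longer present in
    the table after cleaning; meta sync should drop these from the catalog."""
    return sorted(set(catalog_partitions) - set(table_partitions))

# Example: one partition was removed by a clean commit on storage,
# but the catalog was never told about it.
catalog = ["dt=2024-01-01", "dt=2024-01-02", "dt=2024-01-03"]
on_storage = ["dt=2024-01-02", "dt=2024-01-03"]

print(partitions_to_drop(catalog, on_storage))  # ['dt=2024-01-01']
```

If meta sync only looks at write commits and ignores clean commits, this difference is never computed, leaving stale catalog entries that break catalog-dependent query engines.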
[jira] [Created] (HUDI-7411) Meta sync does not consider clean commits while syncing partitions
Sagar Sumit created HUDI-7411: - Summary: Meta sync does not consider clean commits while syncing partitions Key: HUDI-7411 URL: https://issues.apache.org/jira/browse/HUDI-7411 Project: Apache Hudi Issue Type: Task Components: meta-sync Reporter: Sagar Sumit Fix For: 1.0.0, 0.14.2 The cleaner can delete partitions, but meta sync fails to drop those partitions from the catalog. This can cause queries from engines that depend on the catalog to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning (#10651)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new f29811b1a4c [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning (#10651) f29811b1a4c is described below commit f29811b1a4ca9121a5124d63ded147dba7b90b93 Author: Sivabalan Narayanan AuthorDate: Thu Feb 15 05:16:41 2024 -0800 [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning (#10651) * Fixing incremental cleaning with savepoint * Addressing feedback --- .../table/action/clean/CleanActionExecutor.java| 3 +- .../action/clean/CleanPlanActionExecutor.java | 12 +- .../hudi/table/action/clean/CleanPlanner.java | 116 -- .../apache/hudi/table/action/TestCleanPlanner.java | 247 - .../hudi/utils/TestMetadataConversionUtils.java| 4 +- .../functional/TestExternalPathHandling.java | 5 +- .../java/org/apache/hudi/table/TestCleaner.java| 7 +- .../testutils/HoodieSparkClientTestHarness.java| 4 +- hudi-common/src/main/avro/HoodieCleanMetadata.avsc | 11 +- hudi-common/src/main/avro/HoodieCleanerPlan.avsc | 11 +- .../clean/CleanPlanV1MigrationHandler.java | 3 +- .../clean/CleanPlanV2MigrationHandler.java | 3 +- .../org/apache/hudi/common/util/CleanerUtils.java | 5 +- .../table/view/TestIncrementalFSViewSync.java | 2 +- .../hudi/common/testutils/HoodieTestTable.java | 8 +- .../hudi/common/util/TestClusteringUtils.java | 6 +- 16 files changed, 395 insertions(+), 52 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java index 40d91b63394..61c0eeeffb0 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java +++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java @@ -219,7 +219,8 @@ public class CleanActionExecutor extends BaseActionExecutor extends BaseActionExecutor> { private static final Logger LOG = LoggerFactory.getLogger(CleanPlanActionExecutor.class); - private final Option> extraMetadata; public CleanPlanActionExecutor(HoodieEngineContext context, @@ -142,12 +142,20 @@ public class CleanPlanActionExecutor extends BaseActionExecutor new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())).orElse(null), planner.getLastCompletedCommitTimestamp(), config.getCleanerPolicy().name(), Collections.emptyMap(), - CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, partitionsToDelete); + CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps())); } catch (IOException e) { throw new HoodieIOException("Failed to schedule clean operation", e); } } + private Map prepareExtraMetadata(List savepointedTimestamps) { +if (savepointedTimestamps.isEmpty()) { + return Collections.emptyMap(); +} else { + return Collections.singletonMap(SAVEPOINTED_TIMESTAMPS, savepointedTimestamps.stream().collect(Collectors.joining(","))); +} + } + /** * Creates a Cleaner plan if there are files to be cleaned and stores them in instant file. * Cleaner Plan contains absolute file paths. 
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java index 0dd516a88d1..19cbe0f91a7 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java @@ -41,6 +41,7 @@ import org.apache.hudi.common.table.view.HoodieTableFileSystemView; import org.apache.hudi.common.table.view.SyncableFileSystemView; import org.apache.hudi.common.util.CleanerUtils; import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.common.util.collection.Pair; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.exception.HoodieIOException; @@ -55,6 +56,7 @@ import java.io.IOException; import java.io.Serializable; import java.time.Instant; import java.util.ArrayList; +import java.util.Arrays; import java.util.Collections; import java.util.Iterator; import java.util.List; @@ -78,6 +80,7 @@ public class CleanPlanner implements Serializable { public static final Integer CLEAN_PLAN_VERSION_1 = CleanPlanV1MigrationHandler.VERSION;
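The `prepareExtraMetadata` change in the diff stores savepointed timestamps as a single comma-joined string under the `SAVEPOINTED_TIMESTAMPS` key of the clean plan's extra metadata. A round-trip sketch of that encoding in Python (the key's string value here is illustrative; only the comma-join behavior is taken from the diff):

```python
SAVEPOINTED_TIMESTAMPS = "savepointed_timestamps"  # illustrative key name

def prepare_extra_metadata(savepointed):
    """Mirror of the Java logic: empty map when there are no savepoints,
    otherwise a single comma-joined entry."""
    if not savepointed:
        return {}
    return {SAVEPOINTED_TIMESTAMPS: ",".join(savepointed)}

def parse_extra_metadata(extra):
    """Inverse operation: recover the list of savepointed timestamps."""
    value = extra.get(SAVEPOINTED_TIMESTAMPS, "")
    return value.split(",") if value else []

meta = prepare_extra_metadata(["20240210101010", "20240211121212"])
print(meta)  # {'savepointed_timestamps': '20240210101010,20240211121212'}
print(parse_extra_metadata(meta))
print(prepare_extra_metadata([]))  # {}
```

Storing the savepoints that were in effect when the clean plan was made lets a later incremental clean detect savepoint removals, which is the edge case the commit addresses.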
Re: [PR] [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning [hudi]
codope merged PR #10651: URL: https://github.com/apache/hudi/pull/10651 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Unable to insert record into Hudi table using Hudi Spark Connector through Golang [hudi]
Shekkylar opened a new issue, #10675: URL: https://github.com/apache/hudi/issues/10675

## Issue Summary

Encountering challenges while integrating the Hudi Spark Connector with Golang. Insert, update, and upsert queries result in errors, while create table and select queries work without issues.

## Environment

- Java 8
- EMR cluster emr-version-7.0
- Spark version 3.5.0
- Spark Connect server started on port 15002
- Golang v1.21.7 used to connect to Spark locally via SSH tunneling
- Glue metastore for catalog

### Start Spark Server Command

Executed the following command in the AWS EMR CLI within Spark:

```bash
cd /usr/lib/spark
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.0 \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.catalog.aws.glue.sync.tool.classes=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"
```

### Golang Code Snippet

```go
package main

import (
	"fmt"
	"log"

	"github.com/apache/spark-connect-go/v34/client/sql"
)

func main() {
	remote := "sc://localhost:8157"
	spark, err := sql.SparkSession.Builder.Remote(remote).Build()
	if err != nil {
		fmt.Println(err)
		log.Fatal("Failed to connect to Spark:", err)
	}

	// Example SQL query to show all tables
	query := "SHOW TABLES"
	alltab, err := spark.Sql(query)
	if err != nil {
		log.Fatal("Failed to execute SQL query:", err)
	}
	// Show the result
	alltab.Show(10, true)

	// Create the Hudi table with the basic schema
	_, err = spark.Sql(`create table hudi_table (
		id bigint,
		name string,
		dt string
	) using hudi
	LOCATION "s3://spark-hudi-table/output/"
	TBLPROPERTIES (
		type = "cow",
		primaryKey = "id"
	)
	partitioned by (dt);`)
	if err != nil {
		fmt.Println("failed to Create", err)
	}

	// Insert data into the Hudi table
	_, err = spark.Sql(`insert into default.hudi_table (id, name, dt) VALUES (1, 'test 1', '2023-11-11'), (2, 'test 2', '2023-11-12');`)
	if err != nil {
		fmt.Println("Failed to insert data into Hudi table:", err)
	}

	// Query the Hudi table
	result, err := spark.Sql("SELECT * FROM hudi_table")
	if err != nil {
		fmt.Println("Failed to query Hudi table:", err)
	}
	// Show the result
	result.Show(10, true)

	// Stop the Spark session
	spark.Stop()
}
```

### Issue Details

While executing the insert query, the Spark job fails with the following error taken from the Spark Connect server logs:

```
24/02/15 12:43:14 INFO Javalin: Starting Javalin ...
24/02/15 12:43:14 INFO Javalin: You are running Javalin 4.6.7 (released October 24, 2022. Your Javalin version is 479 days old. Consider checking for a newer version.).
24/02/15 12:43:14 INFO Javalin: Listening on http://localhost:39459/
24/02/15 12:43:14 INFO Javalin: Javalin started in 151ms \o/
24/02/15 12:43:14 INFO CodeGenerator: Code generated in 14.79973 ms
24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:15 INFO MultipartUploadOutputStream: close closed:false s3://spark-hudi-table/output/.hoodie/20240215124314041.commit.requested
24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
24/02/15 12:43:16 INFO MultipartUploadOutputStream: close closed:false s3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties
24/02/15 12:43:17 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties' for reading
24/02/15 12:43:17 INFO SparkContext: Starting job: Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_i
```
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674: URL: https://github.com/apache/hudi/pull/10674#issuecomment-1946043346 ## CI report: * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
hudi-bot commented on PR #10673: URL: https://github.com/apache/hudi/pull/10673#issuecomment-1946043277 ## CI report: * 181c55e683edcc1743c39e955433c3bc24976883 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22454) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]
hudi-bot commented on PR #10671: URL: https://github.com/apache/hudi/pull/10671#issuecomment-1945950025 ## CI report: * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674: URL: https://github.com/apache/hudi/pull/10674#issuecomment-1945746116 ## CI report: * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
hudi-bot commented on PR #10674: URL: https://github.com/apache/hudi/pull/10674#issuecomment-1945733022 ## CI report: * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
hudi-bot commented on PR #10673:
URL: https://github.com/apache/hudi/pull/10673#issuecomment-1945732958

## CI report:

* 181c55e683edcc1743c39e955433c3bc24976883 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22454)

## Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]
hudi-bot commented on PR #10673:
URL: https://github.com/apache/hudi/pull/10673#issuecomment-1945720924

## CI report:

* 181c55e683edcc1743c39e955433c3bc24976883 UNKNOWN

## Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] High runtime for a batch in SparkWriteHelper stage [hudi]
devjain47 commented on issue #6014:
URL: https://github.com/apache/hudi/issues/6014#issuecomment-1945693518

@ad1happy2go, almost 20 GB of data is present
[PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]
rmahindra123 opened a new pull request, #10674:
URL: https://github.com/apache/hudi/pull/10674

### Change Logs

Fix flaky test introduced in PR https://github.com/apache/hudi/pull/10619

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

Medium

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed