[jira] [Updated] (HUDI-7415) OLAP queries need to support reading data from the origin table by default

2024-02-15 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-7415:
-
Description: 
OLAP queries need to support reading data from the origin table by default. For 
example, when querying from an OLAP engine such as StarRocks or Presto, we can 
only read data from the ro/rt sub-tables and get an empty result from the origin 
table, which is not suitable:

Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)

  was:
OLAP queries need to support reading data from the origin table by default. For 
example, when querying from an OLAP engine such as StarRocks or Presto, we can 
only read data from the ro/rt sub-tables and get an empty result from the origin 
table, which is not suitable:

Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)


> OLAP queries need to support reading data from the origin table by default
> --
>
> Key: HUDI-7415
> URL: https://issues.apache.org/jira/browse/HUDI-7415
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xy
>Assignee: xy
>Priority: Major
>
> OLAP queries need to support reading data from the origin table by default. For 
> example, when querying from an OLAP engine such as StarRocks or Presto, we can only read 
> data from the ro/rt sub-tables and get an empty result from the origin table, which is not suitable:
> Querying a MOR table with StarRocks:
> MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_rt;
> +---------------------+-----------------------+--

[jira] [Created] (HUDI-7415) OLAP queries need to support reading data from the origin table by default

2024-02-15 Thread xy (Jira)
xy created HUDI-7415:


 Summary: OLAP queries need to support reading data from the origin table by 
default
 Key: HUDI-7415
 URL: https://issues.apache.org/jira/browse/HUDI-7415
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: xy
Assignee: xy


OLAP queries need to support reading data from the origin table by default. For 
example, when querying from an OLAP engine such as StarRocks or Presto, we can 
only read data from the ro/rt sub-tables and get an empty result from the origin 
table, which is not suitable:

Querying a MOR table with StarRocks:

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (2.11 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22_ro;
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| _hoodie_commit_time | _hoodie_commit_seqno  | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name                                           |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
| 20230522100703567   | 20230522100703567_0_0 | 1                  | partition=de           | f14492ed-b672-4f60-8e86-6359790feb2a-0_0-17-2013_2023052210 |
+---------------------+-----------------------+--------------------+------------------------+-------------------------------------------------------------+
1 row in set (0.22 sec)

MySQL [(none)]> select * from hudi_catalog_01.hudi_test.test_mor_hudi_22;
Empty set (1.23 sec)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
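
For background on the ro/rt split in HUDI-7415: for a MOR table, Hudi's meta sync 
registers a read-optimized ("_ro") and a real-time ("_rt") view of the same table, 
and Spark exposes the same choice through hoodie.datasource.query.type. Below is a 
minimal sketch assuming Spark 3.x with the Hudi Spark bundle on the classpath; the 
path and class names are illustrative, not taken from the ticket.

import org.apache.spark.sql.SparkSession;

public class MorQueryTypeDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("mor-query-type-demo").getOrCreate();

    // Read-optimized view: compacted base files only; this is what the "_ro"
    // sub-table serves in the queries above.
    spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "read_optimized")
        .load("/tmp/test_mor_hudi_22")
        .show();

    // Snapshot view: base files merged with log files; this is what the "_rt"
    // sub-table serves, and what the ticket proposes the plain table name
    // should return by default.
    spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load("/tmp/test_mor_hudi_22")
        .show();
  }
}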


Re: [PR] [MINOR] Remove hive bugs [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947849437

   
   ## CI report:
   
   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22474)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947849397

   
   ## CI report:
   
   * a8a65546c774415a5953a50f75442d9a9b558067 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947849223

   
   ## CI report:
   
   * 804d73922a136f6fed0fdcc559bfb697bda4942e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22473)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-02-15 Thread via GitHub


ad1happy2go commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1947804747

   Had a working session with @maheshguptags. We were able to consistently 
reproduce with a composite key in his setup, although I couldn't reproduce it in 
my setup. So this issue is intermittent.
   
   @yihua Can you please check the .hoodie directory (attached), as you requested?
   
   [hoodie.zip](https://github.com/apache/hudi/files/14307039/hoodie.zip)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove hive bugs [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947768029

   
   ## CI report:
   
   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22474)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947767855

   
   ## CI report:
   
   * 50f21651c6c21b8a72c43247503b4d900d06a11e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22418)
 
   * 804d73922a136f6fed0fdcc559bfb697bda4942e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22473)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove hive bugs [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10684:
URL: https://github.com/apache/hudi/pull/10684#issuecomment-1947762759

   
   ## CI report:
   
   * ac152f0cb58ad798565b5dd56b531e9c8dc3d409 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1947762599

   
   ## CI report:
   
   * 50f21651c6c21b8a72c43247503b4d900d06a11e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22418)
 
   * 804d73922a136f6fed0fdcc559bfb697bda4942e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Clarify config descriptions (#10681)

2024-02-15 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9da1f2b15e2 [MINOR] Clarify config descriptions (#10681)
9da1f2b15e2 is described below

commit 9da1f2b15e2bf873a7d3db56dbc0183479c38c4c
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Feb 15 20:39:30 2024 -0800

[MINOR] Clarify config descriptions (#10681)

This aligns with the doc change here: 
https://github.com/apache/hudi/pull/10680
---
 .../src/main/scala/org/apache/hudi/DataSourceOptions.scala  | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index 99080629e17..47a7c61a60f 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -500,7 +500,9 @@ object DataSourceWriteOptions {
 .defaultValue("false")
 .markAdvanced()
 .withDocumentation("If set to true, records from the incoming dataframe 
will not overwrite existing records with the same key during the write 
operation. " +
-  "This config is deprecated as of 0.14.0. Please use 
hoodie.datasource.insert.dup.policy instead.");
+  " **Note** Just for Insert operation in Spark SQL writing since 
0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` 
instead " +
+  "for a simplified duplicate handling experience. The new config will be 
incorporated into all other writing flows and this config will be fully 
deprecated " +
+  "in future releases.");
 
   val PARTITIONS_TO_DELETE: ConfigProperty[String] = ConfigProperty
 .key("hoodie.datasource.write.partitions.to.delete")
@@ -597,7 +599,7 @@ object DataSourceWriteOptions {
 .withValidValues(NONE_INSERT_DUP_POLICY, DROP_INSERT_DUP_POLICY, 
FAIL_INSERT_DUP_POLICY)
 .markAdvanced()
 .sinceVersion("0.14.0")
-.withDocumentation("When operation type is set to \"insert\", users can 
optionally enforce a dedup policy. This policy will be employed "
+.withDocumentation("**Note** This is only applicable to Spark SQL 
writing.When operation type is set to \"insert\", users can optionally 
enforce a dedup policy. This policy will be employed "
   + " when records being ingested already exists in storage. Default 
policy is none and no action will be taken. Another option is to choose " +
   " \"drop\", on which matching records from incoming will be dropped and 
the rest will be ingested. Third option is \"fail\" which will " +
   "fail the write operation when same records are re-ingested. In other 
words, a given record as deduced by the key generation policy " +

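To illustrate the hoodie.datasource.insert.dup.policy config described in the 
commit above, here is a minimal sketch assuming Spark SQL with Hudi 0.14.0+ and an 
existing Hudi table named hudi_tbl; the table and values are illustrative.

import org.apache.spark.sql.SparkSession;

public class InsertDupPolicyDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("insert-dup-policy-demo")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // Spark SQL writes only: drop incoming rows whose record key already
    // exists in storage, instead of overwriting them.
    spark.sql("SET hoodie.datasource.insert.dup.policy = drop");
    spark.sql("INSERT INTO hudi_tbl VALUES (1, 'a'), (2, 'b')");

    // Per the config doc above, the other values are 'none' (default, no
    // action taken) and 'fail' (abort the write when the same records are
    // re-ingested).
  }
}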


Re: [PR] [MINOR][DOCS] Clarify config descriptions [hudi]

2024-02-15 Thread via GitHub


nsivabalan merged PR #10681:
URL: https://github.com/apache/hudi/pull/10681


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR][DOCS] Clarify config descriptions [hudi]

2024-02-15 Thread via GitHub


nsivabalan commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947745524

   https://github.com/apache/hudi/assets/513218/9190f75d-c679-47b4-beca-626cd3818499 (screenshot)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] MINOR_Remove_hive_bugs [hudi]

2024-02-15 Thread via GitHub


linliu-code opened a new pull request, #10684:
URL: https://github.com/apache/hudi/pull/10684

   ### Change Logs
   
   Try to remove Hive.
   
   ### Impact
   
   None.
   
   ### Risk level (write none, low medium or high below)
   
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] initial commit to update doris docs [hudi]

2024-02-15 Thread via GitHub


nfarah86 commented on PR #10683:
URL: https://github.com/apache/hudi/pull/10683#issuecomment-1947735093

   This PR is not ready yet; waiting for Doris to confirm the details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] initial commit to update doris docs [hudi]

2024-02-15 Thread via GitHub


nfarah86 opened a new pull request, #10683:
URL: https://github.com/apache/hudi/pull/10683

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   Update the Doris doc with compatibility details.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   update doris doc
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947722192

   
   ## CI report:
   
   * a8a65546c774415a5953a50f75442d9a9b558067 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Cleanup FileSystemViewManager code [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10682:
URL: https://github.com/apache/hudi/pull/10682#issuecomment-1947717851

   
   ## CI report:
   
   * a8a65546c774415a5953a50f75442d9a9b558067 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Cleanup FileSystemViewManager code [hudi]

2024-02-15 Thread via GitHub


voonhous opened a new pull request, #10682:
URL: https://github.com/apache/hudi/pull/10682

   ### Change Logs
   
   Cleaning up `FileSystemViewManager#createViewManager`-related code that passes 
around a Hadoop Configuration that is never used.
   
   Added a docstring to indicate that the `#init` function in 
`HoodieTableMetaClient` is used for tests.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [X] Change Logs and Impact were stated clearly
   - [X] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947650248

   
   ## CI report:
   
   * cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Clarify config descriptions [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947645303

   
   ## CI report:
   
   * 5edfc37412400c5d01c154b122e7fd41491b7a86 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22470)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Clarify config descriptions [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10681:
URL: https://github.com/apache/hudi/pull/10681#issuecomment-1947640295

   
   ## CI report:
   
   * 5edfc37412400c5d01c154b122e7fd41491b7a86 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Clarify config descriptions [hudi]

2024-02-15 Thread via GitHub


bhasudha opened a new pull request, #10681:
URL: https://github.com/apache/hudi/pull/10681

   This aligns with the doc change here: [10680](https://github.com/apache/hudi/pull/10680)
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch asf-site updated: [DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)

2024-02-15 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 7f125b6f310 [DOCS] Clarify release notes on duplicate handling in 
Spark SQL and relevant configs (#10680)
7f125b6f310 is described below

commit 7f125b6f3107fba9070f7e2c20fc58fbef564392
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Feb 15 17:05:39 2024 -0800

[DOCS] Clarify release notes on duplicate handling in Spark SQL and 
relevant configs (#10680)
---
 website/docs/configurations.md | 104 ++---
 website/releases/release-0.14.0.md |   8 +-
 .../version-0.14.0/configurations.md   |   4 +-
 .../version-0.14.1/configurations.md   |   4 +-
 4 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 01ef8401954..18c3581e305 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -127,59 +127,59 @@ Options useful for writing tables via 
`write.format.option(...)`
 [**Advanced Configs**](#Write-Options-advanced-configs)
 
 
[diff truncated: the patch reflows the wide "Write Options" config table in 
website/docs/configurations.md; the visible rows cover 
hoodie.datasource.hive_sync.serde_properties, 
hoodie.datasource.hive_sync.table_properties, hoodie.datasource.overwrite.mode, 
hoodie.datasource.write.partitions.to.delete, hoodie.datasource.write.table.name, 
and hoodie.datasource.compaction.async.enable]

Re: [PR] [DOCS] Clarify release notes on duplicate handling in Spark SQL and r… [hudi]

2024-02-15 Thread via GitHub


bhasudha merged PR #10680:
URL: https://github.com/apache/hudi/pull/10680


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7381] Fix flaky test introduced in PR 10619 (#10674)

2024-02-15 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fe488bc1b64 [HUDI-7381]  Fix flaky test introduced in PR 10619 (#10674)
fe488bc1b64 is described below

commit fe488bc1b649f1a9f90fcc178923ee12be3ce90f
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Thu Feb 15 16:40:56 2024 -0800

[HUDI-7381]  Fix flaky test introduced in PR 10619 (#10674)

Co-authored-by: rmahindra123 
---
 .../table/action/compact/TestHoodieCompactor.java   | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
index 313f14ce989..4ad19bfbfc4 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java
@@ -195,19 +195,18 @@ public class TestHoodieCompactor extends 
HoodieSparkClientTestHarness {
   String newCommitTime = "100";
   writeClient.startCommitWithTime(newCommitTime);
 
-  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 100);
+  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 1000);
   JavaRDD<HoodieRecord> recordsRDD = jsc.parallelize(records, 1);
   writeClient.insert(recordsRDD, newCommitTime).collect();
 
-  // Update all the 100 records
-  newCommitTime = "101";
-  updateRecords(config, newCommitTime, records);
-
-  assertLogFilesNumEqualsTo(config, 1);
-
-  String compactionInstantTime = "102";
-  HoodieData<WriteStatus> result = compact(writeClient, compactionInstantTime);
-
+  // Update all the 1000 records across 5 commits to generate sufficient 
log files.
+  int i = 1;
+  for (; i < 5; i++) {
+newCommitTime = String.format("10%s", i);
+updateRecords(config, newCommitTime, records);
+assertLogFilesNumEqualsTo(config, i);
+  }
+  HoodieData<WriteStatus> result = compact(writeClient, String.format("10%s", i));
   verifyCompaction(result);
 
   // Verify compaction.requested, compaction.completed metrics counts.
@@ -243,7 +242,6 @@ public class TestHoodieCompactor extends 
HoodieSparkClientTestHarness {
 assertLogFilesNumEqualsTo(config, 1);
 
HoodieData<WriteStatus> result = compact(writeClient, "10" + (i + 1));
-
 verifyCompaction(result);
 
 // Verify compaction.requested, compaction.completed metrics counts.
@@ -304,7 +302,6 @@ public class TestHoodieCompactor extends 
HoodieSparkClientTestHarness {
 for (String partitionPath : dataGen.getPartitionPaths()) {
   assertTrue(writeStatuses.stream().anyMatch(writeStatus -> 
writeStatus.getStat().getPartitionPath().contentEquals(partitionPath)));
 }
-
 writeStatuses.forEach(writeStatus -> {
   final HoodieWriteStat.RuntimeStats stats = 
writeStatus.getStat().getRuntimeStats();
   assertNotNull(stats);



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


yihua merged PR #10674:
URL: https://github.com/apache/hudi/pull/10674


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Fix zookeeper session expiration bug (#10671)

2024-02-15 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7058d12e748 [MINOR] Fix zookeeper session expiration bug (#10671)
7058d12e748 is described below

commit 7058d12e74832dc420975269b698782add5e4fff
Author: Lin Liu <141371752+linliu-c...@users.noreply.github.com>
AuthorDate: Thu Feb 15 16:38:29 2024 -0800

[MINOR] Fix zookeeper session expiration bug (#10671)
---
 .../TestDFSHoodieTestSuiteWriterAdapter.java   |  2 +-
 .../integ/testsuite/TestFileDeltaInputWriter.java  |  2 +-
 .../testsuite/job/TestHoodieTestSuiteJob.java  |  3 +-
 .../reader/TestDFSAvroDeltaInputReader.java|  2 +-
 .../reader/TestDFSHoodieDatasetInputReader.java|  3 +-
 .../callback/TestKafkaCallbackProvider.java| 17 +++--
 .../deltastreamer/HoodieDeltaStreamerTestBase.java | 13 +++
 .../deltastreamer/TestHoodieDeltaStreamer.java |  4 +--
 ...TestHoodieDeltaStreamerSchemaEvolutionBase.java |  1 -
 .../schema/TestFilebasedSchemaProvider.java|  2 +-
 .../utilities/sources/BaseTestKafkaSource.java | 14 
 .../utilities/sources/TestAvroKafkaSource.java | 17 +
 .../utilities/sources/TestSqlFileBasedSource.java  | 40 ++
 .../hudi/utilities/sources/TestSqlSource.java  |  2 +-
 .../debezium/TestAbstractDebeziumSource.java   | 18 --
 .../sources/helpers/TestKafkaOffsetGen.java| 14 
 .../utilities/testutils/UtilitiesTestBase.java | 11 +-
 .../AbstractCloudObjectsSourceTestBase.java|  2 +-
 .../transform/TestSqlFileBasedTransformer.java | 36 ++-
 19 files changed, 129 insertions(+), 74 deletions(-)

diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
index 70430328553..f2ec458bf2d 100644
--- 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
+++ 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
@@ -69,7 +69,7 @@ public class TestDFSHoodieTestSuiteWriterAdapter extends 
UtilitiesTestBase {
   }
 
   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
 UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }
 
diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
index 4f99292b3fd..d8e54984367 100644
--- 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
+++ 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestFileDeltaInputWriter.java
@@ -63,7 +63,7 @@ public class TestFileDeltaInputWriter extends 
UtilitiesTestBase {
   }
 
   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
 UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }
 
diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
index 087ffb8e400..9a4a2eee619 100644
--- 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
+++ 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/job/TestHoodieTestSuiteJob.java
@@ -49,6 +49,7 @@ import org.junit.jupiter.api.Test;
 import org.junit.jupiter.params.provider.Arguments;
 import org.junit.jupiter.params.provider.MethodSource;
 
+import java.io.IOException;
 import java.util.UUID;
 import java.util.stream.Stream;
 
@@ -134,7 +135,7 @@ public class TestHoodieTestSuiteJob extends 
UtilitiesTestBase {
   }
 
   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
 UtilitiesTestBase.cleanUpUtilitiesTestServices();
   }
 
diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
index 089a9d9fb55..8f93a82865a 100644
--- 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
+++ 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/reader/TestDFSAvroDeltaInputReader.java
@@ -48,7 +48,7 @@ public class TestDFSAvroDeltaInputReader extends 
UtilitiesTestBase {
   }
 
   @AfterAll
-  public static void cleanupClass() {
+  public static void cleanupClass() throws IOException {
 UtilitiesTestBase.cleanUpUtilitiesTestServices();

Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


vinothchandar merged PR #10671:
URL: https://github.com/apache/hudi/pull/10671


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947539677

   
   ## CI report:
   
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
 
   * cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-02-15 Thread via GitHub


yihua commented on code in PR #10591:
URL: https://github.com/apache/hudi/pull/10591#discussion_r1491811838


##
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java:
##
@@ -432,9 +432,9 @@ private static String getTmpSerializerFile() {
 return TMP_DIR + UUID.randomUUID().toString() + ".ser";
   }
 
-  private <T> T deSerializeOperationResult(String inputP, FileSystem fs) throws Exception {
-Path inputPath = new Path(inputP);
-InputStream inputStream = fs.open(inputPath);
+  private <T> T deSerializeOperationResult(HoodieLocation inputLocation,

Review Comment:
   I think the renaming makes sense.  I've addressed the renaming in #10672 .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [DOCS] Clarify release notes on duplicate handling in Spark SQL and r… [hudi]

2024-02-15 Thread via GitHub


bhasudha opened a new pull request, #10680:
URL: https://github.com/apache/hudi/pull/10680

   …elevant configs
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947532554

   
   ## CI report:
   
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
 
   * cd7d9d81a83d8ce904f827058c8d72f5bc46a5dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Rename test class to TestHadoopStorageConfiguration (#10670)

2024-02-15 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f5b9d071a62 [MINOR] Rename test class to 
TestHadoopStorageConfiguration (#10670)
f5b9d071a62 is described below

commit f5b9d071a6221a40ab54803c48a79d0b58d45f10
Author: Y Ethan Guo 
AuthorDate: Thu Feb 15 15:27:38 2024 -0800

[MINOR] Rename test class to TestHadoopStorageConfiguration (#10670)
---
 ...oopStorageConfiguration.java => TestHadoopStorageConfiguration.java} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java
 
b/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
similarity index 92%
rename from 
hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java
rename to 
hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
index 5225c599fb4..79658ccc441 100644
--- 
a/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestStorageConfigurationHadoopStorageConfiguration.java
+++ 
b/hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java
@@ -29,7 +29,7 @@ import java.util.Map;
 /**
  * Tests {@link HadoopStorageConfiguration}.
  */
-public class TestStorageConfigurationHadoopStorageConfiguration extends 
BaseTestStorageConfiguration<Configuration> {
+public class TestHadoopStorageConfiguration extends 
BaseTestStorageConfiguration<Configuration> {
   @Override
   protected StorageConfiguration<Configuration> getStorageConfiguration(Configuration conf) {
 return new HadoopStorageConfiguration(conf);



Re: [PR] [MINOR] Rename test class to TestHadoopStorageConfiguration [hudi]

2024-02-15 Thread via GitHub


jonvex merged PR #10670:
URL: https://github.com/apache/hudi/pull/10670


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


yihua merged PR #10673:
URL: https://github.com/apache/hudi/pull/10673


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


yihua commented on code in PR #10673:
URL: https://github.com/apache/hudi/pull/10673#discussion_r1491788224


##
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java:
##
@@ -238,7 +239,7 @@ private static HFileReader createReader(String hFilePath, 
FileSystem fileSystem)
   LOG.info("Opening HFile for reading :" + hFilePath);
   Path path = new Path(hFilePath);
   long fileSize = fileSystem.getFileStatus(path).getLen();
-  FSDataInputStream stream = fileSystem.open(path);
+  SeekableDataInputStream stream = new 
HadoopSeekableDataInputStream(fileSystem.open(path));

Review Comment:
   This will be replaced by the new storage API call which returns 
`SeekableDataInputStream` directly.  Hadoop is going to be fully removed here 
in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-02-15 Thread nadine (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817816#comment-17817816
 ] 

nadine edited comment on HUDI-7414 at 2/15/24 11:25 PM:


DOCS: removed the sync base path config reference here: 
[https://github.com/apache/hudi/pull/10679/files]


was (Author: JIRAUSER298226):
removed the sync base path reference here: 
https://github.com/apache/hudi/pull/10679/files

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nadine
>Assignee: nadine
>Priority: Minor
>
> There was a jira issue filed where sarfaraz wanted to know more about 
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set: 
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
>  But it’s not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
>  being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} 
> config is superfluous: I see it being set, but not being used anywhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader (#10673)

2024-02-15 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 80f9f1ef36c [HUDI-7410] Use SeekableDataInputStream as the input of 
native HFile reader (#10673)
80f9f1ef36c is described below

commit 80f9f1ef36c0e7953a13ee4b433a6afc623ad4cc
Author: Y Ethan Guo 
AuthorDate: Thu Feb 15 15:26:02 2024 -0800

[HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader 
(#10673)
---
 .../bootstrap/index/HFileBootstrapIndex.java   |  5 ++-
 .../io/storage/HoodieNativeAvroHFileReader.java| 11 +++--
 .../TestInLineFileSystemWithHFileReader.java   |  8 ++--
 .../hudi/io/ByteArraySeekableDataInputStream.java  | 47 ++
 .../org/apache/hudi/io/hfile/HFileBlockReader.java |  6 +--
 .../org/apache/hudi/io/hfile/HFileReaderImpl.java  |  8 ++--
 .../org/apache/hudi/io/hfile/TestHFileReader.java  | 38 +
 7 files changed, 71 insertions(+), 52 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
index 989b0ad1e6d..7a6de5fe994 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
@@ -33,6 +33,8 @@ import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
+import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.hfile.HFileReader;
 import org.apache.hudi.io.hfile.HFileReaderImpl;
 import org.apache.hudi.io.hfile.Key;
@@ -41,7 +43,6 @@ import org.apache.hudi.io.storage.HoodieHFileUtils;
 import org.apache.hudi.io.util.IOUtils;
 
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.CellComparatorImpl;
@@ -238,7 +239,7 @@ public class HFileBootstrapIndex extends BootstrapIndex {
   LOG.info("Opening HFile for reading :" + hFilePath);
   Path path = new Path(hFilePath);
   long fileSize = fileSystem.getFileStatus(path).getLen();
-  FSDataInputStream stream = fileSystem.open(path);
+  SeekableDataInputStream stream = new 
HadoopSeekableDataInputStream(fileSystem.open(path));
   return new HFileReaderImpl(stream, fileSize);
 }
 
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
index cc3833996b9..e760b33b9e2 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java
@@ -28,9 +28,13 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.ClosableIterator;
 import org.apache.hudi.common.util.collection.CloseableMappingIterator;
 import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.common.util.io.ByteBufferBackedInputStream;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
+import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
+import org.apache.hudi.io.ByteArraySeekableDataInputStream;
+import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.hfile.HFileReader;
 import org.apache.hudi.io.hfile.HFileReaderImpl;
 import org.apache.hudi.io.hfile.KeyValue;
@@ -41,7 +45,6 @@ import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.slf4j.Logger;
@@ -256,15 +259,15 @@ public class HoodieNativeAvroHFileReader extends 
HoodieAvroHFileReaderImplBase {
   }
 
   private HFileReader newHFileReader() throws IOException {
-FSDataInputStream inputStream;
+SeekableDataInputStream inputStream;
 long fileSize;
 if (path.isPresent()) {
   FileSystem fs = HadoopFSUtils.getFs(path.get(), conf);
   fileSize = fs.getFileStatus(path.get()).getLen();
-  inputStream = fs.open(path.get());
+  inputStream = new HadoopSeekableDataInputStream(fs.open(path.get()));
 } else {
   fileSize = bytesContent.get().length;
-  inputStream = new FSDataInputStream(new 
Seekab

[jira] [Commented] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-02-15 Thread nadine (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817816#comment-17817816
 ] 

nadine commented on HUDI-7414:
--

removed the sync base path reference here: 
https://github.com/apache/hudi/pull/10679/files

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nadine
>Assignee: nadine
>Priority: Minor
>
> There was a jira issue filed where sarfaraz wanted to know more about 
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set: 
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
>  But it’s not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
>  being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} 
> config is superfluous: I see it being set, but not being used anywhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-02-15 Thread nadine (Jira)
nadine created HUDI-7414:


 Summary: Remove hoodie.gcp.bigquery.sync.base_path reference in 
the gcp docs
 Key: HUDI-7414
 URL: https://issues.apache.org/jira/browse/HUDI-7414
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nadine
Assignee: nadine


There was a jira issue filed where sarfaraz wanted to know more about 
`hoodie.gcp.bigquery.sync.base_path`.

In the BigQuerySyncConfig file, there is a config property set: 
[https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
 But it’s not used anywhere else in the BigQuery code base.

However, I see
[https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
 being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} 
config is superfluous: I see it being set, but not being used anywhere.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

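For context on "declared but unused": Hudi configs are defined with the 
ConfigProperty builder, the same pattern visible in the DataSourceOptions.scala 
diff earlier in this digest. A minimal sketch of such a declaration follows; the 
class name and doc text are illustrative, not copied from BigQuerySyncConfig. 
Declaring a property this way registers and documents it, but nothing forces any 
code path to actually read it, which is the gap this ticket describes.

import org.apache.hudi.common.config.ConfigProperty;

public class UnusedConfigExample {
  // Declared and documented, yet no sync code path consults it.
  public static final ConfigProperty<String> BIGQUERY_SYNC_BASE_PATH = ConfigProperty
      .key("hoodie.gcp.bigquery.sync.base_path")
      .noDefaultValue()
      .withDocumentation("Base path of the Hudi table to sync to BigQuery");
}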

[jira] [Commented] (HUDI-7289) Fix parameters for Big Query Sync

2024-02-15 Thread nadine (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817810#comment-17817810
 ] 

nadine commented on HUDI-7289:
--

Updated the hoodie.gcp.bigquery.sync.require_partition_filter config:

[https://github.com/apache/hudi/pull/10679/files]

The {{hoodie.gcp.bigquery.sync.base_path}} config is superfluous: outside of it 
being declared, it's not being used. I removed the reference in the GCP doc and 
will update the code base to remove the reference.

 

> Fix parameters for Big Query Sync
> -
>
> Key: HUDI-7289
> URL: https://issues.apache.org/jira/browse/HUDI-7289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Aditya Goenka
>Assignee: nadine
>Priority: Minor
> Fix For: 1.1.0
>
>
> Revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/]
>  
> From a user - 
> Info about the {{hoodie.gcp.bigquery.sync.require_partition_filter}} config is 
> missing from [here|https://hudi.apache.org/docs/gcp_bigquery], which is part 
> of Hudi 0.14.1.
> Additionally, the info about {{hoodie.gcp.bigquery.sync.base_path}} is not very 
> clear; even the example is hard to understand.
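
For the doc revisit, a small illustrative snippet of the sync keys in question, 
supplied as plain properties; the keys exist in BigQuerySyncConfig, but the 
values here are made-up examples.

```java
import java.util.Properties;

public class BigQuerySyncPropsExample {
  public static void main(String[] args) {
    Properties p = new Properties();
    p.setProperty("hoodie.gcp.bigquery.sync.project_id", "my-project");   // example value
    p.setProperty("hoodie.gcp.bigquery.sync.dataset_name", "my_dataset"); // example value
    // The under-documented flag from this ticket: have BigQuery require a
    // partition filter on queries against the synced table.
    p.setProperty("hoodie.gcp.bigquery.sync.require_partition_filter", "true");
    p.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```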





[PR] DOCS-updated gcp config doc [hudi]

2024-02-15 Thread via GitHub


nfarah86 opened a new pull request, #10679:
URL: https://github.com/apache/hudi/pull/10679

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   updated  `hoodie.gcp.bigquery.sync.require_partition_filter` and removed 
`hoodie.gcp.bigquery.sync.base_path`
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   updated https://hudi.apache.org/docs/next/gcp_bigquery
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   
   https://github.com/apache/hudi/assets/5392555/37aa9a50-b265-4771-9126-ac1d971e98e8
   
   @xushiyan please review
   





Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947489426

   
   ## CI report:
   
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947474263

   
   ## CI report:
   
   * d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
 
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
 
   
   



Re: [PR] [MINOR] Add parallel listing of existing partitions [hudi]

2024-02-15 Thread via GitHub


VitoMakarevich commented on PR #10460:
URL: https://github.com/apache/hudi/pull/10460#issuecomment-1947454649

   @yihua @nsivabalan is there any chance you'll be able to take a look at it? 
It's a significant improvement and makes sync much faster. We've been running 
it in production for a month already with no issues.
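
For readers outside the PR, a toy sketch of the general idea (not the PR's 
actual code, which is in the linked diff): fan the per-partition listing out 
over a parallel stream instead of looping over partitions sequentially.

```java
// Toy demo: list each subdirectory of a base path in parallel and count files.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListingDemo {
  public static void main(String[] args) throws IOException {
    List<Path> partitions;
    try (Stream<Path> dirs = Files.list(Paths.get("/tmp"))) { // stand-in base path
      partitions = dirs.filter(Files::isDirectory).collect(Collectors.toList());
    }
    // parallelStream(): each partition is listed on a fork-join worker.
    long totalFiles = partitions.parallelStream()
        .mapToLong(ParallelListingDemo::countFiles)
        .sum();
    System.out.println("files across " + partitions.size() + " partitions: " + totalFiles);
  }

  private static long countFiles(Path dir) {
    try (Stream<Path> files = Files.list(dir)) {
      return files.count();
    } catch (IOException e) {
      return 0; // skip unreadable directories in this demo
    }
  }
}
```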





Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947416216

   
   ## CI report:
   
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   * d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
 
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22465)
 
   
   



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947416115

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
 
   * 1ec69433726191ec77a14b37719709d21e35059a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22464)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947407475

   
   ## CI report:
   
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   * d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
 
   * 2e4bca19f3eac08c6377b96e90704a2bda95ea05 UNKNOWN
   
   



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947407347

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
 
   * 1ec69433726191ec77a14b37719709d21e35059a UNKNOWN
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947397878

   
   ## CI report:
   
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   * d2145400b8cfe2d9b8f173b45ae25a817c8c5504 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22463)
 
   
   



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1947397836

   
   ## CI report:
   
   * 457f187f06803c99276135cc8e175df2b14386ba Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22461)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947338993

   
   ## CI report:
   
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   * d2145400b8cfe2d9b8f173b45ae25a817c8c5504 UNKNOWN
   
   



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947329593

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947321137

   
   ## CI report:
   
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947161411

   
   ## CI report:
   
   * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
 
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22462)
 
   
   



Re: [I] [SUPPORT] Setting hoodie.datasource.insert.dup.policy to drop still upserts the record in 0.14 [hudi]

2024-02-15 Thread via GitHub


keerthiskating commented on issue #10650:
URL: https://github.com/apache/hudi/issues/10650#issuecomment-1947116699

   @ad1happy2go I do not have the bandwidth to contribute.
   





Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1947005128

   
   ## CI report:
   
   * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
 
   * 31a6e5e6bf3187b7c87ee0459628624b570d6a20 UNKNOWN
   
   



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1947004908

   
   ## CI report:
   
   * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455)
 
   * 457f187f06803c99276135cc8e175df2b14386ba Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22461)
 
   
   



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1947004739

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453)
 
   * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22460)
 
   
   



Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946979094

   
   ## CI report:
   
   * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
 
   
   



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1946978954

   
   ## CI report:
   
   * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455)
 
   * 457f187f06803c99276135cc8e175df2b14386ba UNKNOWN
   
   



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1946978665

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453)
 
   * a5ef9c5b810f9b68f1f668aef1a66749fdeb7fae UNKNOWN
   
   



[I] [SUPPORT] Can't read a table with timestamp based partition key generator [hudi]

2024-02-15 Thread via GitHub


ofinchuk-bloomberg opened a new issue, #10678:
URL: https://github.com/apache/hudi/issues/10678

   
   Can't read a table that was created using TimestampBasedKeyGenerator or 
CustomKeyGenerator with a timestamp partition.
   The issue is that `ts` remains a Long, while _hoodie_partition_path is formed 
as a String, so a simple read does not work and throws an exception (see the 
workaround sketch after the stacktrace below).
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   object SprkDemo {
   
   def main(args:Array[String]): Unit ={
   
   val spark = SparkSession.builder()
   .master("local[1]")
   .config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
   .config("spark.sql.extensions", 
"org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
   .appName("SparkByExample")
   .getOrCreate();
   
   import spark.implicits._
   spark.createDataset(List(("id1","name1", System.currentTimeMillis()),
   ("id2","name2",(System.currentTimeMillis()+1))
   ))
   .toDF("id","name","ts")
   .write
   .format("hudi")
   .option("hoodie.datasource.write.keygenerator.class", 
"org.apache.hudi.keygen.CustomKeyGenerator")
   .option("hoodie.datasource.write.partitionpath.field", 
"ts:timestamp")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombined.field", "name")
   .option("hoodie.table.name", "hudi_cow2")
   .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
   .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
   .mode(SaveMode.Overwrite)
   .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")
   
   
spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
   .show()
   
   spark.read.format("hudi")
   .option("hoodie.schema.on.read.enable","true")
   .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
   .show()
   
   }
   }
   ```
   when reading the parquet files directly, I see the following data:
   ```
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|            date|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   |  20240214184652987|20240214184652987...|               id1|           20240214-18|9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
   |  20240214184652987|20240214184652987...|               id2|           20240214-18|9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
   ```
   
   
   
   **Expected behavior**
   
   The table should be read successfully into a Spark dataframe.
   
   **Environment Description**
   
   I use spark 3.3.3 and hudi-spark3.3-bundle_2.12:0.14.1 in local environment
   
   * Running on Docker? (yes/no) :no
   
   
   **Stacktrace**
   
   ```
   Exception in thread "main" java.lang.RuntimeException: Failed to cast value 
'20240214-18' to 'LongType' for partition column 'ts'
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
at 
org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
at 
org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
at 
org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
at 
org.
   ```
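
A hypothetical workaround, not proposed in the issue itself: keep `ts` as plain 
data but partition on a pre-formatted string column, so the partition column's 
declared type matches the path segments Spark parses back. A sketch under that 
assumption (paths and values are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_unixtime;

public class TimestampPartitionWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    // One row standing in for the reproduction data above.
    Dataset<Row> df = spark.sql(
        "SELECT 'id1' AS id, 'name1' AS name, unix_millis(current_timestamp()) AS ts");

    df.withColumn("part_hr", from_unixtime(col("ts").divide(1000), "yyyyMMdd-HH"))
        .write()
        .format("hudi")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "name")
        // A plain string partition column: path and column types now agree.
        .option("hoodie.datasource.write.partitionpath.field", "part_hr")
        .option("hoodie.table.name", "hudi_cow2")
        .mode(SaveMode.Overwrite)
        .save("/tmp/hudi_cow2");

    spark.read().format("hudi").load("/tmp/hudi_cow2").show();
  }
}
```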

Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


rmahindra123 commented on code in PR #10674:
URL: https://github.com/apache/hudi/pull/10674#discussion_r1491428474


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java:
##
@@ -195,19 +195,18 @@ public void testWriteStatusContentsAfterCompaction() throws Exception {
   String newCommitTime = "100";
   writeClient.startCommitWithTime(newCommitTime);
 
-  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 100);
+  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 1000);
   JavaRDD<HoodieRecord> recordsRDD = jsc.parallelize(records, 1);
   writeClient.insert(recordsRDD, newCommitTime).collect();
 
   // Update all the 100 records

Review Comment:
   The idea is to ensure that the scan times land in the millisecond range as 
opposed to microseconds, hence the increased load.
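
Spelled out (an inference from the reply above, not from the PR itself): with a 
millisecond-granularity timer, scanning 100 tiny records can finish in well 
under 1 ms and be recorded as 0, so an assertion that the scan time is positive 
flakes. A toy demonstration of that granularity problem:

```java
public class TimerGranularityDemo {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    long acc = 0;
    for (int i = 0; i < 100; i++) {
      acc += i; // stands in for scanning 100 small records
    }
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.println("elapsedMs=" + elapsedMs + " (frequently 0)");
    System.out.println("acc=" + acc); // keeps the loop observable
  }
}
```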






Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946814107

   
   ## CI report:
   
   * 1b911520a42e187c1e4bb33345c630a1866bd375 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22459)
 
   
   



Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10676:
URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946813984

   
   ## CI report:
   
   * d5f38b26cede75b6d07367cb661f0fd20256e3e0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22457)
 
   
   



[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users

2024-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7413:
-
Labels: pull-request-available  (was: )

> Make Issues with schema easier to understand for users
> --
>
> Key: HUDI-7413
> URL: https://issues.apache.org/jira/browse/HUDI-7413
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Provide exceptions that classify issues with schema. Additionally, provide 
> users with a clear explanation of what is wrong.





Re: [PR] [HUDI-7413] make schema errors better [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10677:
URL: https://github.com/apache/hudi/pull/10677#issuecomment-1946787792

   
   ## CI report:
   
   * 1b911520a42e187c1e4bb33345c630a1866bd375 UNKNOWN
   
   



[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users

2024-02-15 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7413:
--
Status: Patch Available  (was: In Progress)

> Make Issues with schema easier to understand for users
> --
>
> Key: HUDI-7413
> URL: https://issues.apache.org/jira/browse/HUDI-7413
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>
> Provide exceptions that classify issues with schema. Additionally, provide 
> users with a clear explanation of what is wrong.





[jira] [Updated] (HUDI-7413) Make Issues with schema easier to understand for users

2024-02-15 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7413:
--
Status: In Progress  (was: Open)

> Make Issues with schema easier to understand for users
> --
>
> Key: HUDI-7413
> URL: https://issues.apache.org/jira/browse/HUDI-7413
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>
> Provide exceptions that classify issues with schema. Additionally, provide 
> users with a clear explanation of what is wrong.





[jira] [Comment Edited] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)

2024-02-15 Thread Haitham Eltaweel (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817743#comment-17817743
 ] 

Haitham Eltaweel edited comment on HUDI-7412 at 2/15/24 5:36 PM:
-

Update: the same error (OOM) also occurs when writing the DF using parquet 
format. Find a snapshot from Spark UI below: 
!image-2024-02-15-11-35-19-156.png!


was (Author: JIRAUSER301642):
Update: the same OOM error also occurs when writing using parquet format. Find 
a snapshot from Spark UI below: 
!image-2024-02-15-11-35-19-156.png!

> OOM error after upgrade to hudi 0.13 when writing big record (stream or batch 
> job)
> --
>
> Key: HUDI-7412
> URL: https://issues.apache.org/jira/browse/HUDI-7412
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
> Environment: Amazon EMR version emr-6.11.1
> Spark version 3.3.2 
> Hive version 3.1.3
> Hadoop version 3.3.3
> hudi version 0.13
>Reporter: Haitham Eltaweel
>Priority: Major
> Attachments: image-2024-02-15-11-35-19-156.png
>
>
> After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) 
> cannot be written to the destination location due to an OOM error, even after 
> increasing the Spark memory settings.
> Find the error details: java.lang.OutOfMemoryError: Java heap space.
> The error never happened when running the same job using hudi 0.11.
> Find below the use case details:
> Read one json file which has one record of 900MB from S3 source location, 
> transform the DF then write the output DF to S3 target location. When using 
> upsert hudi operation, the error happens at Tagging job ([mapToPair at 
> HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0])
>  and when using insert hudi operation, the error happens at Building workload 
> profile job. The error happens whether I run the job as Spark structured 
> streaming job or batch job.
> Find the batch job code snippet shared below. I obfuscated some values.
> from pyspark.sql import functions as f
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
>  
> def main():
>  
>     hudi_options = {
>         'hoodie.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
>         'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.keygenerator.class': 
> 'org.apache.hudi.keygen.CustomKeyGenerator',
>         'hoodie.datasource.write.recordkey.field': 'id',
>         'hoodie.datasource.write.precombine.field': 'ts',
>         'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
>         'hoodie.embed.timeline.server': False,
>         'hoodie.index.type': 'SIMPLE',
>         'hoodie.parquet.compression.codec': 'snappy',
>         'hoodie.clean.async': True,
>         'hoodie.parquet.max.file.size': 125829120,
>         'hoodie.parquet.small.file.limit': 104857600,
>         'hoodie.parquet.block.size': 125829120,
>         'hoodie.metadata.enable': True,
>         'hoodie.metadata.validate': True,
>         "hoodie.datasource.write.hive_style_partitioning": True,
>         'hoodie.datasource.hive_sync.support_timestamp': True,
>         "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
>         'hoodie.datasource.hive_sync.username': 'xxx',
>         'hoodie.datasource.hive_sync.password': 'xxx',
>         "hoodie.datasource.hive_sync.database": "xxx",
>         "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
>         "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
>         "hoodie.datasource.hive_sync.enable": True,
>         'hoodie.datasource.hive_sync.partition_extractor_class': 
> 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
>     }
>  
>     spark=SparkSession.builder.getOrCreate()
>  
>     inputPath = "s3://xxx/"
>  
>     transfomredDF = (
>         spark
>         .read
>         .text(inputPath, wholetext=True)
>         .select(f.date_format(f.current_timestamp(), 
> 'yyyyMMddHH').astype('string').alias('insert_hr'),
>                     f.col("value").alias("raw_data"),
>                     f.get_json_object(f.col("value"), "$._id").alias("id"),
>                     f.get_json_object(f.col("value"), 
> "$.metadata.createdDateTime").alias("ts"),
>                     f.input_file_name().alias("input_file_name"))
>     )
>  
>  
>  
>     s3_output_path = "s3://xxx/"
>     transfomredDF \
>     .write.format("hudi") \
>     .options(**hudi_options) \
>     .option('hoodie.datasource.write.operation', 'upsert') \
>     .save(s3_output_path,mode='append')
>  
> if __name__ == "__main__":
>     main()
>  
> Find the spark su

[jira] [Created] (HUDI-7413) Make Issues with schema easier to understand for users

2024-02-15 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7413:
-

 Summary: Make Issues with schema easier to understand for users
 Key: HUDI-7413
 URL: https://issues.apache.org/jira/browse/HUDI-7413
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


Provide exceptions that classify issues with schema. Additionally, provide 
users with a clear explanation of what is wrong.
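
A minimal sketch of what such a classified schema exception could look like; the 
name and message are illustrative, not the implementation in the linked PR.

```java
// Carry the failing field plus both types so the message itself explains
// what is wrong, instead of a bare serialization error.
public class SchemaCompatibilityException extends RuntimeException {
  public SchemaCompatibilityException(String field, String incomingType, String tableType) {
    super(String.format(
        "Incoming batch is not compatible with the table schema: field '%s' "
            + "has type %s in the incoming data but type %s in the table. "
            + "Enable schema evolution or fix the upstream producer.",
        field, incomingType, tableType));
  }
}
```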





[jira] [Commented] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)

2024-02-15 Thread Haitham Eltaweel (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817743#comment-17817743
 ] 

Haitham Eltaweel commented on HUDI-7412:


Update: the same OOM error also occurs when writing using parquet format. Find 
a snapshot from Spark UI below: 
!image-2024-02-15-11-35-19-156.png!

> OOM error after upgrade to hudi 0.13 when writing big record (stream or batch 
> job)
> --
>
> Key: HUDI-7412
> URL: https://issues.apache.org/jira/browse/HUDI-7412
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
> Environment: Amazon EMR version emr-6.11.1
> Spark version 3.3.2 
> Hive version 3.1.3
> Hadoop version 3.3.3
> hudi version 0.13
>Reporter: Haitham Eltaweel
>Priority: Major
> Attachments: image-2024-02-15-11-35-19-156.png
>
>
> After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) 
> cannot be written to the destination location due to an OOM error, even after 
> increasing the Spark memory settings.
> Find the error details: java.lang.OutOfMemoryError: Java heap space.
> The error never happened when running the same job using hudi 0.11.
> Find below the use case details:
> Read one json file which has one record of 900MB from S3 source location, 
> transform the DF then write the output DF to S3 target location. When using 
> upsert hudi operation, the error happens at Tagging job ([mapToPair at 
> HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0])
>  and when using insert hudi operation, the error happens at Building workload 
> profile job. The error happens whether I run the job as Spark structured 
> streaming job or batch job.
> Find the batch job code snippet shared below. I obfuscated some values.
> from pyspark.sql import functions as f
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
>  
> def main():
>  
>     hudi_options = {
>         'hoodie.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
>         'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.keygenerator.class': 
> 'org.apache.hudi.keygen.CustomKeyGenerator',
>         'hoodie.datasource.write.recordkey.field': 'id',
>         'hoodie.datasource.write.precombine.field': 'ts',
>         'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
>         'hoodie.embed.timeline.server': False,
>         'hoodie.index.type': 'SIMPLE',
>         'hoodie.parquet.compression.codec': 'snappy',
>         'hoodie.clean.async': True,
>         'hoodie.parquet.max.file.size': 125829120,
>         'hoodie.parquet.small.file.limit': 104857600,
>         'hoodie.parquet.block.size': 125829120,
>         'hoodie.metadata.enable': True,
>         'hoodie.metadata.validate': True,
>         "hoodie.datasource.write.hive_style_partitioning": True,
>         'hoodie.datasource.hive_sync.support_timestamp': True,
>         "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
>         'hoodie.datasource.hive_sync.username': 'xxx',
>         'hoodie.datasource.hive_sync.password': 'xxx',
>         "hoodie.datasource.hive_sync.database": "xxx",
>         "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
>         "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
>         "hoodie.datasource.hive_sync.enable": True,
>         'hoodie.datasource.hive_sync.partition_extractor_class': 
> 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
>     }
>  
>     spark=SparkSession.builder.getOrCreate()
>  
>     inputPath = "s3://xxx/"
>  
>     transfomredDF = (
>         spark
>         .read
>         .text(inputPath, wholetext=True)
>         .select(f.date_format(f.current_timestamp(), 
> 'yyyyMMddHH').astype('string').alias('insert_hr'),
>                     f.col("value").alias("raw_data"),
>                     f.get_json_object(f.col("value"), "$._id").alias("id"),
>                     f.get_json_object(f.col("value"), 
> "$.metadata.createdDateTime").alias("ts"),
>                     f.input_file_name().alias("input_file_name"))
>     )
>  
>  
>  
>     s3_output_path = "s3://xxx/"
>     transfomredDF \
>     .write.format("hudi") \
>     .options(**hudi_options) \
>     .option('hoodie.datasource.write.operation', 'upsert') \
>     .save(s3_output_path,mode='append')
>  
> if __name__ == "__main__":
>     main()
>  
> Find the spark submit command used :
> spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf 
> spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> --conf spark.serializer=org.apache.spark.serializer.KryoS

[jira] [Updated] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)

2024-02-15 Thread Haitham Eltaweel (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haitham Eltaweel updated HUDI-7412:
---
Attachment: image-2024-02-15-11-35-19-156.png

> OOM error after upgrade to hudi 0.13 when writing big record (stream or batch 
> job)
> --
>
> Key: HUDI-7412
> URL: https://issues.apache.org/jira/browse/HUDI-7412
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
> Environment: Amazon EMR version emr-6.11.1
> Spark version 3.3.2 
> Hive version 3.1.3
> Hadoop version 3.3.3
> hudi version 0.13
>Reporter: Haitham Eltaweel
>Priority: Major
> Attachments: image-2024-02-15-11-35-19-156.png
>
>
> After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) 
> cannot be written to the destination location due to an OOM error, even after 
> increasing the Spark memory settings.
> Find the error details: java.lang.OutOfMemoryError: Java heap space.
> The error never happened when running the same job using hudi 0.11.
> Find below the use case details:
> Read one json file which has one record of 900MB from S3 source location, 
> transform the DF then write the output DF to S3 target location. When using 
> upsert hudi operation, the error happens at Tagging job ([mapToPair at 
> HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0])
>  and when using insert hudi operation, the error happens at Building workload 
> profile job. The error happens whether I run the job as Spark structured 
> streaming job or batch job.
> Find the batch job code snippet shared below. I obfuscated some values.
> from pyspark.sql import functions as f
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
>  
> def main():
>  
>     hudi_options = {
>         'hoodie.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
>         'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.keygenerator.class': 
> 'org.apache.hudi.keygen.CustomKeyGenerator',
>         'hoodie.datasource.write.recordkey.field': 'id',
>         'hoodie.datasource.write.precombine.field': 'ts',
>         'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
>         'hoodie.embed.timeline.server': False,
>         'hoodie.index.type': 'SIMPLE',
>         'hoodie.parquet.compression.codec': 'snappy',
>         'hoodie.clean.async': True,
>         'hoodie.parquet.max.file.size': 125829120,
>         'hoodie.parquet.small.file.limit': 104857600,
>         'hoodie.parquet.block.size': 125829120,
>         'hoodie.metadata.enable': True,
>         'hoodie.metadata.validate': True,
>         "hoodie.datasource.write.hive_style_partitioning": True,
>         'hoodie.datasource.hive_sync.support_timestamp': True,
>         "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
>         'hoodie.datasource.hive_sync.username': 'xxx',
>         'hoodie.datasource.hive_sync.password': 'xxx',
>         "hoodie.datasource.hive_sync.database": "xxx",
>         "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
>         "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
>         "hoodie.datasource.hive_sync.enable": True,
>         'hoodie.datasource.hive_sync.partition_extractor_class': 
> 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
>     }
>  
>     spark=SparkSession.builder.getOrCreate()
>  
>     inputPath = "s3://xxx/"
>  
>     transfomredDF = (
>         spark
>         .read
>         .text(inputPath, wholetext=True)
>         .select(f.date_format(f.current_timestamp(), 
> 'yyyyMMddHH').astype('string').alias('insert_hr'),
>                     f.col("value").alias("raw_data"),
>                     f.get_json_object(f.col("value"), "$._id").alias("id"),
>                     f.get_json_object(f.col("value"), 
> "$.metadata.createdDateTime").alias("ts"),
>                     f.input_file_name().alias("input_file_name"))
>     )
>  
>  
>  
>     s3_output_path = "s3://xxx/"
>     transfomredDF \
>     .write.format("hudi") \
>     .options(**hudi_options) \
>     .option('hoodie.datasource.write.operation', 'upsert') \
>     .save(s3_output_path,mode='append')
>  
> if __name__ == "__main__":
>     main()
>  
> Find the spark submit command used :
> spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf 
> spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
> spark.kryoserializer.buffer.max=512 --num-executors 5 --executor-cores 3 
> --executor-memory 10g --driver-memory 30g --name big_file_bat

[PR] make schema errors better [hudi]

2024-02-15 Thread via GitHub


jonvex opened a new pull request, #10677:
URL: https://github.com/apache/hudi/pull/10677

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big file (stream or batch job)

2024-02-15 Thread Haitham Eltaweel (Jira)
Haitham Eltaweel created HUDI-7412:
--

 Summary: OOM error after upgrade to hudi 0.13 when writing big 
file (stream or batch job)
 Key: HUDI-7412
 URL: https://issues.apache.org/jira/browse/HUDI-7412
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
 Environment: Amazon EMR version emr-6.11.1
Spark version 3.3.2 
Hive version 3.1.3
Hadoop version 3.3.3
hudi version 0.13
Reporter: Haitham Eltaweel


After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) 
cannot be written to the destination location due to an OOM error, even after 
increasing the Spark memory settings.

Find the error details: java.lang.OutOfMemoryError: Java heap space.

The error never happened when running the same job using hudi 0.11.


Find below the use case details:
Read one json file which has one record of 900MB from S3 source location, 
transform the DF then write the output DF to S3 target location. When using 
upsert hudi operation, the error happens at Tagging job ([mapToPair at 
HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0])
 and when using insert hudi operation, the error happens at Building workload 
profile job. The error happens whether I run the job as Spark structured 
streaming job or batch job.


Find the batch job code snippet shared below. I obfuscated some values.



from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
def main():
 
    hudi_options = {
        'hoodie.table.name': 'hudi_streaming_reco',
        'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
        'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
        'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.CustomKeyGenerator',
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
        'hoodie.embed.timeline.server': False,
        'hoodie.index.type': 'SIMPLE',
        'hoodie.parquet.compression.codec': 'snappy',
        'hoodie.clean.async': True,
        'hoodie.parquet.max.file.size': 125829120,
        'hoodie.parquet.small.file.limit': 104857600,
        'hoodie.parquet.block.size': 125829120,
        'hoodie.metadata.enable': True,
        'hoodie.metadata.validate': True,
        "hoodie.datasource.write.hive_style_partitioning": True,
        'hoodie.datasource.hive_sync.support_timestamp': True,
        "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
        'hoodie.datasource.hive_sync.username': 'xxx',
        'hoodie.datasource.hive_sync.password': 'xxx',
        "hoodie.datasource.hive_sync.database": "xxx",
        "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
        "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
        "hoodie.datasource.hive_sync.enable": True,
        'hoodie.datasource.hive_sync.partition_extractor_class': 
'org.apache.hudi.hive.MultiPartKeysValueExtractor'
    }
 
    spark=SparkSession.builder.getOrCreate()
 
    inputPath = "s3://xxx/"
 
    transfomredDF = (
        spark
        .read
        .text(inputPath, wholetext=True)
        .select(f.date_format(f.current_timestamp(), 
'yyyyMMddHH').astype('string').alias('insert_hr'),
                    f.col("value").alias("raw_data"),
                    f.get_json_object(f.col("value"), "$._id").alias("id"),
                    f.get_json_object(f.col("value"), 
"$.metadata.createdDateTime").alias("ts"),
                    f.input_file_name().alias("input_file_name"))
    )
 
 
 
    s3_output_path = "s3://xxx/"
    transfomredDF \
    .write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .save(s3_output_path,mode='append')
 
if __name__ == "__main__":
    main()
 
Find the spark submit command used :
spark-submit --master yarn --conf spark.driver.userClassPathFirst=true --conf 
spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.kryoserializer.buffer.max=512 --num-executors 5 --executor-cores 3 
--executor-memory 10g --driver-memory 30g --name big_file_batch --queue 
casualty --deploy-mode cluster big_record_test.py





[jira] [Updated] (HUDI-7412) OOM error after upgrade to hudi 0.13 when writing big record (stream or batch job)

2024-02-15 Thread Haitham Eltaweel (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haitham Eltaweel updated HUDI-7412:
---
Summary: OOM error after upgrade to hudi 0.13 when writing big record 
(stream or batch job)  (was: OOM error after upgrade to hudi 0.13 when writing 
big file (stream or batch job))

> OOM error after upgrade to hudi 0.13 when writing big record (stream or batch 
> job)
> --
>
> Key: HUDI-7412
> URL: https://issues.apache.org/jira/browse/HUDI-7412
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
> Environment: Amazon EMR version emr-6.11.1
> Spark version 3.3.2 
> Hive version 3.1.3
> Hadoop version 3.3.3
> hudi version 0.13
>Reporter: Haitham Eltaweel
>Priority: Major
>
> After upgrading from hudi 0.11 to hudi 0.13, big records (larger than 200MB) 
> cannot be written to the destination location due to an OOM error, even after 
> increasing the Spark memory settings.
> Find the error details: java.lang.OutOfMemoryError: Java heap space.
> The error never happened when running the same job using hudi 0.11.
> Find below the use case details:
> Read one json file which has one record of 900MB from S3 source location, 
> transform the DF then write the output DF to S3 target location. When using 
> upsert hudi operation, the error happens at Tagging job ([mapToPair at 
> HoodieJavaRDD.java:135|http://ip-10-18-73-98.ec2.internal:20888/proxy/application_1705084455183_108018/stages/stage/?id=2&attempt=0])
>  and when using insert hudi operation, the error happens at Building workload 
> profile job. The error happens whether I run the job as Spark structured 
> streaming job or batch job.
> Find the batch job code snippet shared below. I obfuscated some values.
> from pyspark.sql import functions as f
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
>
> def main():
>     hudi_options = {
>         'hoodie.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
>         'hoodie.datasource.write.table.name': 'hudi_streaming_reco',
>         'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
>         'hoodie.datasource.write.recordkey.field': 'id',
>         'hoodie.datasource.write.precombine.field': 'ts',
>         'hoodie.datasource.write.partitionpath.field': 'insert_hr:SIMPLE',
>         'hoodie.embed.timeline.server': False,
>         'hoodie.index.type': 'SIMPLE',
>         'hoodie.parquet.compression.codec': 'snappy',
>         'hoodie.clean.async': True,
>         'hoodie.parquet.max.file.size': 125829120,
>         'hoodie.parquet.small.file.limit': 104857600,
>         'hoodie.parquet.block.size': 125829120,
>         'hoodie.metadata.enable': True,
>         'hoodie.metadata.validate': True,
>         "hoodie.datasource.write.hive_style_partitioning": True,
>         'hoodie.datasource.hive_sync.support_timestamp': True,
>         "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://xx:x",
>         'hoodie.datasource.hive_sync.username': 'xxx',
>         'hoodie.datasource.hive_sync.password': 'xxx',
>         "hoodie.datasource.hive_sync.database": "xxx",
>         "hoodie.datasource.hive_sync.table": "hudi_streaming_reco",
>         "hoodie.datasource.hive_sync.partition_fields": "insert_hr",
>         "hoodie.datasource.hive_sync.enable": True,
>         'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
>     }
>
>     spark = SparkSession.builder.getOrCreate()
>
>     inputPath = "s3://xxx/"
>
>     transformedDF = (
>         spark
>         .read
>         .text(inputPath, wholetext=True)
>         .select(f.date_format(f.current_timestamp(), 'MMddHH').astype('string').alias('insert_hr'),
>                 f.col("value").alias("raw_data"),
>                 f.get_json_object(f.col("value"), "$._id").alias("id"),
>                 f.get_json_object(f.col("value"), "$.metadata.createdDateTime").alias("ts"),
>                 f.input_file_name().alias("input_file_name"))
>     )
>
>     s3_output_path = "s3://xxx/"
>     transformedDF \
>         .write.format("hudi") \
>         .options(**hudi_options) \
>         .option('hoodie.datasource.write.operation', 'upsert') \
>         .save(s3_output_path, mode='append')
>
> if __name__ == "__main__":
>     main()
>
>  
> Find the Spark submit command used:
> spark-submit --master yarn --conf spark.driver.userClassPathFirst=true \
>   --conf spark.jars.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf spark.kryoserializer.buffer.max=512 --num-executors 5 --exe

Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


linliu-code commented on code in PR #10674:
URL: https://github.com/apache/hudi/pull/10674#discussion_r1491251501


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestHoodieCompactor.java:
##
@@ -195,19 +195,18 @@ public void testWriteStatusContentsAfterCompaction() throws Exception {
   String newCommitTime = "100";
   writeClient.startCommitWithTime(newCommitTime);
 
-  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 100);
+  List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 1000);
   JavaRDD<HoodieRecord> recordsRDD = jsc.parallelize(records, 1);
   writeClient.insert(recordsRDD, newCommitTime).collect();
 
   // Update all the 100 records

Review Comment:
   100 -> 1000?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


jonvex commented on code in PR #10673:
URL: https://github.com/apache/hudi/pull/10673#discussion_r1491247516


##
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java:
##
@@ -238,7 +239,7 @@ private static HFileReader createReader(String hFilePath, FileSystem fileSystem)
   LOG.info("Opening HFile for reading :" + hFilePath);
   Path path = new Path(hFilePath);
   long fileSize = fileSystem.getFileStatus(path).getLen();
-  FSDataInputStream stream = fileSystem.open(path);
+  SeekableDataInputStream stream = new HadoopSeekableDataInputStream(fileSystem.open(path));

Review Comment:
   Is this going to be HadoopSeekableDataInputStream going forward? Or is 
hadoop going to be fully removed from here at some point?
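
   For context, here is a minimal, self-contained sketch of the adapter pattern this diff introduces. The class names below are illustrative stand-ins, not Hudi's actual implementation:

   ```java
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.io.InputStream;

   import org.apache.hadoop.fs.FSDataInputStream;

   // Engine-neutral facade: a DataInputStream that can also seek and report
   // its position. A native HFile reader coded against this contract no
   // longer depends on Hadoop stream types directly.
   abstract class SeekableStreamSketch extends DataInputStream {
     SeekableStreamSketch(InputStream in) {
       super(in);
     }

     public abstract long getPos() throws IOException;

     public abstract void seek(long pos) throws IOException;
   }

   // Hadoop-backed adapter: delegates positioning to the wrapped
   // FSDataInputStream, mirroring the
   // `new HadoopSeekableDataInputStream(fileSystem.open(path))` call above.
   class HadoopSeekableStreamSketch extends SeekableStreamSketch {
     private final FSDataInputStream stream;

     HadoopSeekableStreamSketch(FSDataInputStream stream) {
       super(stream);
       this.stream = stream;
     }

     @Override
     public long getPos() throws IOException {
       return stream.getPos();
     }

     @Override
     public void seek(long pos) throws IOException {
       stream.seek(pos);
     }
   }
   ```

   Other storage backends can then supply their own subclass, which is one plausible path to eventually removing the Hadoop dependency the comment asks about.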



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10676:
URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946401504

   
   ## CI report:
   
   * d5f38b26cede75b6d07367cb661f0fd20256e3e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22457)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Athena does not support s3a partition scheme anymore leading to missing data [hudi]

2024-02-15 Thread via GitHub


codope closed issue #10595: [SUPPORT] Athena does not support s3a partition 
scheme anymore leading to missing data
URL: https://github.com/apache/hudi/issues/10595


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7362] Fix hudi partition base path scheme to s3 (#10596)

2024-02-15 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6e6d66a7097 [HUDI-7362] Fix hudi partition base path scheme to s3 
(#10596)
6e6d66a7097 is described below

commit 6e6d66a70973a78d4f155bf13860c65565402930
Author: Nicolas Paris 
AuthorDate: Thu Feb 15 16:55:27 2024 +0100

[HUDI-7362] Fix hudi partition base path scheme to s3 (#10596)
---
 .../main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
index b814e353583..1e19b44a499 100644
--- a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
+++ b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
@@ -199,7 +199,7 @@ public class AWSGlueCatalogSyncClient extends HoodieSyncClient {
   Table table = getTable(awsGlue, databaseName, tableName);
   StorageDescriptor sd = table.storageDescriptor();
   List<PartitionInput> partitionInputs = partitionsToAdd.stream().map(partition -> {
-    String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString();
+    String fullPartitionPath = FSUtils.getPartitionPath(s3aToS3(getBasePath()), partition).toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
     StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath));
     return PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build();
@@ -242,7 +242,7 @@ public class AWSGlueCatalogSyncClient extends HoodieSyncClient {
   Table table = getTable(awsGlue, databaseName, tableName);
   StorageDescriptor sd = table.storageDescriptor();
   List<BatchUpdatePartitionRequestEntry> updatePartitionEntries = changedPartitions.stream().map(partition -> {
-    String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString();
+    String fullPartitionPath = FSUtils.getPartitionPath(s3aToS3(getBasePath()), partition).toString();
     List<String> partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition);
     StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath));
     PartitionInput partitionInput = PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build();
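
The commit above routes the base path through an s3aToS3 helper defined elsewhere in the class. As a hedged illustration of what such a scheme rewrite can look like (the actual Hudi implementation may differ; all names below are illustrative):

```java
public class S3SchemeSketch {

  // Rewrites only the URI scheme; bucket and key are left untouched. This
  // matters because engines such as Athena no longer resolve s3a:// partition
  // locations registered in the Glue catalog.
  static String s3aToS3(String basePath) {
    return basePath.replaceFirst("^s3a://", "s3://");
  }

  public static void main(String[] args) {
    // Prints: s3://bucket/warehouse/hudi_table
    System.out.println(s3aToS3("s3a://bucket/warehouse/hudi_table"));
  }
}
```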



Re: [PR] [HUDI-7362] Fix hudi partition base path scheme to s3 [hudi]

2024-02-15 Thread via GitHub


codope merged PR #10596:
URL: https://github.com/apache/hudi/pull/10596


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10676:
URL: https://github.com/apache/hudi/pull/10676#issuecomment-1946386360

   
   ## CI report:
   
   * d5f38b26cede75b6d07367cb661f0fd20256e3e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7411) Meta sync does not consider clean commits while syncing partitions

2024-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7411:
-
Labels: pull-request-available  (was: )

> Meta sync does not consider clean commits while syncing partitions
> --
>
> Key: HUDI-7411
> URL: https://issues.apache.org/jira/browse/HUDI-7411
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.2
>
>
> The cleaner can delete entire partitions, but meta sync fails to drop the 
> partition from the catalog in that case. This can cause queries from engines 
> that depend on the catalog to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7411] Meta sync should consider cleaner commit [hudi]

2024-02-15 Thread via GitHub


codope opened a new pull request, #10676:
URL: https://github.com/apache/hudi/pull/10676

   ### Change Logs
   
   The cleaner can delete entire partitions, but meta sync fails to drop the partition from the catalog in that case. This can cause queries from engines that depend on the catalog to fail.
   
   TODO: I have only tested locally. I am going to add a test. 
   
   ### Impact
   
   The catalog will reflect correct partition metadata, taking cleaner commits into account.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7411) Meta sync does not consider clean commits while syncing partitions

2024-02-15 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7411:
-

 Summary: Meta sync does not consider clean commits while syncing 
partitions
 Key: HUDI-7411
 URL: https://issues.apache.org/jira/browse/HUDI-7411
 Project: Apache Hudi
  Issue Type: Task
  Components: meta-sync
Reporter: Sagar Sumit
 Fix For: 1.0.0, 0.14.2


The cleaner can delete entire partitions, but meta sync fails to drop the 
partition from the catalog in that case. This can cause queries from engines 
that depend on the catalog to fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning (#10651)

2024-02-15 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f29811b1a4c [HUDI-7104] Fixing cleaner savepoint interplay to fix edge 
case with incremental cleaning (#10651)
f29811b1a4c is described below

commit f29811b1a4ca9121a5124d63ded147dba7b90b93
Author: Sivabalan Narayanan 
AuthorDate: Thu Feb 15 05:16:41 2024 -0800

[HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with 
incremental cleaning (#10651)

* Fixing incremental cleaning with savepoint

* Addressing feedback
---
 .../table/action/clean/CleanActionExecutor.java|   3 +-
 .../action/clean/CleanPlanActionExecutor.java  |  12 +-
 .../hudi/table/action/clean/CleanPlanner.java  | 116 --
 .../apache/hudi/table/action/TestCleanPlanner.java | 247 -
 .../hudi/utils/TestMetadataConversionUtils.java|   4 +-
 .../functional/TestExternalPathHandling.java   |   5 +-
 .../java/org/apache/hudi/table/TestCleaner.java|   7 +-
 .../testutils/HoodieSparkClientTestHarness.java|   4 +-
 hudi-common/src/main/avro/HoodieCleanMetadata.avsc |  11 +-
 hudi-common/src/main/avro/HoodieCleanerPlan.avsc   |  11 +-
 .../clean/CleanPlanV1MigrationHandler.java |   3 +-
 .../clean/CleanPlanV2MigrationHandler.java |   3 +-
 .../org/apache/hudi/common/util/CleanerUtils.java  |   5 +-
 .../table/view/TestIncrementalFSViewSync.java  |   2 +-
 .../hudi/common/testutils/HoodieTestTable.java |   8 +-
 .../hudi/common/util/TestClusteringUtils.java  |   6 +-
 16 files changed, 395 insertions(+), 52 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
index 40d91b63394..61c0eeeffb0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
@@ -219,7 +219,8 @@ public class CleanActionExecutor extends BaseActionExecutor {
 
   private static final Logger LOG = LoggerFactory.getLogger(CleanPlanActionExecutor.class);
-
   private final Option<Map<String, String>> extraMetadata;
 
   public CleanPlanActionExecutor(HoodieEngineContext context,
@@ -142,12 +142,20 @@ public class CleanPlanActionExecutor extends BaseActionExecutor {
           .map(x -> new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())).orElse(null),
           planner.getLastCompletedCommitTimestamp(),
           config.getCleanerPolicy().name(), Collections.emptyMap(),
-          CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, partitionsToDelete);
+          CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, partitionsToDelete, prepareExtraMetadata(planner.getSavepointedTimestamps()));
     } catch (IOException e) {
       throw new HoodieIOException("Failed to schedule clean operation", e);
     }
   }
 
+  private Map<String, String> prepareExtraMetadata(List<String> savepointedTimestamps) {
+    if (savepointedTimestamps.isEmpty()) {
+      return Collections.emptyMap();
+    } else {
+      return Collections.singletonMap(SAVEPOINTED_TIMESTAMPS, savepointedTimestamps.stream().collect(Collectors.joining(",")));
+    }
+  }
+
   /**
    * Creates a Cleaner plan if there are files to be cleaned and stores them in instant file.
    * Cleaner Plan contains absolute file paths.
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
index 0dd516a88d1..19cbe0f91a7 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
@@ -41,6 +41,7 @@ import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
 import org.apache.hudi.common.table.view.SyncableFileSystemView;
 import org.apache.hudi.common.util.CleanerUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieIOException;
@@ -55,6 +56,7 @@ import java.io.IOException;
 import java.io.Serializable;
 import java.time.Instant;
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.Collections;
 import java.util.Iterator;
 import java.util.List;
@@ -78,6 +80,7 @@ public class CleanPlanner implements Serializable {
   public static final Integer CLEAN_PLAN_VERSION_1 = CleanPlanV1MigrationHandler.VERSION;
   
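
To make the new metadata plumbing concrete, here is a self-contained sketch of the round trip this change sets up: savepointed instants are comma-joined into the clean plan's extra metadata and can be split back out when the plan is read. The join logic mirrors the hunk above; the key's literal value and all class names are assumptions for illustration only:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SavepointMetadataSketch {

  // Assumed literal: the diff references a SAVEPOINTED_TIMESTAMPS constant
  // whose actual value is not shown in this excerpt.
  static final String SAVEPOINTED_TIMESTAMPS = "savepointed_timestamps";

  // Mirrors prepareExtraMetadata in the hunk above: comma-join the
  // savepointed instants into a single metadata entry.
  static Map<String, String> prepareExtraMetadata(List<String> savepointedTimestamps) {
    if (savepointedTimestamps.isEmpty()) {
      return Collections.emptyMap();
    }
    return Collections.singletonMap(SAVEPOINTED_TIMESTAMPS,
        savepointedTimestamps.stream().collect(Collectors.joining(",")));
  }

  // Inverse direction: recover the instants when the plan is read back.
  static List<String> parseSavepointedTimestamps(Map<String, String> extraMetadata) {
    String joined = extraMetadata.getOrDefault(SAVEPOINTED_TIMESTAMPS, "");
    return joined.isEmpty() ? Collections.emptyList() : Arrays.asList(joined.split(","));
  }

  public static void main(String[] args) {
    Map<String, String> meta = prepareExtraMetadata(Arrays.asList("20240210101010000", "20240212121212000"));
    System.out.println(meta);
    System.out.println(parseSavepointedTimestamps(meta));
  }
}
```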

Re: [PR] [HUDI-7104] Fixing cleaner savepoint interplay to fix edge case with incremental cleaning [hudi]

2024-02-15 Thread via GitHub


codope merged PR #10651:
URL: https://github.com/apache/hudi/pull/10651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Unable to insert record into Hudi table using Hudi Spark Connector through Golang [hudi]

2024-02-15 Thread via GitHub


Shekkylar opened a new issue, #10675:
URL: https://github.com/apache/hudi/issues/10675

   ## Issue Summary
   
   Encountering challenges while integrating the Hudi Spark Connector with 
Golang. Insert, update, and upsert queries are resulting in errors, while 
create table and select queries work without issues.
   
   ## Environment
   
   - Java 8 
   - EMR cluster emr-version-7.0
   - Spark version 3.5.0
   - Spark Connector server started on port 15002 
   - Golang-v1.21.7 used to connect to Spark locally via SSH tunneling
   - Glue meta store for catalog
   
   
   ### Start Spark Server Command
   
   Executed the following command on the EMR cluster, from the Spark installation directory:
   
   ```bash
   cd /usr/lib/spark
   ./sbin/start-connect-server.sh \
     --packages org.apache.spark:spark-connect_2.12:3.5.0 \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
     --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
     --conf "spark.sql.catalog.aws.glue.sync.tool.classes=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"
   ```

   ### Golang Code Snippet
   
   ```go
   package main

   import (
       "fmt"
       "log"

       "github.com/apache/spark-connect-go/v34/client/sql"
   )

   func main() {
       remote := "sc://localhost:8157"
       spark, err := sql.SparkSession.Builder.Remote(remote).Build()
       if err != nil {
           fmt.Println(err)
           log.Fatal("Failed to connect to Spark:", err)
       }

       // Example SQL query to show all tables
       query := "SHOW TABLES"
       alltab, err := spark.Sql(query)
       if err != nil {
           log.Fatal("Failed to execute SQL query:", err)
       }

       // Show the result
       alltab.Show(10, true)

       // Create the Hudi table with the basic schema
       _, err = spark.Sql(`create table hudi_table (
           id bigint,
           name string,
           dt string
         ) using hudi
         LOCATION "s3://spark-hudi-table/output/"
         TBLPROPERTIES (
           type = "cow",
           primaryKey = "id"
         )
         partitioned by (dt);`)
       if err != nil {
           fmt.Println("failed to Create", err)
       }

       // Insert data into the Hudi table
       _, err = spark.Sql(`insert into default.hudi_table (id, name, dt) VALUES (1, 'test 1', '2023-11-11'), (2, 'test 2', '2023-11-12');`)
       if err != nil {
           fmt.Println("Failed to insert data into Hudi table:", err)
       }

       // Query the Hudi table
       result, err := spark.Sql("SELECT * FROM hudi_table")
       if err != nil {
           fmt.Println("Failed to query Hudi table:", err)
       }

       // Show the result
       result.Show(10, true)

       // Stop the Spark session
       spark.Stop()
   }
   ```
   
   ### Issue Details

   While executing the insert query, the Spark job fails with the following error, taken from the Spark Connect server logs:

   ```
   24/02/15 12:43:14 INFO Javalin: Starting Javalin ...
   24/02/15 12:43:14 INFO Javalin: You are running Javalin 4.6.7 (released 
October 24, 2022. Your Javalin version is 479 days old. Consider checking for a 
newer version.).
   24/02/15 12:43:14 INFO Javalin: Listening on http://localhost:39459/
   24/02/15 12:43:14 INFO Javalin: Javalin started in 151ms \o/
   24/02/15 12:43:14 INFO CodeGenerator: Code generated in 14.79973 ms
   24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:15 INFO MultipartUploadOutputStream: close closed:false 
s3://spark-hudi-table/output/.hoodie/20240215124314041.commit.requested
   24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading
   24/02/15 12:43:16 INFO MultipartUploadOutputStream: close closed:false 
s3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties
   24/02/15 12:43:17 INFO S3NativeFileSystem: Opening 
's3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties' for 
reading
   24/02/15 12:43:17 INFO SparkContext: Starting job: Spark Connect - 
session_id: "66f20158-e2df-4941-b6f4-4565c534143b"
   user_context {
 user_i

Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1946043346

   
   ## CI report:
   
   * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10673:
URL: https://github.com/apache/hudi/pull/10673#issuecomment-1946043277

   
   ## CI report:
   
   * 181c55e683edcc1743c39e955433c3bc24976883 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22454)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix zookeeper session expiration bug [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10671:
URL: https://github.com/apache/hudi/pull/10671#issuecomment-1945950025

   
   ## CI report:
   
   * 004644210da7a22dc129147a18a147869cf220f2 UNKNOWN
   * 54a3aa144f76a8ec31e9f0493cfcdcf4eca8802d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1945746116

   
   ## CI report:
   
   * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22455)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10674:
URL: https://github.com/apache/hudi/pull/10674#issuecomment-1945733022

   
   ## CI report:
   
   * 5e09204bc2a3378a0cc248c8c87499c7c1a6ce86 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10673:
URL: https://github.com/apache/hudi/pull/10673#issuecomment-1945732958

   
   ## CI report:
   
   * 181c55e683edcc1743c39e955433c3bc24976883 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22454)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7410] Use SeekableDataInputStream as the input of native HFile reader [hudi]

2024-02-15 Thread via GitHub


hudi-bot commented on PR #10673:
URL: https://github.com/apache/hudi/pull/10673#issuecomment-1945720924

   
   ## CI report:
   
   * 181c55e683edcc1743c39e955433c3bc24976883 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] High runtime for a batch in SparkWriteHelper stage [hudi]

2024-02-15 Thread via GitHub


devjain47 commented on issue #6014:
URL: https://github.com/apache/hudi/issues/6014#issuecomment-1945693518

   @ad1happy2go, almost 20 GB of data is present.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7381] Fix flaky test introduced in PR 10619 [hudi]

2024-02-15 Thread via GitHub


rmahindra123 opened a new pull request, #10674:
URL: https://github.com/apache/hudi/pull/10674

   ### Change Logs
   
   Fix flaky test introduced in PR https://github.com/apache/hudi/pull/10619
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   Medium 
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


