[I] [SUPPORT] HoodieMultiTableDeltaStreamer does not work as expected [hudi]

2023-12-04 Thread via GitHub


nttq1sub opened a new issue, #10246:
URL: https://github.com/apache/hudi/issues/10246

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   When I run HoodieMultiTableDeltaStreamer with spark-on-k8s-operator, it runs, 
but only one of the tables ends up with data. How should I configure it to work 
correctly with two or more tables? And how does it work on a k8s cluster: does 
one driver handle multiple tables, or is there one driver per table? Does the 
driver process tables sequentially or in parallel? Thanks so much to anyone who 
can explain these points for me.
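   For context, HoodieMultiTableDeltaStreamer runs all configured tables inside 
a single Spark driver and, in the 0.13.x line, ingests them one after another 
in a loop rather than in parallel. A minimal sketch of the multi-table 
properties (the database name, table names, and paths below are placeholders, 
not taken from this issue):

```properties
# Common properties file passed via --props (hypothetical paths/names)
hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,db1.table2
hoodie.deltastreamer.ingestion.targetBasePath=hdfs:///data/hudi

# Per-table overrides (source topic, key fields, etc.) live in separate files
hoodie.deltastreamer.ingestion.db1.table1.configFile=hdfs:///configs/table1.properties
hoodie.deltastreamer.ingestion.db1.table2.configFile=hdfs:///configs/table2.properties
```

   If only one table receives data, checking that every table is listed in 
`tablesToBeIngested` and that each per-table config file resolves is a 
reasonable first step.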
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.2
   
   * Hive version : 3.1.1
   
   * Hadoop version : 2.3.0
   
   * Storage (HDFS/S3/GCS..) : hdfs
   
   * Running on Docker? (yes/no) : Kubernetes
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Error when overwrite and synchronize hive metastore with 0.14.0 [hudi]

2023-12-04 Thread via GitHub


xicm commented on issue #10170:
URL: https://github.com/apache/hudi/issues/10170#issuecomment-1840178727

   The infer function has been fixed, https://github.com/apache/hudi/pull/9816





[jira] [Updated] (HUDI-7078) Re-enable one test in TestNestedSchemaPruningOptimization

2023-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7078:
-
Labels: pull-request-available  (was: )

> Re-enable one test in TestNestedSchemaPruningOptimization
> -
>
> Key: HUDI-7078
> URL: https://issues.apache.org/jira/browse/HUDI-7078
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>
> Currently "Test NestedSchemaPruning optimization unsuccessful" is disabled.  
> We need to triage the issue with new file format and file group reader and 
> re-enable it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7078] Re-enable TestNestedSchemaPruningOptimization [hudi]

2023-12-04 Thread via GitHub


linliu-code opened a new pull request, #10245:
URL: https://github.com/apache/hudi/pull/10245

   ### Change Logs
   
   Just try to re-enable the test.
   
   ### Impact
   
   Fixing the bugs.
   
   ### Risk level (write none, low medium or high below)
   
   LOW
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] Error when overwrite and synchronize hive metastore with 0.14.0 [hudi]

2023-12-04 Thread via GitHub


xicm commented on issue #10170:
URL: https://github.com/apache/hudi/issues/10170#issuecomment-1840157102

   `META_SYNC_DATABASE_NAME` is inferred from `hoodie.database.name`. You can 
check whether the property `hoodie.table.database` in hoodie.properties is empty.
   
   Nice avatar. :)





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840147988

   
   ## CI report:
   
   * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN
   * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197)
 
   * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302)
 
   * 82bf22c1772449bf32bdd9c98d72c273cd938487 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21306)
 
   * ce28d60f139e8baac53710b32774d38677abf37f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840139680

   
   ## CI report:
   
   * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN
   * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197)
 
   * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302)
 
   * 82bf22c1772449bf32bdd9c98d72c273cd938487 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21306)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10244:
URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840140053

   
   ## CI report:
   
   * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303)
 
   * 3763acfffaf1ac4760865f11edeb1cf91a91942c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21308)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10242:
URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840139973

   
   ## CI report:
   
   * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21298)
 
   * b23c47f2029f3c9cbeecc3608c6dc00c2af684e9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1840139738

   
   ## CI report:
   
   * 5e6fbb9988501485e6b66964f8dff66e8f0d4e50 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21264)
 
   * 574d9561fdf35a76412a1f1d968b0588be2454f9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21307)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10244:
URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840131352

   
   ## CI report:
   
   * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303)
 
   * 3763acfffaf1ac4760865f11edeb1cf91a91942c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10176:
URL: https://github.com/apache/hudi/pull/10176#issuecomment-1840131072

   
   ## CI report:
   
   * 5e6fbb9988501485e6b66964f8dff66e8f0d4e50 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21264)
 
   * 574d9561fdf35a76412a1f1d968b0588be2454f9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840131015

   
   ## CI report:
   
   * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN
   * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197)
 
   * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302)
 
   * 82bf22c1772449bf32bdd9c98d72c273cd938487 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1840130915

   
   ## CI report:
   
   * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
 
   * 3252779dc0eecf1c7b455125f6ca116d540efed9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21305)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]

2023-12-04 Thread via GitHub


wecharyu commented on PR #10242:
URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840126819

   cc: @boneanxs @danny0405 





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840123143

   
   ## CI report:
   
   * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN
   * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197)
 
   * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1840123043

   
   ## CI report:
   
   * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
 
   * 3252779dc0eecf1c7b455125f6ca116d540efed9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7159] Check the table type between hoodie.properties and table options [hudi]

2023-12-04 Thread via GitHub


hehuiyuan commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1414953701


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   > Hmm, maybe we just fix the table type as to be in line with the 
hoodie.properties when there is inconsistency instead of throwing, WDYT ?
   
   
   Hi @danny0405, that's OK.
   
   But when an inconsistency occurs, users may not be aware of it.
   
   If you recommend this way, I will fix the table type to be in line with 
hoodie.properties.



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   It's OK.
   But when an inconsistency occurs, users may not be aware of it.
   
   If you recommend this way, I will fix the table type to be in line with 
hoodie.properties.






(hudi) branch master updated: [MINOR] Fixing view manager reuse with Embedded timeline server (#10240)

2023-12-04 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 70a2064525a [MINOR] Fixing view manager reuse with Embedded timeline 
server (#10240)
70a2064525a is described below

commit 70a2064525a26abc57a33c019da2ccb520182ef5
Author: Sivabalan Narayanan 
AuthorDate: Mon Dec 4 22:45:39 2023 -0800

[MINOR] Fixing view manager reuse with Embedded timeline server (#10240)
---
 .../java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
index 5432e9b34ef..b89b5cdfa11 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
@@ -182,7 +182,7 @@ public class EmbeddedTimelineService {
 this.serviceConfig = timelineServiceConfBuilder.build();
 
 server = timelineServiceCreator.create(context, hadoopConf.newCopy(), 
serviceConfig,
-FSUtils.getFs(writeConfig.getBasePath(), hadoopConf.newCopy()), 
createViewManager());
+FSUtils.getFs(writeConfig.getBasePath(), hadoopConf.newCopy()), 
viewManager);
 serverPort = server.startService();
 LOG.info("Started embedded timeline server at " + hostAddr + ":" + 
serverPort);
   }
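The one-line patch above swaps a fresh `createViewManager()` call for the cached `viewManager` field when starting the timeline server. A minimal sketch of that lazy-reuse pattern, with hypothetical class names rather than Hudi's actual ones:

```java
// Sketch of the fix's pattern: build the view manager once on first use and
// hand the cached instance to later callers instead of constructing a new one.
import java.util.concurrent.atomic.AtomicInteger;

class ViewManager {
    // Counts constructions so reuse is observable in this sketch.
    static final AtomicInteger CREATED = new AtomicInteger();

    ViewManager() {
        CREATED.incrementAndGet();
    }
}

class TimelineServiceHolder {
    private ViewManager viewManager;

    ViewManager getOrCreateViewManager() {
        if (viewManager == null) {
            viewManager = new ViewManager(); // created on first call only
        }
        return viewManager;                  // same instance reused afterwards
    }
}
```

With this shape, starting the embedded server twice (or restarting it) keeps sharing one view manager instead of silently discarding the cached state.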



Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]

2023-12-04 Thread via GitHub


codope merged PR #10240:
URL: https://github.com/apache/hudi/pull/10240





Re: [PR] [HUDI-7159] Check the table type between hoodie.properties and table options [hudi]

2023-12-04 Thread via GitHub


hehuiyuan commented on code in PR #10209:
URL: https://github.com/apache/hudi/pull/10209#discussion_r1414949932


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception {
 
 // case2: empty table without data files
 Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");

Review Comment:
   It's OK.
   But when an inconsistency occurs, users may not be aware of it.
   
   If you recommend this way, I will fix the table type to be in line with 
hoodie.properties.






Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10244:
URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840083004

   
   ## CI report:
   
   * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10167:
URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840082759

   
   ## CI report:
   
   * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN
   * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197)
 
   * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10244:
URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840076440

   
   ## CI report:
   
   * 2a078bb478c90005df6e65793b969a4a8765f13f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


linliu-code commented on code in PR #10144:
URL: https://github.com/apache/hudi/pull/10144#discussion_r1414927204


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -310,6 +317,15 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   _: PartitionedFile => Iterator.empty
 }
 
+// Note that for CDC reader, the underlying data schema is stored in the 
'options' to separate from the CDC schema.
+val rawDataSchemaStr = options.getOrElse(rawDataSchema, "")

Review Comment:
   Changed the code to use table schema instead.






Re: [I] [SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10182:
URL: https://github.com/apache/hudi/issues/10182#issuecomment-1840059697

   @sayanpaul-plaid I will look into it.





Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10231:
URL: https://github.com/apache/hudi/issues/10231#issuecomment-1840058081

   Sure @soumilshah1995, let's connect tomorrow morning on this. Thanks.





Re: [I] [SUPPORT] Large gap between stages on read [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10239:
URL: https://github.com/apache/hudi/issues/10239#issuecomment-1840055732

   @noahtaite Are you setting `hoodie.metadata.enable` explicitly for the 
readers? It is disabled by default on the read side.
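   A minimal sketch of enabling it on the read side (assuming a Spark 
datasource reader; the option key is the one named above):

```properties
# Passed as a reader option, e.g. spark.read.format("hudi").option(key, value)
hoodie.metadata.enable=true
```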





[jira] [Updated] (HUDI-7172) Fix the timeline archiver to support concurrent writer

2023-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7172:
-
Labels: pull-request-available  (was: )

> Fix the timeline archiver to support concurrent writer
> --
>
> Key: HUDI-7172
> URL: https://issues.apache.org/jira/browse/HUDI-7172
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]

2023-12-04 Thread via GitHub


danny0405 opened a new pull request, #10244:
URL: https://github.com/apache/hudi/pull/10244

   ### Change Logs
   
   This is a regression of https://github.com/apache/hudi/pull/9209.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7172) Fix the timeline archiver to support concurrent writer

2023-12-04 Thread Danny Chen (Jira)
Danny Chen created HUDI-7172:


 Summary: Fix the timeline archiver to support concurrent writer
 Key: HUDI-7172
 URL: https://issues.apache.org/jira/browse/HUDI-7172
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Danny Chen
 Fix For: 1.0.0








Re: [PR] [MINOR] Fixing integ test writer for commit time generation [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10243:
URL: https://github.com/apache/hudi/pull/10243#issuecomment-1840022044

   
   ## CI report:
   
   * 9492a5b78c6a28dde8b43c6a8e4053020cb11414 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21301)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Fixing integ test writer for commit time generation [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10243:
URL: https://github.com/apache/hudi/pull/10243#issuecomment-1840015651

   
   ## CI report:
   
   * 9492a5b78c6a28dde8b43c6a8e4053020cb11414 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10242:
URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840015610

   
   ## CI report:
   
   * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21298)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6979) support EventTimeBasedCompactionStrategy

2023-12-04 Thread Kong Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kong Wei updated HUDI-6979:
---
Status: In Progress  (was: Open)

> support EventTimeBasedCompactionStrategy
> 
>
> Key: HUDI-6979
> URL: https://issues.apache.org/jira/browse/HUDI-6979
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> The current compaction strategies are based on the logfile size, the number 
> of logfiles, etc. The data time of the RO table generated by these 
> strategies is uncontrollable. Hudi also has a DayBased strategy, but it 
> relies on day-based partition paths and the time granularity is coarse.
> The *EventTimeBasedCompactionStrategy* strategy can generate event 
> time-friendly RO tables, whether it is day based partition or not. For 
> example, the strategy can select all logfiles whose data time is before 3 am 
> for compaction, so that the generated RO table data is before 3 am. If we 
> just want to query data before 3 am, we can just query the RO table, which is 
> much faster.
> With the strategy, I think we can expand the application scenarios of RO 
> tables.
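The selection rule the description sketches, compacting only the log files whose event time falls before a cutoff such as 3 am, can be pictured with a small standalone sketch. The `LogFile` model and method names below are illustrative assumptions, not Hudi's actual compaction API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the event-time cutoff idea from HUDI-6979: keep only
// the log files whose newest event time falls before a configured cutoff, so
// the compacted RO view contains data up to that event time. The LogFile model
// and method names are illustrative assumptions, not Hudi's compaction API.
public class EventTimeCutoffSketch {

    // Minimal stand-in for a log file carrying event-time metadata.
    record LogFile(String path, long maxEventTimeMillis) {}

    // Select files fully before the cutoff; newer files stay in the log.
    static List<LogFile> selectForCompaction(List<LogFile> files, long cutoffMillis) {
        return files.stream()
                .filter(f -> f.maxEventTimeMillis() < cutoffMillis)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LogFile> files = List.of(
                new LogFile("log-1", 1_000L),
                new LogFile("log-2", 2_500L),
                new LogFile("log-3", 4_000L));
        // Only log-1 and log-2 fall before the cutoff of 3_000.
        System.out.println(selectForCompaction(files, 3_000L).size()); // prints 2
    }
}
```

A real strategy would plug this predicate into the compaction planner's file selection; the cutoff would come from a strategy config rather than a method argument.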





(hudi) branch master updated: [HUDI-6980] Fixing closing of write client on failure scenarios (#10224)

2023-12-04 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0ccd621b258 [HUDI-6980] Fixing closing of write client on failure 
scenarios (#10224)
0ccd621b258 is described below

commit 0ccd621b2582e3d40811dd8b803f072747ffa5c9
Author: Sivabalan Narayanan 
AuthorDate: Mon Dec 4 20:20:34 2023 -0800

[HUDI-6980] Fixing closing of write client on failure scenarios (#10224)
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 33 ++
 .../timeline/service/handlers/MarkerHandler.java   |  4 +--
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 7c4ec8a71e7..bab0448642c 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -365,7 +365,7 @@ class HoodieSparkSqlWriterInternal {
 }
   }
 
-  val (writeResult, writeClient: SparkRDDWriteClient[_]) =
+  val (writeResult: HoodieWriteResult, writeClient: 
SparkRDDWriteClient[_]) =
 operation match {
   case WriteOperationType.DELETE | WriteOperationType.DELETE_PREPPED =>
 mayBeValidateParamsForAutoGenerationOfRecordKeys(parameters, 
hoodieConfig)
@@ -509,9 +509,16 @@ class HoodieSparkSqlWriterInternal {
 hoodieRecords
   }
 client.startCommitWithTime(instantTime, commitActionType)
-val writeResult = DataSourceUtils.doWriteOperation(client, 
dedupedHoodieRecords, instantTime, operation,
-  preppedSparkSqlWrites || preppedWriteOperation)
-(writeResult, client)
+try {
+  val writeResult = DataSourceUtils.doWriteOperation(client, 
dedupedHoodieRecords, instantTime, operation,
+preppedSparkSqlWrites || preppedWriteOperation)
+  (writeResult, client)
+} catch {
+  case e: HoodieException =>
+// close the write client in all cases
+handleWriteClientClosure(client, tableConfig, parameters, 
jsc.hadoopConfiguration())
+throw e
+}
 }
 
   // Check for errors and commit the write.
@@ -524,17 +531,21 @@ class HoodieSparkSqlWriterInternal {
 
 (writeSuccessful, common.util.Option.ofNullable(instantTime), 
compactionInstant, clusteringInstant, writeClient, tableConfig)
   } finally {
-// close the write client in all cases
-val asyncCompactionEnabled = isAsyncCompactionEnabled(writeClient, 
tableConfig, parameters, jsc.hadoopConfiguration())
-val asyncClusteringEnabled = isAsyncClusteringEnabled(writeClient, 
parameters)
-if (!asyncCompactionEnabled && !asyncClusteringEnabled) {
-  log.info("Closing write client")
-  writeClient.close()
-}
+handleWriteClientClosure(writeClient, tableConfig, parameters, 
jsc.hadoopConfiguration())
   }
 }
   }
 
+  private def handleWriteClientClosure(writeClient: SparkRDDWriteClient[_], 
tableConfig : HoodieTableConfig, parameters: Map[String, String], 
configuration: Configuration): Unit =  {
+// close the write client in all cases
+val asyncCompactionEnabled = isAsyncCompactionEnabled(writeClient, 
tableConfig, parameters, configuration)
+val asyncClusteringEnabled = isAsyncClusteringEnabled(writeClient, 
parameters)
+if (!asyncCompactionEnabled && !asyncClusteringEnabled) {
+  log.warn("Closing write client")
+  writeClient.close()
+}
+  }
+
   def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : 
Map[String, String], df: Dataset[Row]): WriteOperationType = {
 var operation = 
WriteOperationType.fromValue(hoodieConfig.getString(OPERATION))
 // TODO clean up
diff --git 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
index 390a4e2184f..42e2f40e629 100644
--- 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
+++ 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java
@@ -126,8 +126,8 @@ public class MarkerHandler extends Handler {
 if (dispatchingThreadFuture != null) {
   dispatchingThreadFuture.cancel(true);
 }
-dispatchingExecutorService.shutdown();
-batchingExecutorService.shutdown();
+dispatchingExecutorService.shutdownNow();
+batchingExecutorService.shutdownNow();
   }
 
   /**
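The `MarkerHandler` hunk above switches `shutdown()` to `shutdownNow()`. Per the standard `java.util.concurrent.ExecutorService` contract, `shutdown()` lets already-queued tasks run to completion, while `shutdownNow()` interrupts the running task and hands back the queued tasks that never started, which is why it is the stronger choice when tearing the handler down. A minimal, self-contained illustration (not Hudi code):

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Standalone illustration (not Hudi code) of the ExecutorService contract behind
// the change: shutdown() lets queued tasks drain, while shutdownNow() interrupts
// the running task and returns the queued tasks that never started.
public class ShutdownNowDemo {

    static int discardedTaskCount() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // First task occupies the single worker thread.
        pool.submit(() -> {
            started.countDown();
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
        });
        // These two queue up behind the sleeper and will never start.
        pool.submit(() -> { });
        pool.submit(() -> { });
        started.await(); // make sure the first task is really running
        List<Runnable> neverStarted = pool.shutdownNow(); // interrupts the sleeper
        return neverStarted.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(discardedTaskCount()); // prints 2
    }
}
```

With plain `shutdown()`, the two queued tasks would still run and the method would have nothing to return, so the timeline server could linger until the sleeping task finished.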

Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]

2023-12-04 Thread via GitHub


nsivabalan merged PR #10224:
URL: https://github.com/apache/hudi/pull/10224





(hudi) branch master updated (315924a3b6e -> e2b695abbdf)

2023-12-04 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 315924a3b6e [HUDI-7166] Provide a Procedure to Calculate Column Stats 
Overlap Degree (#10226)
 add e2b695abbdf [HUDI-7100] Fixing insert overwrite operations with drop 
dups config (#10222)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala | 78 ++
 2 files changed, 79 insertions(+), 1 deletion(-)



Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10242:
URL: https://github.com/apache/hudi/pull/10242#issuecomment-1839978754

   
   ## CI report:
   
   * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]

2023-12-04 Thread via GitHub


nsivabalan merged PR #10222:
URL: https://github.com/apache/hudi/pull/10222





Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10240:
URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839978716

   
   ## CI report:
   
   * aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21295)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839978473

   
   ## CI report:
   
   * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[PR] [MINOR] Fixing integ test writer for commit time generation [hudi]

2023-12-04 Thread via GitHub


nsivabalan opened a new pull request, #10243:
URL: https://github.com/apache/hudi/pull/10243

   ### Change Logs
   
   Fixing integ test writer for commit time generation
   
   ### Impact
   
   Fixing integ test writer for commit time generation
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7171) Fix 'show partitions' not display rewritten partitions

2023-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7171:
-
Labels: pull-request-available  (was: )

> Fix 'show partitions' not display rewritten partitions
> --
>
> Key: HUDI-7171
> URL: https://issues.apache.org/jira/browse/HUDI-7171
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> The `show partitions` SQL cannot return the correct result in the following two cases:
>  # dropped partitions should be displayed again after they are recreated.
>  # after an `insert overwrite` on a partitioned table, the partitions should be 
> marked as `dropped`





[PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]

2023-12-04 Thread via GitHub


wecharyu opened a new pull request, #10242:
URL: https://github.com/apache/hudi/pull/10242

   ### Change Logs
   The `show partitions` SQL cannot return the correct result in the following two cases:
   1. dropped partitions should be displayed again after they are recreated.
   2. after an `insert overwrite` on a partitioned table, the partitions should be 
marked as `dropped`
   
   ### Impact
   
   bug fix.
   
   ### Risk level (write none, low medium or high below)
   
   None.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]

2023-12-04 Thread via GitHub


stream2000 commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1414833910


##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly; we need a TTL (Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management strategies to hudi; people 
can configure the strategies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+
+This proposal introduces Partition TTL Management service to hudi. TTL 
management is like other table services such as Clean/Compaction/Clustering.
+The user can configure their TTL strategies through write configs, and Hudi will 
help users find expired partitions and delete them automatically.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
+implement some kind of TTL strategy, find expired partitions, and call 
+`delete_partition` by themselves. As the scale of installations grows, it 
+becomes increasingly important to implement a user-friendly TTL management 
+mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition TTL management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through 
independent Spark job, synchronous or asynchronous table services.
+
+### Strategy Definition
+
+The TTL strategies are similar to existing table service strategies. We can 
define TTL strategies like defining a clustering/clean/compaction strategy: 
+
+```properties
+hoodie.partition.ttl.management.strategy=KEEP_BY_TIME
+hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy
+hoodie.partition.ttl.days.retain=10
+```
+
+The config `hoodie.partition.ttl.management.strategy.class` is to provide a 
strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired 
partition paths to delete. `hoodie.partition.ttl.days.retain` is the 
strategy value used by `KeepByTimePartitionTTLManagementStrategy`, which means 
that we will expire partitions that have not been modified for more than that 
number of days. We will cover the `KeepByTimePartitionTTLManagementStrategy` strategy in 
detail in the next section.
+
+The core definition of `PartitionTTLManagementStrategy` looks like this: 
+
+```java
+/**
+ * Strategy for partition-level TTL management.
+ */
+public abstract class PartitionTTLManagementStrategy {
+  /**
+   * Get expired partition paths for a specific partition ttl management 
strategy.
+   *
+   * @return Expired partition paths.
+   */
+  public abstract List<String> getExpiredPartitionPaths();
+}
+```
+
+Users can provide their own implementation of `PartitionTTLManagementStrategy` 
and hudi will help delete the expired partitions.
+
+### KeepByTimeTTLManagementStrategy
+
+We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in 
the first version of partition TTL management implementation.
+
+The `KeepByTimePartitionTTLManagementStrategy` will calculate the 
+`lastModifiedTime` for each input partition. If the duration between now and 
+`lastModifiedTime` for the partition is larger than what 
+`hoodie.partition.ttl.days.retain` configures, 
+`KeepByTimePartitionTTLManagementStrategy` will mark this partition as an 
+expired partition. We use the day as the unit of expiration time since it is very 
+commonly used for data lakes. Open to ideas for this. 
+
+We will use the largest commit time of committed file groups in the 
+partition as the partition's
+`lastModifiedTime`. So any write (including normal DMLs, clustering, etc.) with a 
+larger instant time will change the partition's `lastModifiedTime`.

Review Comment:
   Again, leveraging `.hoodie_partition_metadata` will bring a format change, and 
it doesn't support any kind of transaction currently. As discussed with 
@danny0405, in 1.0.0 and later versions, which support efficient completion 
time queries on the timeline (#9565), we will have a more elegant way to get the 
`lastCommitTime`.
   You can see the updated RFC for details. 



##
rfc/rfc-65/rfc-65.md:
##
@@ -0,0 +1,209 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-582
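The `PartitionTTLManagementStrategy` contract and `hoodie.partition.ttl.days.retain` config quoted in this RFC review can be exercised with a standalone sketch. The `Map`-based partition model and the method shape below are illustrative assumptions, not Hudi's actual classes:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hedged sketch of the RFC's KEEP_BY_TIME idea: a partition expires once it has
// not been modified for more than hoodie.partition.ttl.days.retain days. The
// Map-based partition model below is an illustration, not Hudi's actual classes.
public class KeepByTimeSketch {

    // partitionPath -> lastModifiedTime (largest commit time seen in the partition)
    static List<String> getExpiredPartitionPaths(Map<String, Instant> partitions,
                                                 int daysRetain, Instant now) {
        Duration ttl = Duration.ofDays(daysRetain);
        return partitions.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(ttl) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-12-04T00:00:00Z");
        Map<String, Instant> partitions = Map.of(
                "dt=2023-11-01", Instant.parse("2023-11-01T00:00:00Z"),  // 33 days old
                "dt=2023-12-01", Instant.parse("2023-12-01T00:00:00Z")); // 3 days old
        // With a 10-day retention only the November partition expires.
        System.out.println(getExpiredPartitionPaths(partitions, 10, now)); // prints [dt=2023-11-01]
    }
}
```

In the real service, the expired paths returned by the strategy would be handed to the existing `delete_partition` machinery rather than deleted directly.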

[jira] [Created] (HUDI-7171) Fix 'show partitions' not display rewritten partitions

2023-12-04 Thread Wechar (Jira)
Wechar created HUDI-7171:


 Summary: Fix 'show partitions' not display rewritten partitions
 Key: HUDI-7171
 URL: https://issues.apache.org/jira/browse/HUDI-7171
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Wechar
Assignee: Wechar


The `show partitions` SQL cannot return the correct result in the following two cases:
 # dropped partitions should be displayed again after they are recreated.
 # after an `insert overwrite` on a partitioned table, the partitions should be 
marked as `dropped`





Re: [I] [SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue [hudi]

2023-12-04 Thread via GitHub


nfarah86 commented on issue #10182:
URL: https://github.com/apache/hudi/issues/10182#issuecomment-1839954297

   tagging @ad1happy2go 





Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10222:
URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839922168

   
   ## CI report:
   
   * 159b36a0f851c729e3ac7d690f2e0963dd17f85d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21293)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7170) Implement HFile reader independent of HBase

2023-12-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7170:
-
Labels: pull-request-available  (was: )

> Implement HFile reader independent of HBase
> ---
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We'd like to provide our own implementation of HFile reader which does not use 
> HBase dependencies.  In the long term, we should also decouple the HFile 
> reader from hadoop FileSystem abstractions.





[jira] [Updated] (HUDI-7170) Implement HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7170:

Summary: Implement HFile reader independent of HBase  (was: Add HFile 
reader independent of HBase)

> Implement HFile reader independent of HBase
> ---
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We'd like to provide our own implementation of HFile reader which does not use 
> HBase dependencies.  In the long term, we should also decouple the HFile 
> reader from hadoop FileSystem abstractions.





[PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]

2023-12-04 Thread via GitHub


yihua opened a new pull request, #10241:
URL: https://github.com/apache/hudi/pull/10241

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7170) Add HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7170:
---

 Summary: Add HFile reader independent of HBase
 Key: HUDI-7170
 URL: https://issues.apache.org/jira/browse/HUDI-7170
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7170:

Description: We'd like to provide our own implementation of HFile reader 
which does not use HBase dependencies.  In the long term, we should also 
decouple the HFile reader from hadoop FileSystem abstractions.

> Add HFile reader independent of HBase
> -
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We'd like to provide our own implementation of HFile reader which does not use 
> HBase dependencies.  In the long term, we should also decouple the HFile 
> reader from hadoop FileSystem abstractions.





[jira] [Assigned] (HUDI-7170) Add HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7170:
---

Assignee: Ethan Guo

> Add HFile reader independent of HBase
> -
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7170:

Fix Version/s: 1.0.0

> Add HFile reader independent of HBase
> -
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase

2023-12-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7170:

Priority: Blocker  (was: Major)

> Add HFile reader independent of HBase
> -
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree

2023-12-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7166.

Resolution: Fixed

Fixed via master branch: 315924a3b6e2430be1c5662bacb696c8deae

> Provide a Procedure to Calculate Column Stats Overlap Degree
> 
>
> Key: HUDI-7166
> URL: https://issues.apache.org/jira/browse/HUDI-7166
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7110 , a tool has been made available to display column stats. 
> However, this tool is not very user-friendly for manual observation when 
> dealing with large data volumes. For instance, with tens of thousands of 
> parquet files, the number of rows in column stats could be of the order of 
> hundreds of thousands. This renders the data virtually unreadable to humans, 
> necessitating further processing by code. Yet, if an administrator simply 
> wishes to directly observe the data layout based on column stats under such 
> circumstances, a more intuitive display tool is required. Here, we offer a 
> tool that calculates the overlap degree of column stats based on partition 
> and column name.
>  
> Overlap degree refers to the extent to which the min-max ranges of different 
> files intersect with each other. This directly affects the effectiveness of 
> data skipping.
>  
> In fact, a similar concept is also provided by Snowflake to aid their 
> clustering process. 
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] 
> Our implementation here is not overly complex.
>  
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file 
> number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| |
> |path|c8|1.33|2|2|1|1|1|1|3| |
> This content provides a straightforward representation of the relevant 
> statistics.
>  
> For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap 
> within the ranges 3–5 and 7.
> If the filter conditions for 'id' during data skipping include these values, 
> multiple files will be filtered out. For a simpler case, if it's an equality 
> query, 2 files will be filtered within these ranges, and no more than one 
> file will be filtered in other cases (possibly outside of the range).
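The worked example in the description (ranges 1–5, 3–7 and 7–8 for column 'id') can be checked with a short sketch. The `Range` record and counting helper are illustrative only, not the procedure's actual implementation:

```java
import java.util.List;

// Checks the worked example from the description: column-stat ranges 1-5, 3-7
// and 7-8 for column 'id'. filesMatching counts how many files an equality
// filter on a value would fail to skip. Illustrative only, not Hudi code.
public class ColumnStatsOverlapSketch {

    record Range(String file, int min, int max) {}

    static long filesMatching(List<Range> ranges, int value) {
        return ranges.stream()
                .filter(r -> r.min() <= value && value <= r.max())
                .count();
    }

    public static void main(String[] args) {
        List<Range> id = List.of(
                new Range("a.parquet", 1, 5),
                new Range("b.parquet", 3, 7),
                new Range("c.parquet", 7, 8));
        System.out.println(filesMatching(id, 4)); // inside the 3-5 overlap: prints 2
        System.out.println(filesMatching(id, 7)); // b and c both contain 7: prints 2
        System.out.println(filesMatching(id, 2)); // only a.parquet: prints 1
    }
}
```

A higher count for a given value means weaker data skipping, which is exactly what the procedure's overlap-degree metric summarizes across a partition.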





[jira] [Updated] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree

2023-12-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7166:
-
Fix Version/s: 1.0.0

> Provide a Procedure to Calculate Column Stats Overlap Degree
> 
>
> Key: HUDI-7166
> URL: https://issues.apache.org/jira/browse/HUDI-7166
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7110 , a tool has been made available to display column stats. 
> However, this tool is not very user-friendly for manual observation when 
> dealing with large data volumes. For instance, with tens of thousands of 
> parquet files, the number of rows in column stats could be of the order of 
> hundreds of thousands. This renders the data virtually unreadable to humans, 
> necessitating further processing by code. Yet, if an administrator simply 
> wishes to directly observe the data layout based on column stats under such 
> circumstances, a more intuitive display tool is required. Here, we offer a 
> tool that calculates the overlap degree of column stats based on partition 
> and column name.
>  
> Overlap degree refers to the extent to which the min-max ranges of different 
> files intersect with each other. This directly affects the effectiveness of 
> data skipping.
>  
> In fact, a similar concept is also provided by Snowflake to aid their 
> clustering process. 
> [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] 
> Our implementation here is not overly complex.
>  
> It yields output similar to the following:
> |Partition path|Field name|Average overlap|Maximum file overlap|Total file number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number|
> |path|c8|1.33|2|2|1|1|1|1|3|
> This content provides a straightforward representation of the relevant 
> statistics.
>  
> For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
> is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap 
> within the ranges 3–5 and 7.
> If a data-skipping filter on 'id' includes these values, multiple files 
> cannot be skipped and must be read. In the simpler case of an equality 
> query, two files will match when the value falls inside the overlapping 
> ranges, and at most one file will match otherwise (possibly none, if the 
> value lies outside every range).





(hudi) branch master updated (92fc0c09192 -> 315924a3b6e)

2023-12-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 92fc0c09192 [HUDI-7165][FOLLOW-UP] Add test case for stopping 
heartbeat for un-committed events (#10230)
 add 315924a3b6e [HUDI-7166] Provide a Procedure to Calculate Column Stats 
Overlap Degree (#10226)

No new revisions were added by this update.

Summary of changes:
 .../hudi/metadata/HoodieTableMetadataUtil.java |  28 ++
 .../hudi/command/procedures/HoodieProcedures.scala |   1 +
 .../ShowColumnStatsOverlapProcedure.scala  | 338 +
 .../sql/hudi/procedure/TestMetadataProcedure.scala |  57 
 4 files changed, 424 insertions(+)
 create mode 100644 hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


danny0405 merged PR #10226:
URL: https://github.com/apache/hudi/pull/10226


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events (#10230)

2023-12-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 92fc0c09192 [HUDI-7165][FOLLOW-UP] Add test case for stopping 
heartbeat for un-committed events (#10230)
92fc0c09192 is described below

commit 92fc0c0919278b6e43a7c45b92c80be7a39525ec
Author: ksmou <135721692+ks...@users.noreply.github.com>
AuthorDate: Tue Dec 5 10:29:29 2023 +0800

[HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for 
un-committed events (#10230)
---
 .../sink/TestStreamWriteOperatorCoordinator.java   | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
index 0f3d1947128..5cbe9899b8d 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
@@ -19,7 +19,9 @@
 package org.apache.hudi.sink;
 
 import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.heartbeat.HoodieHeartbeatClient;
 import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.model.WriteConcurrencyMode;
@@ -65,7 +67,9 @@ import static org.hamcrest.CoreMatchers.is;
 import static org.hamcrest.CoreMatchers.startsWith;
 import static org.hamcrest.MatcherAssert.assertThat;
 import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertNull;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
@@ -185,6 +189,40 @@ public class TestStreamWriteOperatorCoordinator {
     assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception {
+    // reset
+    reset();
+    // override the default configuration
+    Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+    conf.setString(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY.key(), HoodieFailedWritesCleaningPolicy.LAZY.name());
+    OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1);
+    coordinator = new StreamWriteOperatorCoordinator(conf, context);
+    coordinator.start();
+    coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+    assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy());
+
+    final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0);
+
+    // start one instant and not commit it
+    coordinator.handleEventFromOperator(0, event0);
+    String instant = coordinator.getInstant();
+    HoodieHeartbeatClient heartbeatClient = coordinator.getWriteClient().getHeartbeatClient();
+    assertNotNull(heartbeatClient.getHeartbeat(instant), "Heartbeat is missing");
+
+    String basePath = tempFile.getAbsolutePath();
+    HoodieWrapperFileSystem fs = coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs();
+
+    assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is existed");
+
+    // send bootstrap event to stop the heartbeat for this instant
+    WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0);
+    coordinator.handleEventFromOperator(0, event1);
+
+    assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is stopped and cleared");
+  }
+
   @Test
   public void testRecommitWithLazyFailedWritesCleanPolicy() {
     coordinator.getWriteClient().getConfig().setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, HoodieFailedWritesCleaningPolicy.LAZY.name());
 HoodieFailedWritesCleaningPolicy.LAZY.name());



Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]

2023-12-04 Thread via GitHub


danny0405 merged PR #10230:
URL: https://github.com/apache/hudi/pull/10230





Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10240:
URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839837203

   
   ## CI report:
   
   * aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21295)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839836771

   
   ## CI report:
   
   * 1027df0a1aa63ef976ce5ba4494af252ff8faed0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21281)
 
   * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
 
   
   





Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10240:
URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839829542

   
   ## CI report:
   
   * aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed UNKNOWN
   
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10144:
URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839829264

   
   ## CI report:
   
   * 1027df0a1aa63ef976ce5ba4494af252ff8faed0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21281)
 
   * 363470311395f04bdd0462bc058a9b25bd94bc9f UNKNOWN
   
   





Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10222:
URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839823227

   
   ## CI report:
   
   * b53a22922751b4744c96c07666bcb5ba13e2cb60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21256)
 
   * 159b36a0f851c729e3ac7d690f2e0963dd17f85d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21293)
 
   
   





[PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]

2023-12-04 Thread via GitHub


nsivabalan opened a new pull request, #10240:
URL: https://github.com/apache/hudi/pull/10240

   ### Change Logs
   
   Fixing view manager reuse with Embedded timeline server
   
   ### Impact
   
   Fixing view manager reuse with Embedded timeline server
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


linliu-code commented on code in PR #10144:
URL: https://github.com/apache/hudi/pull/10144#discussion_r1414693685


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -310,6 +317,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
       _: PartitionedFile => Iterator.empty
     }
 
+    // Note that for CDC reader, the underlying data schema is stored in the 'options' to separate from the CDC schema.
+    val rawDataSchemaStr = options.getOrElse(rawDataSchema, "")

Review Comment:
   Good question, we can read the schema from here directly. 






Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


linliu-code commented on code in PR #10144:
URL: https://github.com/apache/hudi/pull/10144#discussion_r1414691276


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCDCFileIndex.scala:
##
@@ -42,29 +42,34 @@ class HoodieCDCFileIndex (override val spark: SparkSession,
   extends HoodieIncrementalFileIndex(
     spark, metaClient, schemaSpec, options, fileStatusCache, includeLogFiles, shouldEmbedFileSlices
   ) with FileIndex {
+  private val emptyPartitionPath: String = "empty_partition_path";
   val cdcRelation: CDCRelation = CDCRelation.getCDCRelation(spark.sqlContext, metaClient, options)
   val cdcExtractor: HoodieCDCExtractor = cdcRelation.cdcExtractor
 
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
-    val partitionToFileGroups = cdcExtractor.extractCDCFileSplits().asScala.groupBy(_._1.getPartitionPath).toSeq
-    partitionToFileGroups.map {
-      case (partitionPath, fileGroups) =>
-        val fileGroupIds: List[FileStatus] = fileGroups.map { fileGroup => {
-          // We create a fake FileStatus to wrap the information of HoodieFileGroupId, which are used
-          // later to retrieve the corresponding CDC file group splits.
-          val fileGroupId: HoodieFileGroupId = fileGroup._1
-          new FileStatus(0, true, 0, 0, 0,
-            0, null, "", "", null,
-            new Path(fileGroupId.getPartitionPath, fileGroupId.getFileId))
-        }}.toList
-        val partitionValues: InternalRow = new GenericInternalRow(doParsePartitionColumnValues(
-          metaClient.getTableConfig.getPartitionFields.get(), partitionPath).asInstanceOf[Array[Any]])
+    cdcExtractor.extractCDCFileSplits().asScala.map {
+      case (fileGroupId, fileSplits) =>
+        val partitionPath = if (fileGroupId.getPartitionPath.isEmpty) emptyPartitionPath else fileGroupId.getPartitionPath

Review Comment:
   Here we cannot use an empty string, since line 63 requires the 
partition_path to be non-empty.
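The design choice under discussion — substituting a sentinel for the empty partition path of a non-partitioned table, then undoing it when resolving real paths — can be sketched as a simple round-trip. The helper names below are hypothetical; only the sentinel value comes from the diff:

```python
# Sentinel mirroring the Scala constant in HoodieCDCFileIndex.
EMPTY_PARTITION_PATH = "empty_partition_path"

def encode_partition_path(path):
    """Non-partitioned tables yield an empty partition path; downstream code
    requires a non-empty value, so substitute the sentinel."""
    return EMPTY_PARTITION_PATH if path == "" else path

def decode_partition_path(path):
    """Reverse the substitution when resolving the real file location."""
    return "" if path == EMPTY_PARTITION_PATH else path

print(decode_partition_path(encode_partition_path("")))  # round-trips back to ""
print(encode_partition_path("year=2023/month=12"))       # real paths pass through unchanged
```

The trade-off is that any consumer of the encoded path must know to decode it, which is why a plain empty string was suggested as the alternative.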






Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10222:
URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839787728

   
   ## CI report:
   
   * b53a22922751b4744c96c07666bcb5ba13e2cb60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21256)
 
   * 159b36a0f851c729e3ac7d690f2e0963dd17f85d UNKNOWN
   
   





Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10224:
URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839680195

   
   ## CI report:
   
   * c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21292)
 
   
   





Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]

2023-12-04 Thread via GitHub


mzheng-plaid commented on issue #9977:
URL: https://github.com/apache/hudi/issues/9977#issuecomment-1839667223

   @ad1happy2go @codope I was able to reproduce with the following Spark code 
(a 5-row dataset). The problem seems to be related to the handling of array 
fields in structs. Could you confirm whether you're able to reproduce it using 
this code?
   
   ```
   from pyspark.sql.types import StringType
   from pyspark.sql import functions as F
   from pyspark.sql import types as T
   
   import uuid
   from pyspark.sql import Row
   import random
   
   
   hudi_options = {
   "hoodie.table.name": "clustering_bug_test",
   "hoodie.datasource.write.recordkey.field": "id.value",
   "hoodie.datasource.write.partitionpath.field": "partition:SIMPLE",
   "hoodie.datasource.write.table.name": "clustering_bug_test",
   "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.datasource.write.precombine.field": "publishedAtUnixNano",
   "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
   "hoodie.compaction.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
   # Turn off small file optimizations
   "hoodie.parquet.small.file.limit": "0",
   # Turn off metadata table
   "hoodie.metadata.enable": "false",
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
   # Hive style partitioning
   "hoodie.datasource.write.hive_style_partitioning": "true",
   'hoodie.cleaner.commits.retained': 1,
   "hoodie.bootstrap.index.enable": "false",
   'hoodie.commits.archival.batch': 5,
   # Bloom filter
   "hoodie.index.type": "BLOOM",
   'hoodie.bloom.index.prune.by.ranges': 'false', 
   }
   clustering_hudi_options = {
   **hudi_options,
   "hoodie.clustering.inline": "true",
   "hoodie.clustering.inline.max.commits": 1,
   "hoodie.clustering.plan.strategy.small.file.limit": 256 * 1024 * 1024,
   "hoodie.clustering.plan.strategy.target.file.max.bytes": 512 * 1024 * 1024,
   "hoodie.clustering.plan.strategy.sort.columns": "id.value",
   "hoodie.clustering.plan.strategy.max.num.groups": 300,
   }
   
   random.seed(10)
   dummy_data = [
   Row(
   id=Row(value=str(uuid.uuid4())),
   publishedAtUnixNano=i,
   partition="1",
   struct_array_column=Row(
   element=[str(random.randint(0, 100)) for i in range(random.randint(1, 100))],
   ),
   struct_column=Row(
   nested_array_column=Row(
   element=[str(random.randint(0, 100)) for i in range(random.randint(1, 100))],
   ),
   ),
   # This padding ensures files are large enough to reproduce the data loss
   **{
   f"col_{i}": str(uuid.uuid4())
   for i in range(100)
   },
   )
   for i in range(5)
   ]
   df_dummy = spark.createDataFrame(dummy_data)
   
   # This was tested in S3
   PATH = f"{OUTPUT_PATH}"
   
   df_dummy.write.format("hudi").options(**hudi_options).mode("append").save(PATH)
   
   read_df = spark.read.format("hudi").load(PATH)
   data = read_df.take(1)
   init_count = read_df.count()
   
   # This upsert should be a no-op (re-writing 1 existing row)
   upsert_df = spark.createDataFrame(data, read_df.schema)
   
   upsert_df.write.format("hudi").options(**clustering_hudi_options).mode("append").save(PATH)
   
   read_df = spark.read.format("hudi").load(PATH)
   final_count = read_df.count()
   print(f"{init_count}, {final_count}")
   ```
   
   The schema is:
   ```
   root
    |-- id: struct (nullable = true)
    |    |-- value: string (nullable = true)
    |-- publishedAtUnixNano: long (nullable = true)
    |-- partition: string (nullable = true)
    |-- struct_array_column: struct (nullable = true)
    |    |-- element: array (nullable = true)
    |    |    |-- element: string (containsNull = true)
    |-- struct_column: struct (nullable = true)
    |    |-- nested_array_column: struct (nullable = true)
    |    |    |-- element: array (nullable = true)
    |    |    |    |-- element: string (containsNull = true)
    |-- col_0: string (nullable = true)
    |-- col_1: string (nullable = true)
    |-- col_2: string (nullable = true)
    |-- col_3: string (nullable = true)
   ...
   ```
   
   We expect init_count and final_count to be the same, but the actual output 
is (it may vary):
   ```
   5, 48000
   ```
   





Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]

2023-12-04 Thread via GitHub


yihua commented on code in PR #10144:
URL: https://github.com/apache/hudi/pull/10144#discussion_r1414533367


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -310,6 +317,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
       _: PartitionedFile => Iterator.empty
     }
 
+    // Note that for CDC reader, the underlying data schema is stored in the 'options' to separate from the CDC schema.
+    val rawDataSchemaStr = options.getOrElse(rawDataSchema, "")

Review Comment:
   `rawDataSchemaStr` is the table schema.  Can the table schema be directly 
read here instead of being passed in? Does the `tableSchema` represent the 
actual data schema?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCDCFileIndex.scala:
##
@@ -42,29 +42,34 @@ class HoodieCDCFileIndex (override val spark: SparkSession,
   extends HoodieIncrementalFileIndex(
     spark, metaClient, schemaSpec, options, fileStatusCache, includeLogFiles, shouldEmbedFileSlices
   ) with FileIndex {
+  private val emptyPartitionPath: String = "empty_partition_path";
   val cdcRelation: CDCRelation = CDCRelation.getCDCRelation(spark.sqlContext, metaClient, options)
   val cdcExtractor: HoodieCDCExtractor = cdcRelation.cdcExtractor
 
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
-    val partitionToFileGroups = cdcExtractor.extractCDCFileSplits().asScala.groupBy(_._1.getPartitionPath).toSeq
-    partitionToFileGroups.map {
-      case (partitionPath, fileGroups) =>
-        val fileGroupIds: List[FileStatus] = fileGroups.map { fileGroup => {
-          // We create a fake FileStatus to wrap the information of HoodieFileGroupId, which are used
-          // later to retrieve the corresponding CDC file group splits.
-          val fileGroupId: HoodieFileGroupId = fileGroup._1
-          new FileStatus(0, true, 0, 0, 0,
-            0, null, "", "", null,
-            new Path(fileGroupId.getPartitionPath, fileGroupId.getFileId))
-        }}.toList
-        val partitionValues: InternalRow = new GenericInternalRow(doParsePartitionColumnValues(
-          metaClient.getTableConfig.getPartitionFields.get(), partitionPath).asInstanceOf[Array[Any]])
+    cdcExtractor.extractCDCFileSplits().asScala.map {
+      case (fileGroupId, fileSplits) =>
+        val partitionPath = if (fileGroupId.getPartitionPath.isEmpty) emptyPartitionPath else fileGroupId.getPartitionPath

Review Comment:
   using empty String instead for non-partitioned table?






Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10238:
URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839467433

   
   ## CI report:
   
   * fce0e1eb204f4377fb9f307168b43017d3acf73d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21291)
 
   
   





Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10224:
URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839408514

   
   ## CI report:
   
   * 05e298ae8c265de111cababf120f194d960f0472 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21261)
 
   * c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21292)
 
   
   





Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10224:
URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839397147

   
   ## CI report:
   
   * 05e298ae8c265de111cababf120f194d960f0472 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21261)
 
   * c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef UNKNOWN
   
   





Re: [I] [SUPPORT] Large gap between stages on read [hudi]

2023-12-04 Thread via GitHub


noahtaite commented on issue #10239:
URL: https://github.com/apache/hudi/issues/10239#issuecomment-1839395374

   Stage 1 stack trace:
   ```
   org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
   org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:103)
   org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllFilesInPartitions(FileSystemBackedTableMetadata.java:157)
   org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPathFiles(BaseHoodieTableFileIndex.java:358)
   org.apache.hudi.BaseHoodieTableFileIndex.loadFileSlicesForPartitions(BaseHoodieTableFileIndex.java:249)
   org.apache.hudi.BaseHoodieTableFileIndex.ensurePreloadedPartitions(BaseHoodieTableFileIndex.java:241)
   org.apache.hudi.BaseHoodieTableFileIndex.getInputFileSlices(BaseHoodieTableFileIndex.java:227)
   org.apache.hudi.SparkHoodieTableFileIndex.listFileSlices(SparkHoodieTableFileIndex.scala:172)
   org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:223)
   org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:65)
   org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:353)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:365)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:399)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:478)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:398)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:365)
   org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
   scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
   scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
   scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
   ```
   
   -- Large Gap --
   
   Stage 2 stack trace:
   ```
   org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
   org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
   org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefix(FileSystemBackedTableMetadata.java:109)
   org.apache.hudi.metadata.FileSystemBackedTableMetadata.lambda$getPartitionPathWithPathPrefixes$0(FileSystemBackedTableMetadata.java:91)
   java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
   java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
   java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
   java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
   java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
   java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
   java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
   org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefixes(FileSystemBackedTableMetadata.java:95)
   org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:281)
   org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:206)
   org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:205)
   org.apache.hudi.SparkHoodieTableFileIndex.listFileSlices(SparkHoodieTableFileIndex.scala:171)
   org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:223)
   org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:65)
   org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:353)
   org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:365)
   ```
   
   Based on some digging in the code, I believe the presence of `FileSystemBackedTableMetadata` in these traces means my Hoodie metadata table isn't being used on the read path. I'm digging into my metadata stats to confirm this. My readers should be picking up `hoodie.metadata.enable` by default.
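
   One thing worth double-checking is whether the reader is explicitly opting into the metadata table. A minimal sketch of forcing it on the read path (bucket and table path are placeholders):

   ```
   spark.read.format("hudi")
     .option("hoodie.metadata.enable", "true")  // force file listing through the metadata table
     .load("s3://<bucket>/<table-path>")
   ```

   If `FileSystemBackedTableMetadata` still appears in the traces with this set explicitly, the metadata table itself may be unavailable or not fully built for the table.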





Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]

2023-12-04 Thread via GitHub


noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-1839189354

   Bump... I think data inconsistency after clustering should be treated as a critical-priority investigation.





[I] [SUPPORT] Large gap between stages on read [hudi]

2023-12-04 Thread via GitHub


noahtaite opened a new issue, #10239:
URL: https://github.com/apache/hudi/issues/10239

   **Describe the problem you faced**
   
   I have multiple applications reading our 120-table, 1PB+ Hudi OLAP data lake that are seeing gaps of 1hr+ between application stages when collecting the data:
   
   Screenshot: https://github.com/apache/hudi/assets/24283126/a03dd51b-5f0e-4214-a731-2bf81da95926
   
   Note the 1hr gap between stages 12 and 13.
   
   I have been able to consistently reproduce this in my dev environment and 
see the following behaviour:
   - Calling .load() on the table finishes quickly.
   - Calling .count() on a specific partition has all jobs in the Spark History 
Server complete in under 10 minutes, but then a 1hr gap is observed before the 
output of the count is reported.
   - During the gap, my cluster auto-scales down to 1 executor
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. 20TB+ Hudi table with ~250k partitions, metadata enabled.
   2. Load + count a single partition.
   3. Observe a large gap when just a single executor is running.
   4. Slow read performance.
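
   The load-and-count pattern from the steps above can be sketched as follows (table path and partition column are hypothetical):

   ```
   val df = spark.read.format("hudi").load("s3://<bucket>/<table-path>")  // returns quickly
   df.filter(col("partition_col") === "2023-12-01").count()               // jobs finish fast, then the 1hr gap before the result is reported
   ```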
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.4.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   
   **Stacktrace**
   
   
   I'm just trying to gain a base-level understanding of where this time is going, or have someone point me in the right direction for troubleshooting. The runtime cost is quite low because of the scale-down, but analytics developers are not happy with their applications slowing down.





Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10238:
URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839112667

   
   ## CI report:
   
   * fce0e1eb204f4377fb9f307168b43017d3acf73d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21291)
 
   
   





Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10238:
URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839098202

   
   ## CI report:
   
   * fce0e1eb204f4377fb9f307168b43017d3acf73d UNKNOWN
   
   





Re: [I] [SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table [hudi]

2023-12-04 Thread via GitHub


shubhamn21 commented on issue #10127:
URL: https://github.com/apache/hudi/issues/10127#issuecomment-1839090525

   ```
   23/12/04 08:00:23 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231204075005981__clean__INFLIGHT]
   java.lang.IllegalArgumentException
   at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
   ```
   
   Hi @nsivabalan, tagging you here as I saw you were the assignee on a similar [issue](https://github.com/apache/hudi/issues/6463). I am seeing the above clean-action warning, which triggers subsequent failures in the job. Could this have something to do with S3 performance?





Re: [I] [SUPPORT] Handling of DELETE operation using Debezium Kafka connector [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10181:
URL: https://github.com/apache/hudi/issues/10181#issuecomment-1839076172

   @seethb Full details on this similar issue: https://github.com/apache/hudi/issues/9143
   
   Go over it and let us know in case you have any doubts. Thanks.
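
   For reference, the usual Hudi Streamer wiring for Debezium CDC so that delete events are applied as deletes looks roughly like the following (class and field names assume the Postgres Debezium source; see the linked issue for the authoritative details):

   ```
   hoodie.datasource.write.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
   hoodie.datasource.write.precombine.field=_event_lsn
   ```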





[PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]

2023-12-04 Thread via GitHub


the-other-tim-brown opened a new pull request, #10238:
URL: https://github.com/apache/hudi/pull/10238

   ### Change Logs
   
   For files that are not created by Hudi but added to the table (zero copy 
bootstrap or OneTable case) we are unable to remove the column stats after 
these files are removed from the table view.
   
   ### Impact
   
   Allows proper cleanup of metadata table's stats partition
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-04 Thread via GitHub


soumilshah1995 commented on issue #10231:
URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839051919

   @ad1happy2go do you think you can help me set up the Derby metastore? I believe I already have one, but I am confused about the steps. I would appreciate it if we could catch up on Slack so you can help me understand this a bit. It would be a great opportunity for me to learn and pass the knowledge on.





Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10231:
URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839047567

   You can find the hive scripts here - 
https://github.com/apache/hive/tree/master/metastore/scripts/upgrade





Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10231:
URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839046110

   @soumilshah1995 Have you configured the external metastore? 
   We need to setup the hive metastore tables. 
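
   A minimal sketch of pointing Spark at an external Hive metastore so the sync tool has somewhere to write (host and port are placeholders):

   ```
   --conf spark.sql.catalogImplementation=hive
   --conf spark.hadoop.hive.metastore.uris=thrift://<metastore-host>:9083
   ```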





Re: [I] [SUPPORT] What is the priority for the parameter settings of hudi to take effect [hudi]

2023-12-04 Thread via GitHub


ad1happy2go commented on issue #10236:
URL: https://github.com/apache/hudi/issues/10236#issuecomment-1839001438

   @JoshuaZhuCN 
   
   The order you listed is correct. Keep in mind that tblproperties are only applied on the write path.
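
   As a toy illustration of that precedence (assuming the common resolution order where per-write options override tblproperties, which override writer defaults — this is a sketch, not Hudi's actual resolution code):

   ```python
   def effective_config(defaults, tblproperties, write_options):
       # Later maps win: defaults < tblproperties < per-write options.
       merged = dict(defaults)
       merged.update(tblproperties)
       merged.update(write_options)
       return merged

   cfg = effective_config(
       defaults={"hoodie.parquet.max.file.size": "125829120"},
       tblproperties={"hoodie.parquet.max.file.size": "268435456"},
       write_options={},
   )
   print(cfg["hoodie.parquet.max.file.size"])  # -> 268435456 (tblproperties wins when no write option is set)
   ```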
   





[jira] [Assigned] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7154:
-

Assignee: sivabalan narayanan

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan closed HUDI-7154.
-
Resolution: Fixed

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}





(hudi) branch master updated: [HUDI-6822] Fix deletes handling in hbase index when partition path is updated (#9630)

2023-12-04 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b9fb9f616e6 [HUDI-6822] Fix deletes handling in hbase index when 
partition path is updated (#9630)
b9fb9f616e6 is described below

commit b9fb9f616e6585b5e92f796e50ef93747d38fb49
Author: flashJd 
AuthorDate: Tue Dec 5 00:08:35 2023 +0800

[HUDI-6822] Fix deletes handling in hbase index when partition path is 
updated (#9630)


-

Co-authored-by: Balaji Varadarajan 
---
 .../org/apache/hudi/index/HoodieIndexUtils.java|  1 +
 .../metadata/HoodieBackedTableMetadataWriter.java  | 68 +---
 .../hudi/index/hbase/SparkHoodieHBaseIndex.java|  4 +
 .../index/hbase/TestSparkHoodieHBaseIndex.java | 95 ++
 .../org/apache/hudi/common/model/HoodieRecord.java | 23 +-
 .../hudi/common/model/HoodieRecordDelegate.java| 32 ++--
 .../model/TestHoodieRecordSerialization.scala  | 12 +--
 7 files changed, 140 insertions(+), 95 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
index 33e8d501943..de3d181ad06 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
@@ -323,6 +323,7 @@ public class HoodieIndexUtils {
   } else {
 // merged record has a different partition: issue a delete to the old partition and insert the merged record to the new partition
 HoodieRecord deleteRecord = createDeleteRecord(config, existing.getKey());
+deleteRecord.setIgnoreIndexUpdate(true);
 return Arrays.asList(tagRecord(deleteRecord, existing.getCurrentLocation()), merged).iterator();
   }
 });
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index ecdf93eda1d..781a9024117 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -29,10 +29,8 @@ import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.SerializableConfiguration;
 import org.apache.hudi.common.data.HoodieData;
-import org.apache.hudi.common.data.HoodiePairData;
 import org.apache.hudi.common.engine.HoodieEngineContext;
 import org.apache.hudi.common.fs.FSUtils;
-import org.apache.hudi.common.function.SerializableFunction;
 import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.HoodieBaseFile;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
@@ -89,17 +87,14 @@ import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
-import java.util.Iterator;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
-import java.util.Objects;
 import java.util.Set;
 import java.util.function.Function;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
-import java.util.stream.Stream;
 
 import static org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields;
 import static org.apache.hudi.common.config.HoodieMetadataConfig.DEFAULT_METADATA_POPULATE_META_FIELDS;
@@ -939,8 +934,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
 
   // Updates for record index are created by parsing the WriteStatus which is a hudi-client object. Hence, we cannot yet move this code
   // to the HoodieTableMetadataUtil class in hudi-common.
-  HoodieData updatesFromWriteStatuses = getRecordIndexUpdates(writeStatus);
-  HoodieData additionalUpdates = getRecordIndexAdditionalUpdates(updatesFromWriteStatuses, commitMetadata);
+  HoodieData updatesFromWriteStatuses = getRecordIndexUpserts(writeStatus);
+  HoodieData additionalUpdates = getRecordIndexAdditionalUpserts(updatesFromWriteStatuses, commitMetadata);
   partitionToRecordMap.put(RECORD_INDEX, updatesFromWriteStatuses.union(additionalUpdates));
   updateFunctionalIndexIfPresent(commitMetadata, instantTime, partitionToRecordMap);
   return partitionToRecordMap;
@@ -953,7 +948,7 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
 processAndCommit(instantTime, () -> {
   Map> partitionToRecordMap =
   HoodieTableMetadataUtil.convertMetadataToRecor

Re: [PR] [HUDI-6822] Fix deletes handling in hbase index when partition path is updated [hudi]

2023-12-04 Thread via GitHub


nsivabalan merged PR #9630:
URL: https://github.com/apache/hudi/pull/9630





(hudi) branch master updated: [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer (#10198)

2023-12-04 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new df4cca8aa56 [HUDI-7154] Fix NPE from empty batch with row writer 
enabled in Hudi Streamer (#10198)
df4cca8aa56 is described below

commit df4cca8aa560d21bde1bf4c1a4079d3d2f760c6f
Author: Y Ethan Guo 
AuthorDate: Mon Dec 4 08:06:59 2023 -0800

[HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi 
Streamer (#10198)


-

Co-authored-by: sivabalan 
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 26 +++
 .../apache/hudi/utilities/streamer/StreamSync.java |  5 ++-
 .../deltastreamer/TestHoodieDeltaStreamer.java | 51 ++
 3 files changed, 62 insertions(+), 20 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index b8dbb18287e..e925e2a5423 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -155,19 +155,27 @@ object HoodieSparkSqlWriter {
 Metrics.shutdownAllMetrics()
   }
 
-  def getBulkInsertRowConfig(writerSchema: Schema, hoodieConfig: HoodieConfig,
+  def getBulkInsertRowConfig(writerSchema: org.apache.hudi.common.util.Option[Schema], hoodieConfig: HoodieConfig,
   basePath: String, tblName: String): HoodieWriteConfig = {
-val writerSchemaStr = writerSchema.toString
-
+var writerSchemaStr : String = null
+if ( writerSchema.isPresent) {
+  writerSchemaStr = writerSchema.get().toString
+}
 // Make opts mutable since it could be modified by tryOverrideParquetWriteLegacyFormatProperty
-val opts = mutable.Map() ++ hoodieConfig.getProps.toMap ++ Map(HoodieWriteConfig.AVRO_SCHEMA_STRING.key -> writerSchemaStr)
+val optsWithoutSchema = mutable.Map() ++ hoodieConfig.getProps.toMap
+val opts = if (writerSchema.isPresent) {
+  optsWithoutSchema ++ Map(HoodieWriteConfig.AVRO_SCHEMA_STRING.key -> writerSchemaStr)
+} else {
+  optsWithoutSchema
+}
+
+if (writerSchema.isPresent) {
+  // Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
+  tryOverrideParquetWriteLegacyFormatProperty(opts, convertAvroSchemaToStructType(writerSchema.get))
+}
 
-// Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
-tryOverrideParquetWriteLegacyFormatProperty(opts, convertAvroSchemaToStructType(writerSchema))
 DataSourceUtils.createHoodieConfig(writerSchemaStr, basePath, tblName, opts)
   }
-
 }
 
 class HoodieSparkSqlWriterInternal {
@@ -779,7 +787,7 @@ class HoodieSparkSqlWriterInternal {
 val sqlContext = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext
 val jsc = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getJavaSparkContext
 
-val writeConfig = HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema, hoodieConfig, basePath.toString, tblName)
+val writeConfig = HoodieSparkSqlWriter.getBulkInsertRowConfig(org.apache.hudi.common.util.Option.of(writerSchema), hoodieConfig, basePath.toString, tblName)
 val overwriteOperationType = Option(hoodieConfig.getString(HoodieInternalConfig.BULKINSERT_OVERWRITE_OPERATION_TYPE))
   .map(WriteOperationType.fromValue)
   .orNull
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
index 19289e650c4..ff2debc8dcc 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@@ -757,7 +757,8 @@ public class StreamSync implements Serializable, Closeable {
 hoodieConfig.setValue(DataSourceWriteOptions.PAYLOAD_CLASS_NAME().key(), cfg.payloadClassName);
 hoodieConfig.setValue(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key(), HoodieSparkKeyGeneratorFactory.getKeyGeneratorClassName(props));
 hoodieConfig.setValue("path", cfg.targetBasePath);
-return HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema, hoodieConfig, cfg.targetBasePath, cfg.targetTableName);
+return HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema != InputBatch.NULL_SCHEMA ? Option.of(writerSchema) : Option.empty(),
+hoodieConfig, cfg.targetBasePath, cfg.targetTableName);
   }
 
   /**
@@ -899,7 +900,7 @@ public class StreamSync implements Serializable, Closeable {
 instantTime = startCommit(instantTime, !autoGenerateRecordKeys);
 
 if (useRow

Re: [PR] [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer [hudi]

2023-12-04 Thread via GitHub


nsivabalan merged PR #10198:
URL: https://github.com/apache/hudi/pull/10198





Re: [PR] [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer [hudi]

2023-12-04 Thread via GitHub


nsivabalan commented on PR #10198:
URL: https://github.com/apache/hudi/pull/10198#issuecomment-1838961926

   Screenshot: https://github.com/apache/hudi/assets/513218/0c31514a-8f93-41a3-adbc-63ccdceb2e5e





Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838828006

   
   ## CI report:
   
   * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] Failed to create Marker file [hudi]

2023-12-04 Thread via GitHub


GergelyKalmar commented on issue #7909:
URL: https://github.com/apache/hudi/issues/7909#issuecomment-1838811464

   We're using Hudi `0.12.1` via AWS Glue and we also started facing the 
"Failed to create marker file" errors. We tried to change the configuration and 
use `hoodie.write.markers.type=DIRECT`, however, now we're seeing throttling 
errors:
   
   ```
   org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType 
UPDATE for partition :20
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1517)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
   Caused by: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; 
Request ID: xxx; S3 Extended Request ID: xxx; Proxy: null), S3 Extended Request 
ID: xxx
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(Amaz
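For context on the trace above: S3 answers 503 "Slow Down" when the request rate on a prefix exceeds its limits, and direct markers create one small object per written data file, which can trip that limit on large commits. Independent of Hudi, the standard client-side remedy is retrying with exponential backoff and jitter. A minimal sketch in Python (the `SlowDownError` class and `with_backoff` helper are illustrative stand-ins, not part of Hudi or the AWS SDK):

```python
import random
import time


class SlowDownError(Exception):
    """Stand-in for an S3 503 'Slow Down' response (illustrative only)."""


def with_backoff(fn, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on SlowDownError with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except SlowDownError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # delays of ~0.1s, 0.2s, 0.4s, ... scaled by jitter in [1, 2)
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

In the EMRFS stack shown in the trace, retry behavior is normally tuned through EMRFS/SDK retry settings rather than application code; switching markers back to the timeline-server-based type (where the storage scheme supports it) attacks the cause by collapsing many small S3 writes into requests against the timeline server.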

[jira] [Updated] (HUDI-2857) HoodieTableMetaClient.TEMPFOLDER_NAME causes IllegalArgumentException in windows environment

2023-12-04 Thread wang fanming (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang fanming updated HUDI-2857:
---
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> HoodieTableMetaClient.TEMPFOLDER_NAME causes IllegalArgumentException in 
> windows environment
> 
>
> Key: HUDI-2857
> URL: https://issues.apache.org/jira/browse/HUDI-2857
> Project: Apache Hudi
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: win10   spark2.4.4 hudi 0.9.0
>Reporter: wang fanming
>Priority: Minor
>  Labels: core-flow-ds, easyfix, sev:high
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code:java}
> // imports from the Hudi quickstart (added for completeness)
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val tableName = "cow_prices"
> val basePath = "hdfs://x:9000//tmp//cow_prices//"
> val dataGen = new DataGenerator
> // spark-shell
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(PRECOMBINE_FIELD.key(), "ts").
> option(RECORDKEY_FIELD.key(), "uuid").
> option(PARTITIONPATH_FIELD.key(), "partitionpath").
> option(TBL_NAME.key(), tableName).
> mode(Overwrite).
> save(basePath) {code}
> The above is the sample code from Hudi's official quickstart. I ran the 
> Spark program directly in a Windows 10 environment, writing the data to a 
> remote HDFS. The following exception occurred:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Not in marker dir. Marker 
> Path=hdfs://10.38.23.2:9000/tmp/cow_prices/.hoodie\.temp/20211125163531/asia/india/chennai/c9218a3b-f248-436b-b41f-4a0b968dfff2-0_2-27-29_20211125163531.parquet.marker.CREATE,
>  Expected Marker Root=/tmp/cow_prices/.hoodie/.temp/20211125163531
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>     at 
> org.apache.hudi.common.util.MarkerUtils.stripMarkerFolderPrefix(MarkerUtils.java:87)
>     at 
> org.apache.hudi.common.util.MarkerUtils.stripMarkerFolderPrefix(MarkerUtils.java:75)
>     at 
> org.apache.hudi.table.marker.DirectWriteMarkers.translateMarkerToDataPath(DirectWriteMarkers.java:153)
>     at 
> org.apache.hudi.table.marker.DirectWriteMarkers.lambda$createdAndMergedDataPaths$69cdea3b$1(DirectWriteMarkers.java:142)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:78)
>     at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>     at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>     at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>     at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> After investigation, the root cause of the failure was found to be 
> that 
> {code:java}
> HoodieTableMetaClient.TEMPFOLDER_
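The failure mode described above (analysis truncated in the archive) is easier to see in isolation: when the temp folder name is built with the JVM's OS-dependent separator, Windows produces `.hoodie\.temp`, so the generated marker path no longer lies under the POSIX-style marker root and the prefix check throws. A minimal sketch of that check (a simplified, illustrative version of the logic in `MarkerUtils.stripMarkerFolderPrefix`; the function name and paths here are assumptions):

```python
def strip_marker_folder_prefix(marker_path: str, marker_root: str) -> str:
    """Simplified prefix check: the marker path must lie under the
    expected marker root, which is always a POSIX-style path."""
    if marker_root not in marker_path:
        raise ValueError(
            f"Not in marker dir. Marker Path={marker_path}, "
            f"Expected Marker Root={marker_root}"
        )
    return marker_path.split(marker_root, 1)[1].lstrip("/")


MARKER_ROOT = "/tmp/cow_prices/.hoodie/.temp/20211125163531"

# Built with "/" everywhere: passes the check.
good = MARKER_ROOT + "/f1-0_2-27-29.parquet.marker.CREATE"

# Built with the OS separator on Windows: ".hoodie\.temp" no longer
# contains the expected root, so the check raises -- mirroring the
# IllegalArgumentException in the stack trace above.
bad = ("/tmp/cow_prices/.hoodie\\.temp/20211125163531"
       "/f1-0_2-27-29.parquet.marker.CREATE")
```

Calling `strip_marker_folder_prefix(good, MARKER_ROOT)` returns the marker's relative name, while the Windows-style path raises, which is why the job works on Linux clients but fails when the driver runs on Windows.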
