Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-17 Thread via GitHub


yihua commented on code in PR #9879:
URL: https://github.com/apache/hudi/pull/9879#discussion_r1363329412


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -241,7 +241,12 @@ private void scanInternalV1(Option keySpecOpt) {
 try {
   // Iterate over the paths
   logFormatReaderWrapper = new HoodieLogFormatReader(fs,
-  logFilePaths.stream().map(logFile -> new HoodieLogFile(new CachingPath(logFile))).collect(Collectors.toList()),
+  logFilePaths.stream()

Review Comment:
   Should the same filtering logic be wrapped into a util method, given it is also used in `scanInternalV2`?
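A hedged sketch of what such a shared helper could look like. This is not Hudi's API: `LogFileFilterSketch`, `instantOf`, and the plain-string/`Set` stand-ins for `HoodieLogFile`, `CachingPath`, and the completed-instants timeline are hypothetical simplifications for illustration only.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LogFileFilterSketch {

    // Hypothetical stand-in for
    // completedInstantsTimeline.containsOrBeforeTimelineStarts(deltaCommitTime).
    static boolean isCommitted(Set<String> completedInstants, String deltaCommitTime) {
        return completedInstants.contains(deltaCommitTime);
    }

    // The shared helper both scanInternalV1 and scanInternalV2 could call:
    // keep only log files whose delta-commit instant is in the completed
    // timeline, dropping leftovers from failed writes.
    static List<String> committedLogFiles(List<String> logFilePaths,
                                          Set<String> completedInstants) {
        return logFilePaths.stream()
                .filter(p -> isCommitted(completedInstants, instantOf(p)))
                .collect(Collectors.toList());
    }

    // Hypothetical instant extraction from a name like "fileId_001.log.1" -> "001".
    static String instantOf(String path) {
        String[] parts = path.split("_");
        return parts[1].split("\\.")[0];
    }
}
```

The real helper would operate on `HoodieLogFile` objects and the timeline API rather than strings, but the filtering shape would be the same in both scan paths.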



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -241,7 +241,12 @@ private void scanInternalV1(Option keySpecOpt) {
 try {
   // Iterate over the paths
   logFormatReaderWrapper = new HoodieLogFormatReader(fs,
-  logFilePaths.stream().map(logFile -> new HoodieLogFile(new CachingPath(logFile))).collect(Collectors.toList()),
+  logFilePaths.stream()
+  .map(filePath -> new HoodieLogFile(new CachingPath(filePath)))
+  // hit an uncommitted file possibly from a failed write, skip processing this one
+  .filter(logFile -> completedInstantsTimeline.containsOrBeforeTimelineStarts(logFile.getDeltaCommitTime())

Review Comment:
   Should this logic be dependent on table version?



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -269,11 +274,6 @@ private void scanInternalV1(Option keySpecOpt) {
   break;
 }
 if (logBlock.getBlockType() != CORRUPT_BLOCK && logBlock.getBlockType() != COMMAND_BLOCK) {
-  if (!completedInstantsTimeline.containsOrBeforeTimelineStarts(instantTime)

Review Comment:
   Similar here w.r.t. the table version. Before the log file name change, this logic is still needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #6983:
URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767760342

   
   ## CI report:
   
   * 003721a9e975415951aed2725a744b29f87cacc1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20377)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch master updated: [MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813)

2023-10-17 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9b7e539a902 [MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813)
9b7e539a902 is described below

commit 9b7e539a902cb2cf594a799957af260cf00ab8b4
Author: Tim Brown 
AuthorDate: Wed Oct 18 01:21:51 2023 -0500

[MINOR] HFileBootstrapIndex: use try-with-resources in two places (#9813)
---
 .../org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
index 0d821ffe103..27314f150dc 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
@@ -322,8 +322,7 @@ public class HFileBootstrapIndex extends BootstrapIndex {

 @Override
 public List getSourceFileMappingForPartition(String partition) {
-  try {
-HFileScanner scanner = partitionIndexReader().getScanner(true, false);
+  try (HFileScanner scanner = partitionIndexReader().getScanner(true, false)) {
 KeyValue keyValue = new KeyValue(Bytes.toBytes(getPartitionKey(partition)), new byte[0], new byte[0],
 HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put, new byte[0]);
 if (scanner.seekTo(keyValue) == 0) {
@@ -355,8 +354,7 @@ public class HFileBootstrapIndex extends BootstrapIndex {
   // Arrange input Keys in sorted order for 1 pass scan
   List fileGroupIds = new ArrayList<>(ids);
   Collections.sort(fileGroupIds);
-  try {
-HFileScanner scanner = fileIdIndexReader().getScanner(true, false);
+  try (HFileScanner scanner = fileIdIndexReader().getScanner(true, false)) {
 for (HoodieFileGroupId fileGroupId : fileGroupIds) {
   KeyValue keyValue = new KeyValue(Bytes.toBytes(getFileGroupKey(fileGroupId)), new byte[0], new byte[0],
   HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put, new byte[0]);
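The change is the standard try-with-resources idiom: a resource declared in the `try` header is closed automatically on every exit path, including early returns and exceptions, so the scanner can no longer leak. A minimal self-contained sketch (the `Scanner` class below is a toy stand-in, not HBase's `HFileScanner`):

```java
public class TryWithResourcesSketch {

    // Toy AutoCloseable standing in for HFileScanner.
    static class Scanner implements AutoCloseable {
        boolean closed = false;
        int seekTo(String key) { return key.isEmpty() ? -1 : 0; }
        @Override public void close() { closed = true; }
    }

    static Scanner last; // kept only so the test can observe closure

    static boolean lookup(String key) {
        // Before the fix: "try { Scanner s = ...; ... }" left s open on exceptions.
        // After the fix: close() runs automatically when the block exits.
        try (Scanner scanner = new Scanner()) {
            last = scanner;
            return scanner.seekTo(key) == 0;
        }
    }
}
```

Any class implementing `AutoCloseable` can be used this way, which is why the two call sites in the commit could be fixed with a one-line change each.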



Re: [PR] [MINOR] HFileBootstrapIndex: use try-with-resources in two places [hudi]

2023-10-17 Thread via GitHub


yihua merged PR #9813:
URL: https://github.com/apache/hudi/pull/9813





Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767680714

   
   ## CI report:
   
   * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN
   * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN
   * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20374)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Trino queries failing when hudi.metadata_enabled is set to true. [hudi]

2023-10-17 Thread via GitHub


BalaMahesh commented on issue #9758:
URL: https://github.com/apache/hudi/issues/9758#issuecomment-1767671730

   @ad1happy2go  / @codope - 
   
   hoodie.metadata.compact.max.delta.commits=1
   
   With this config I expect compaction to run after every metadata delta commit, but that is not happening: the delta commits pile up under the metadata path, and only after a large number of files have accumulated is compaction triggered, at which point it fails with an OOM error.
   
   I haven't had enough time to go through the code flow for the metadata table.
   
   We can increase the memory fraction, but if that is the problem, compaction should at least be triggered after every delta commit and fail right away with hoodie.metadata.compact.max.delta.commits set to 1.
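For reference, a minimal sketch of the configs involved. The property under discussion is quoted above; `hoodie.metadata.enable` is Hudi's documented switch for the metadata table, and the values here are only illustrative, not a recommendation:

```properties
# Enable the internal metadata table
hoodie.metadata.enable=true
# Compact the metadata table after every N delta commits; 1 = after each commit
hoodie.metadata.compact.max.delta.commits=1
```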





Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


zhuanshenbsj1 commented on code in PR #9878:
URL: https://github.com/apache/hudi/pull/9878#discussion_r1363170585


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:
##
@@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
   // bootstrap
   final DataStream hoodieRecordDataStream =
   Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite);
+
   // write pipeline
   pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
-  // compaction
+
+  // insert cluster mode
+  if (OptionsResolver.isInsertClusterMode(conf)) {
+return Pipelines.clean(conf, pipeline);
+  }
+
+  // upsert mode
   if (OptionsResolver.needsAsyncCompaction(conf)) {
 // use synchronous compaction for bounded source.
 if (context.isBounded()) {
   conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
 }
 return Pipelines.compact(conf, pipeline);
-  } else {
+  } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) {
 return Pipelines.clean(conf, pipeline);
+  } else {
+return Pipelines.dummySink(pipeline);

Review Comment:
   Similar to clustering, cleaning is performed wherever compaction is performed (inline or offline).
   
   ```
 if (OptionsResolver.needsAsyncCompaction(conf)) { // 1
   // use synchronous compaction for bounded source.
   if (context.isBounded()) {
 conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
   }
   return Pipelines.compact(conf, pipeline);
 } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) { // 2.1
   return Pipelines.clean(conf, pipeline);
 } else { // 2.2
   return Pipelines.dummySink(pipeline);
 }
   
   ```
   
   1. If Flink's online asynchronous compaction is enabled, the cluster/compactor commit operator will do the clean.
   2. If it is disabled, there are two situations:
      2.1 With lazy cleaning enabled, the clean operator must be added for rollback.
      2.2 With lazy cleaning disabled, there is no need to consider rollback; clean is called after the offline task completes.






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767633844

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 8826dfc2e2487c43703787c737d8143c6bb7285a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20373)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9053:
URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767633129

   
   ## CI report:
   
   * ff5cd07154d48f18d8034075c8dfc3990b204cbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18532)
   * 8caf378576e4c7e68cdd32d1e24d89afc05b056b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20379)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-5220] fix hive snapshot query add non hoodie paths file status [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #7206:
URL: https://github.com/apache/hudi/pull/7206#issuecomment-1767631379

   
   ## CI report:
   
   * 5d7a1c4824c100a48c95e3d017822aa1062ad8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20258)
   * 9d653e9325beb6e3391607d073dfa8c030ee798f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20378)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #6983:
URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767631150

   
   ## CI report:
   
   * 0593cd212628684db658d7a8bdd8fc320069d090 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20225)
   * 003721a9e975415951aed2725a744b29f87cacc1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20377)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9053:
URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767625108

   
   ## CI report:
   
   * ff5cd07154d48f18d8034075c8dfc3990b204cbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18532)
   * 8caf378576e4c7e68cdd32d1e24d89afc05b056b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-5220] fix hive snapshot query add non hoodie paths file status [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #7206:
URL: https://github.com/apache/hudi/pull/7206#issuecomment-1767623659

   
   ## CI report:
   
   * 5d7a1c4824c100a48c95e3d017822aa1062ad8cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20258)
   * 9d653e9325beb6e3391607d073dfa8c030ee798f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-5031] Fix MERGE INTO creates empty partition files when source table has partitions but target table does not [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #6983:
URL: https://github.com/apache/hudi/pull/6983#issuecomment-1767623459

   
   ## CI report:
   
   * 0593cd212628684db658d7a8bdd8fc320069d090 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20225)
   * 003721a9e975415951aed2725a744b29f87cacc1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9879:
URL: https://github.com/apache/hudi/pull/9879#issuecomment-1767619953

   
   ## CI report:
   
   * aa997ac209a57ace18f76bdc5fa602d0bead8345 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20376)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9878:
URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767619917

   
   ## CI report:
   
   * 4eb8e5387cb728bde662a45f77062bd574c6cff0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20375)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767619892

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767619726

   
   ## CI report:
   
   * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN
   * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN
   * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363)
   * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20374)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


danny0405 commented on code in PR #9878:
URL: https://github.com/apache/hudi/pull/9878#discussion_r1363158192


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:
##
@@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
   // bootstrap
   final DataStream hoodieRecordDataStream =
   Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite);
+
   // write pipeline
   pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
-  // compaction
+
+  // insert cluster mode
+  if (OptionsResolver.isInsertClusterMode(conf)) {
+return Pipelines.clean(conf, pipeline);
+  }
+
+  // upsert mode
   if (OptionsResolver.needsAsyncCompaction(conf)) {
 // use synchronous compaction for bounded source.
 if (context.isBounded()) {
   conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
 }
 return Pipelines.compact(conf, pipeline);
-  } else {
+  } else if (OptionsResolver.isLazyFailedWritesCleanPolicy(conf)) {
 return Pipelines.clean(conf, pipeline);
+  } else {
+return Pipelines.dummySink(pipeline);

Review Comment:
   Sorry, I didn't get why you removed the clean operators then?






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


danny0405 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1363138608


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -206,6 +213,9 @@ public IndexedRecord next() {
 IndexedRecord record = this.reader.read(null, decoder);
 this.dis.skipBytes(recordLength);
 this.readRecords++;
+if (this.promotedSchema.isPresent()) {
+  return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get());

Review Comment:
   Yeah, if there are some data types that require a rewrite, then keeping it as is might be good for now.






Re: [PR] [HUDI-6369] Fix spacial curve with sample strategy fails when 0 or 1 rows only is incoming [hudi]

2023-10-17 Thread via GitHub


bvaradar commented on PR #9053:
URL: https://github.com/apache/hudi/pull/9053#issuecomment-1767610090

   Looks good to me. Resurrecting this PR with a rebase and a minor test class rename.





Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9879:
URL: https://github.com/apache/hudi/pull/9879#issuecomment-1767581232

   
   ## CI report:
   
   * aa997ac209a57ace18f76bdc5fa602d0bead8345 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9878:
URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767581203

   
   ## CI report:
   
   * 4eb8e5387cb728bde662a45f77062bd574c6cff0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767581062

   
   ## CI report:
   
   * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN
   * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN
   * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363)
   * 1597dfa2436c2789e5a5e8dbecfe4f900383c35d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767581022

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371)
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 8826dfc2e2487c43703787c737d8143c6bb7285a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20373)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767573921

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371)
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 8826dfc2e2487c43703787c737d8143c6bb7285a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


zhuanshenbsj1 commented on PR #9878:
URL: https://github.com/apache/hudi/pull/9878#issuecomment-1767563266

   cc @danny0405  @yihua 





Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]

2023-10-17 Thread via GitHub


TengHuo commented on PR #9413:
URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767562595

   > In my company, I also encountered a situation where the entire table directory was deleted
   
   Hi @wqlsdb, would you like to discuss it offline or over email?
   
   We have encountered this issue multiple times internally and are trying to find the root cause. It could be helpful if we could sync on some common information.





Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]

2023-10-17 Thread via GitHub


danny0405 commented on PR #9413:
URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767562306

   @wqlsdb, would you mind cherry-picking this fix into your local repo?





[jira] [Updated] (HUDI-6952) Skip reading the uncommitted log files for log reader

2023-10-17 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6952:
-
Labels: pull-request-available  (was: )

> Skip reading the uncommitted log files for log reader
> -
>
> Key: HUDI-6952
> URL: https://issues.apache.org/jira/browse/HUDI-6952
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-17 Thread via GitHub


danny0405 opened a new pull request, #9879:
URL: https://github.com/apache/hudi/pull/9879

   ### Change Logs
   
   This is to avoid potential exceptions when the reader is processing an 
uncommitted log file while the cleaning or rollback service removes the log 
file.
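The guard amounts to a timeline-membership check before a log file is handed to the reader. A simplified, self-contained Java sketch of that filter (the name `filterCommitted`, the string instant times, and the lexicographic before-timeline-start comparison are illustrative stand-ins for Hudi's `HoodieTimeline#containsOrBeforeTimelineStarts`):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SkipUncommittedSketch {

    // Keep only log files whose delta commit time belongs to a completed
    // instant, or that predate the start of the retained timeline.
    // `timelineStart` and the lexicographic compare are simplified stand-ins
    // for the real timeline check.
    static List<String> filterCommitted(List<String> logFileCommitTimes,
                                        Set<String> completedInstants,
                                        String timelineStart) {
        return logFileCommitTimes.stream()
                // hit an uncommitted file, possibly from a failed write: skip it
                .filter(t -> completedInstants.contains(t)
                        || t.compareTo(timelineStart) < 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> completed = new HashSet<>(Arrays.asList("002", "003"));
        // "004" is uncommitted (e.g. a failed write); "001" predates the timeline
        List<String> readable = filterCommitted(
                Arrays.asList("001", "002", "004"), completed, "002");
        System.out.println(readable); // [001, 002]
    }
}
```

As suggested in review, a predicate like this could live in a shared util method so that `scanInternalV1` and `scanInternalV2` stay in sync.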
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-6675] Fix Clean action will delete the whole table [hudi]

2023-10-17 Thread via GitHub


wqlsdb commented on PR #9413:
URL: https://github.com/apache/hudi/pull/9413#issuecomment-1767559647

   In my company, I also encountered a situation where the entire table 
directory was deleted.





[PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-17 Thread via GitHub


zhuanshenbsj1 opened a new pull request, #9878:
URL: https://github.com/apache/hudi/pull/9878

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[I] [SUPPORT] Dirty data filtering failed [hudi]

2023-10-17 Thread via GitHub


deasea opened a new issue, #9877:
URL: https://github.com/apache/hudi/issues/9877

   Hi, I encountered an exception:
   
![image](https://github.com/apache/hudi/assets/35282893/cb4c90c0-f62a-4799-b598-a7d7348ae293)
   
   I want to filter out this dirty data during ingestion into the lake. I tried 
two parameters; neither worked:
   
   write.ignore.failed: true
   hoodie.datasource.write.streaming.ignore.failed.batch: true
   
   Flink 1.13.5, Hudi 0.13/0.10
   
   How should we skip dirty data in this scenario?
   
   
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767533835

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371)
 
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767534035

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767526069

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767518410

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * bc45850e7a2962242d4e99e88b07c89b8c8e19bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20371)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-6953) Optimizing hudi sink operators generation

2023-10-17 Thread zhuanshenbsj1 (Jira)
zhuanshenbsj1 created HUDI-6953:
---

 Summary: Optimizing hudi sink operators generation
 Key: HUDI-6953
 URL: https://issues.apache.org/jira/browse/HUDI-6953
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: zhuanshenbsj1
 Fix For: 1.0.0








Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-17 Thread via GitHub


stream2000 commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1767491739

   > could you write a test or provide some sample code to trigger this issue? 
I'm a little unclear if this is solving a race condition or something else
   
   @jonvex We can trigger the issue with the following code (not stable):
   ```scala
  test("Test concurrent overwrite") {
    withTempDir { tmp =>
      import spark.implicits._
      val day = "2021-08-02"
      val threadCount = 12
      val df = Seq((1, "a1", 10, 1000, day, 12)).toDF("id", "name", "value", "ts", "day", "hh")
      val executors = Executors.newFixedThreadPool(threadCount)
      var futures: Array[Future[_]] = new Array(threadCount / 2)
      for (i <- 0 until threadCount / 2) {
        val overwriteTask = new Runnable {
          override def run(): Unit = {
            val tableName = "table_name" + i
            val tablePath = s"${tmp.getCanonicalPath}/$tableName"
            // Write a table by spark dataframe.
            df.write.format("hudi")
              .option(HoodieWriteConfig.TBL_NAME.key, tableName)
              .option(TABLE_TYPE.key, MOR_TABLE_TYPE_OPT_VAL)
              .option(RECORDKEY_FIELD.key, "id")
              .option(PRECOMBINE_FIELD.key, "ts")
              .option(PARTITIONPATH_FIELD.key, "day,hh")
              .option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "1")
              .option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "1")
              .option(HoodieWriteConfig.ALLOW_OPERATION_METADATA_FIELD.key, "true")
              .mode(SaveMode.Overwrite)
              .save(tablePath)
          }
        }
        futures(i) = executors.submit(overwriteTask)
      }
      futures.foreach(f => f.get())
      futures = new Array(threadCount)
      for (i <- 0 until threadCount) {
        val overwriteTask = new Runnable {
          override def run(): Unit = {
            val tableName = "table_name" + (12 - i)
            val tablePath = s"${tmp.getCanonicalPath}/$tableName"
            // Write a table by spark dataframe.
            df.write.format("hudi")
              .option(HoodieWriteConfig.TBL_NAME.key, tableName)
              .option(TABLE_TYPE.key, MOR_TABLE_TYPE_OPT_VAL)
              .option(RECORDKEY_FIELD.key, "id")
              .option(PRECOMBINE_FIELD.key, "ts")
              .option(PARTITIONPATH_FIELD.key, "day,hh")
              .option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "1")
              .option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "1")
              .option(HoodieWriteConfig.ALLOW_OPERATION_METADATA_FIELD.key, "true")
              .mode(SaveMode.Append)
              .save(tablePath)
          }
        }
        futures(i) = executors.submit(overwriteTask)
      }
      futures.foreach(f => f.get())
    }
  }
   ```
   
   
   And we sometimes get an exception stack trace like this: 
   
   ```txt
   Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path /private/var/folders/q1/_zbtr5t97rz27jb_f3ph8chmgp/T/spark-d9e6236f-4d31-4ea1-a60a-df21c5d1d545/table_name12/.hoodie
       at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:57)
       at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:149)
       at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:735)
       at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:91)
       at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:826)
       at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$getHoodieTableConfig$1(HoodieSparkSqlWriter.scala:1165)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.hudi.HoodieSparkSqlWriter$.getHoodieTableConfig(HoodieSparkSqlWriter.scala:1166)
       at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:172)
       at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:133)
   ```
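The failure mode is a check-then-act race: one Overwrite job deletes the table directory while another job's meta client validates `.hoodie`. A deterministic replay of that interleaving with plain `java.io.File` (the directory names and the sequential ordering are illustrative; this is not Hudi code):

```java
import java.io.File;

public class OverwriteRaceSketch {
    // Deterministic replay of the interleaving behind TableNotFoundException:
    // overwrite job A drops the table directory, and job B's meta-client
    // validity check runs in the gap before A re-initializes it.
    static boolean metaVisibleDuringGap(File tableDir) {
        File meta = new File(tableDir, ".hoodie");
        meta.mkdirs();                       // table initially exists
        deleteRecursively(meta);             // A: SaveMode.Overwrite deletes the table
        boolean visibleToB = meta.exists();  // B: checkTableValidity runs here -> fails
        meta.mkdirs();                       // A: re-creates .hoodie afterwards
        return visibleToB;
    }

    // Remove a directory tree bottom-up, since File#delete needs empty dirs.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"), "table_name12");
        tmp.mkdirs();
        System.out.println("meta visible to job B during the gap: "
                + metaVisibleDuringGap(tmp));
        // prints: meta visible to job B during the gap: false
    }
}
```

With real concurrent jobs the gap is nondeterministic, which is why the repro above is flagged as "not stable".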





[jira] [Assigned] (HUDI-6653) Support position-based merging of base and log files

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6653:
---

Assignee: Lin Liu  (was: Ethan Guo)

> Support position-based merging of base and log files
> 
>
> Key: HUDI-6653
> URL: https://issues.apache.org/jira/browse/HUDI-6653
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Major
>






[jira] [Updated] (HUDI-6653) Support position-based merging of base and log files

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6653:

Status: In Progress  (was: Open)

> Support position-based merging of base and log files
> 
>
> Key: HUDI-6653
> URL: https://issues.apache.org/jira/browse/HUDI-6653
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-6653) Support position-based merging of base and log files

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6653:

Status: Patch Available  (was: In Progress)

> Support position-based merging of base and log files
> 
>
> Key: HUDI-6653
> URL: https://issues.apache.org/jira/browse/HUDI-6653
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Major
>






[jira] [Assigned] (HUDI-6798) Implement event-time-based merging mode in FileGroupReader

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6798:
---

Assignee: Ethan Guo

> Implement event-time-based merging mode in FileGroupReader
> --
>
> Key: HUDI-6798
> URL: https://issues.apache.org/jira/browse/HUDI-6798
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6788) Integrate FileGroupReader with MergeOnReadInputFormat for Flink

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6788:
---

Assignee: (was: Ethan Guo)

> Integrate FileGroupReader with MergeOnReadInputFormat for Flink
> ---
>
> Key: HUDI-6788
> URL: https://issues.apache.org/jira/browse/HUDI-6788
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6787:
---

Assignee: (was: Ethan Guo)

> Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> -
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6949) Spark support non-blocking concurrency control

2023-10-17 Thread Jing Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhang reassigned HUDI-6949:


Assignee: Jing Zhang

> Spark support non-blocking concurrency control
> --
>
> Key: HUDI-6949
> URL: https://issues.apache.org/jira/browse/HUDI-6949
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, spark-sql
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>






[jira] [Updated] (HUDI-6801) Implement merging of partial updates in FileGroupReader

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6801:

Status: In Progress  (was: Open)

> Implement merging of partial updates in FileGroupReader
> ---
>
> Key: HUDI-6801
> URL: https://issues.apache.org/jira/browse/HUDI-6801
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6800:

Status: Patch Available  (was: In Progress)

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6800:

Status: In Progress  (was: Open)

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6800:
-
Labels: pull-request-available  (was: )

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


yihua opened a new pull request, #9876:
URL: https://github.com/apache/hudi/pull/9876

   ### Change Logs
   
   This PR adds the functionality to write partial updates to the data blocks 
in MOR tables, for Spark SQL MERGE INTO. 
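Conceptually, a partial-update data block stores only the changed columns of each record, and the merge on read overlays them on the older version. A toy sketch with maps (illustrative only; the actual implementation operates on Avro/engine-native records):

```java
import java.util.HashMap;
import java.util.Map;

public class PartialUpdateSketch {
    // Overlay a partial update (only the changed columns) on the older full
    // record; untouched columns keep their old values.
    static Map<String, Object> mergePartial(Map<String, Object> older,
                                            Map<String, Object> partialUpdate) {
        Map<String, Object> merged = new HashMap<>(older);
        merged.putAll(partialUpdate);   // newer partial values win per column
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> older = new HashMap<>();
        older.put("id", 1);
        older.put("name", "a1");
        older.put("price", 10);
        Map<String, Object> update = new HashMap<>();
        update.put("price", 25);        // MERGE INTO only touched `price`
        System.out.println(mergePartial(older, update));
        // price becomes 25; id and name are unchanged
    }
}
```

Writing only the touched columns to the log block is what reduces write amplification relative to re-logging full records.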
   
   ### Impact
   
   Reduces write amplification
   
   ### Risk level
   
   medium
   
   ### Documentation Update
   
   New feature docs
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767481158

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369)
 
   * bc45850e7a2962242d4e99e88b07c89b8c8e19bf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-6952) Skip reading the uncommitted log files for log reader

2023-10-17 Thread Danny Chen (Jira)
Danny Chen created HUDI-6952:


 Summary: Skip reading the uncommitted log files for log reader
 Key: HUDI-6952
 URL: https://issues.apache.org/jira/browse/HUDI-6952
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: Danny Chen
 Fix For: 1.0.0








[jira] [Assigned] (HUDI-6801) Implement merging of partial updates in FileGroupReader

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6801:
---

Assignee: Ethan Guo

> Implement merging of partial updates in FileGroupReader
> ---
>
> Key: HUDI-6801
> URL: https://issues.apache.org/jira/browse/HUDI-6801
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6950.

Resolution: Fixed

Fixed via master branch: fae20cd12a0057c8dda7f302699f65a2fe335d0a

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Assignee: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg, dump.jpg, 
> fix_stages.jpg, oom_stages.jpg
>
>
> Currently, a table with multiple partition levels can easily cause the driver to OOM.
> e.g.:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
> id string,
> name string,
> dt bigint,
> day STRING COMMENT 'date partition',
> hour INT COMMENT 'hour partition'
> ) using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
> 'hoodie.datasource.meta.sync.enable' 'false',
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField' = 'dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets' = 512
> )
> PARTITIONED BY (day, hour);
>
> SELECT count(1) FROM hudi_test.tmp_hudi_test_1 WHERE day = '2023-10-17' would
> list a large number of FileStatus objects to the driver, and the driver would
> OOM (e.g. a table with hundreds of billions of records in the partition
> day='2023-10-17').
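The fix processes the listing level by level instead of materializing every path's children at once on the driver. A minimal breadth-first sketch of that traversal (the callbacks `listChildren` and `hasPartitionMetadata` are hypothetical stand-ins for the filesystem and partition-metadata calls):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class LevelListingSketch {
    // Breadth-first partition discovery: only one directory level is in
    // flight at a time, so the driver never holds the full file listing.
    static List<String> listPartitions(String root,
                                       Function<String, List<String>> listChildren,
                                       Predicate<String> hasPartitionMetadata) {
        List<String> partitions = new ArrayList<>();
        Deque<String> pathsToList = new ArrayDeque<>();
        pathsToList.add(root);
        while (!pathsToList.isEmpty()) {
            String path = pathsToList.poll();
            if (hasPartitionMetadata.test(path)) {
                partitions.add(path);                          // terminal partition dir
            } else {
                pathsToList.addAll(listChildren.apply(path));  // descend one level
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        java.util.Map<String, List<String>> fs = new java.util.HashMap<>();
        fs.put("/t", java.util.Arrays.asList("/t/day=1", "/t/day=2"));
        fs.put("/t/day=1", java.util.Arrays.asList("/t/day=1/hour=0"));
        fs.put("/t/day=2", java.util.Arrays.asList("/t/day=2/hour=0"));
        java.util.Set<String> leaves = new java.util.HashSet<>(
                java.util.Arrays.asList("/t/day=1/hour=0", "/t/day=2/hour=0"));
        System.out.println(listPartitions("/t",
                p -> fs.getOrDefault(p, java.util.Collections.emptyList()),
                leaves::contains));
        // prints [/t/day=1/hour=0, /t/day=2/hour=0]
    }
}
```

In the actual patch each level is listed with bounded Spark parallelism; the sketch only shows the memory-bounding shape of the traversal.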





[hudi] branch master updated: [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition (#9875)

2023-10-17 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fae20cd12a0 [HUDI-6950] Query should process listed partitions to 
avoid driver oom due to large number files in table first partition (#9875)
fae20cd12a0 is described below

commit fae20cd12a0057c8dda7f302699f65a2fe335d0a
Author: xuzifu666 
AuthorDate: Wed Oct 18 08:40:03 2023 +0800

[HUDI-6950] Query should process listed partitions to avoid driver oom due 
to large number files in table first partition (#9875)
---
 .../metadata/FileSystemBackedTableMetadata.java| 95 --
 1 file changed, 54 insertions(+), 41 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
index f4cd7c29074..3737793e0c6 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
@@ -54,7 +54,6 @@ import java.util.List;
 import java.util.Map;
 import java.util.concurrent.CopyOnWriteArrayList;
 import java.util.stream.Collectors;
-import java.util.stream.Stream;
 
 /**
  * Implementation of {@link HoodieTableMetadata} based file-system-backed 
table metadata.
@@ -157,52 +156,66 @@ public class FileSystemBackedTableMetadata extends 
AbstractHoodieTableMetadata {
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
+  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path))
-.filter(status -> status.isDirectory() && 
!status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-.map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, relativePartitionPath, 
urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileLis

Re: [PR] [HUDI-6950] Query should process listed partitions to avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


danny0405 merged PR #9875:
URL: https://github.com/apache/hudi/pull/9875





Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]

2023-10-17 Thread via GitHub


arunvasudevan commented on issue #9861:
URL: https://github.com/apache/hudi/issues/9861#issuecomment-1767272065

   Yes, I checked the archive folder, and it is empty in this case.
   
   Here are the writer configurations:
   
   hoodie.datasource.hive_sync.database: 
   hoodie.datasource.hive_sync.mode: HMS
   hoodie.datasource.write.precombine.field: source_ts_ms
   hoodie.datasource.hive_sync.partition_extractor_class: 
org.apache.hudi.hive.NonPartitionedExtractor
   hoodie.parquet.max.file.size: 67108864
   hoodie.datasource.meta.sync.enable: true
   hoodie.datasource.hive_sync.skip_ro_suffix: true
   hoodie.metadata.enable: false
   hoodie.datasource.hive_sync.table: 
   hoodie.index.type: SIMPLE
   hoodie.clean.automatic: true
   hoodie.datasource.write.operation: upsert
   hoodie.metrics.reporter.type: CLOUDWATCH
   hoodie.datasource.hive_sync.enable: true
   hoodie.datasource.write.recordkey.field: version_id
   hoodie.table.name: ride_version
   hoodie.datasource.hive_sync.jdbcurl: jdbc:hive2://ip-:1
   hoodie.datasource.write.table.type: MERGE_ON_READ
   hoodie.simple.index.parallelism: 240
   hoodie.write.lock.dynamodb.partition_key: 
   hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
   hoodie.compact.inline: true
   hoodie.client.heartbeat.interval_in_ms: 60
   hoodie.datasource.compaction.async.enable: true
   hoodie.metrics.on: true
   hoodie.datasource.write.keygenerator.class: 
org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.cleaner.policy.failed.writes: LAZY
   hoodie.keep.max.commits: 1650
   hoodie.cleaner.hours.retained: 168
   hoodie.write.lock.dynamodb.table: peloton-prod-hudi-write-lock
   hoodie.write.lock.provider: 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   hoodie.keep.min.commits: 1600
   hoodie.datasource.write.partitionpath.field: 
   hoodie.compact.inline.max.delta.commits: 1
   hoodie.write.concurrency.mode: optimistic_concurrency_control
   hoodie.write.lock.dynamodb.region: us-east-1





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767246606

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Assigned] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-17 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu reassigned HUDI-6793:
-

Assignee: Lin Liu  (was: Jonathan Vexler)

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader

2023-10-17 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6790:
--
Status: In Progress  (was: Open)

> Support incremental read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader

2023-10-17 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu reassigned HUDI-6790:
-

Assignee: Lin Liu  (was: Jonathan Vexler)

> Support incremental read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767225223

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * 66df70555e5fc284eeedb1fdbfecbc141b03678a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20368)
 
   * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20369)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767156435

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * ce8a55919d455f5582a0aa18069d57cbd645e37b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20362)
 
   * 66df70555e5fc284eeedb1fdbfecbc141b03678a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20368)
 
   * 9e8e32cf81bc88bf9b9cd2b5ebb26fa5d195e6cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767139167

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * ce8a55919d455f5582a0aa18069d57cbd645e37b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20362)
 
   * 66df70555e5fc284eeedb1fdbfecbc141b03678a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


lokesh-lingarajan-0310 commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1767096533

   @the-other-tim-brown - 
   What would the behavior be when this is false and schema evolution is 
enabled? Is there an option where it would auto-drop the column in the target 
table?
   
   Reply - Currently the plan is to support all of these evolutions out of the box and not rely on the schema evolution flags. The idea of the delete flag is that we are changing the default out-of-box behavior for deleted columns, so to avoid breaking backward compatibility we will start with false as the default; after a couple of releases we will add a warning note and make it the default behavior.
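The behavior being discussed can be sketched with plain Java collections. This is an illustrative model only — the method name `reconcile` and the schema representation are assumptions, not Hudi's actual implementation of `ADD_NULL_FOR_DELETED_COLUMNS`:

```java
import java.util.*;

// Illustrative sketch: when a column present in the table schema is missing
// from an incoming record, either fill it with null (new behavior) or fail
// the stream (current default, preserving backward compatibility).
public class ReconcileDeletedColumns {

    static Map<String, Object> reconcile(List<String> tableColumns,
                                         Map<String, Object> incoming,
                                         boolean addNullForDeletedColumns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String col : tableColumns) {
            if (incoming.containsKey(col)) {
                out.put(col, incoming.get(col));   // column still present upstream
            } else if (addNullForDeletedColumns) {
                out.put(col, null);                // dropped upstream: write null
            } else {
                // current default: fail so existing pipelines are not surprised
                throw new IllegalStateException("column deleted upstream: " + col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("id", "name", "legacy_col");
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("id", 1);
        rec.put("name", "a");

        Map<String, Object> evolved = reconcile(table, rec, true);
        if (!(evolved.containsKey("legacy_col") && evolved.get("legacy_col") == null)) {
            throw new AssertionError("expected null for deleted column");
        }
        boolean failed = false;
        try {
            reconcile(table, rec, false);
        } catch (IllegalStateException e) {
            failed = true;                         // old behavior: pipeline fails
        }
        if (!failed) throw new AssertionError("expected failure with flag off");
        System.out.println("OK");
    }
}
```

Flipping the boolean flips between the backward-compatible failure and the new null-filling evolution, which is why the default is staged across releases.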





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1362693693


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -538,6 +538,8 @@ object DataSourceWriteOptions {
 
   val RECONCILE_SCHEMA: ConfigProperty[java.lang.Boolean] = 
HoodieCommonConfig.RECONCILE_SCHEMA
 
+  val ADD_NULL_FOR_DELETED_COLUMNS: ConfigProperty[String] = 
HoodieCommonConfig.ADD_NULL_FOR_DELETED_COLUMNS

Review Comment:
   For now we will keep it turned off so that we don't break backward compatibility. If some OSS users are relying on the stream failing when columns are deleted, then making this evolution the default for deleted columns would break those pipelines. Agreed that defaulting to true is better, but maybe we can make that change after a couple of releases, IMO.






Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

2023-10-17 Thread via GitHub


Jason-liujc commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1766886802

   Can't speak to what the official guidance from Hudi is at the moment (it seems they will roll out the non-blocking concurrent write feature in version 1.0+).
   
   We had to increase `yarn.resourcemanager.am.max-attempts` and `spark.yarn.maxAppAttempts` (the Spark-specific config) to make it retry more, and reorganize our tables to reduce concurrent writes. Any other lock provider wasn't an option for us, since we are running different jobs from different clusters.





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-17 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1362473548


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##
@@ -661,6 +652,35 @@ private Pair>> fetchFromSourc
 return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
   }
 
+  /**
+   * Apply schema reconcile and schema evolution rules(schema on read) and 
generate new target schema provider.
+   *
+   * @param incomingSchema schema of the source data
+   * @param sourceSchemaProvider Source schema provider.
+   * @return the SchemaProvider that can be used as writer schema.
+   */
+  private SchemaProvider getDeducedSchemaProvider(Schema incomingSchema, 
SchemaProvider sourceSchemaProvider) {

Review Comment:
   This function just picks up the latest table schema for writing in case the schema provider is set to a NULL schema. All the evolution is handled in the `getDeducedSchemaProvider` API.






[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6800:
---

Assignee: Ethan Guo

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo reassigned HUDI-6800:
-

Assignee: Y Ethan Guo

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Y Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-17 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo reassigned HUDI-6800:
-

Assignee: (was: Y Ethan Guo)

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-6950:
-
Attachment: dump.jpg

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Assignee: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg, dump.jpg, 
> fix_stages.jpg, oom_stages.jpg
>
>
> Currently, a table with multi-level partitions can easily cause a driver OOM.
> e.g.:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
> id string,
> name string,
> dt bigint,
> day STRING COMMENT 'date partition',
> hour INT COMMENT 'hour partition'
> ) using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
> 'hoodie.datasource.meta.sync.enable' 'false',
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField' = 'dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets' = 512
> )
> PARTITIONED BY (day, hour);
> A query such as select count(1) from hudi_test.tmp_hudi_test_1 where
> day='2023-10-17' would list many FileStatus objects to the driver, and the
> driver would OOM (e.g. for a table with hundreds of billions of records in a
> partition, day='2023-10-17').
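The remedy in this issue — expanding directories one level per round so the driver only ever holds one level of listings instead of every FileStatus in the table — can be sketched with a toy in-memory directory tree. This is illustrative Java under stated assumptions, not the actual Hudi code:

```java
import java.util.*;

// Sketch of level-by-level partition discovery: a directory is a partition
// when it contains the ".hoodie_partition_metadata" marker; otherwise its
// children are queued for the NEXT round, bounding driver memory per level.
public class LevelByLevelListing {
    // toy "filesystem": directory path -> child names
    static Map<String, List<String>> fs = new HashMap<>();

    static List<String> listPartitions(String root) {
        List<String> partitions = new ArrayList<>();
        Deque<String> toList = new ArrayDeque<>();
        toList.add(root);
        while (!toList.isEmpty()) {
            List<String> level = new ArrayList<>(toList); // one level per round
            toList.clear();
            for (String dir : level) {
                List<String> children = fs.getOrDefault(dir, Collections.emptyList());
                if (children.contains(".hoodie_partition_metadata")) {
                    partitions.add(dir);                  // leaf partition found
                } else {
                    for (String c : children) {
                        if (!c.startsWith(".")) {         // skip metadata dirs
                            toList.add(dir + "/" + c);    // descend next round
                        }
                    }
                }
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        fs.put("tbl", Arrays.asList("day=2023-10-17"));
        fs.put("tbl/day=2023-10-17", Arrays.asList("hour=1", "hour=2"));
        fs.put("tbl/day=2023-10-17/hour=1", Arrays.asList(".hoodie_partition_metadata"));
        fs.put("tbl/day=2023-10-17/hour=2", Arrays.asList(".hoodie_partition_metadata"));
        List<String> p = listPartitions("tbl");
        if (p.size() != 2) throw new AssertionError(p.toString());
        System.out.println("OK");
    }
}
```

The key design point is that each round's listings are discarded before the next round starts, so memory scales with the widest level rather than the total file count.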





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766677931

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Apache Flink 1.16 with Apache Hudi [hudi]

2023-10-17 Thread via GitHub


mintesnot77 commented on issue #9777:
URL: https://github.com/apache/hudi/issues/9777#issuecomment-175756

   Hello, I need your help. I am new to Hudi and trying to integrate Apache Hudi with my Hadoop cluster (1 master and 3 slaves, where the master also acts as a slave), but I don't understand how to configure Hudi. I just gave the basepath hdfs:my-master-ip-address/sth following the quick start guide https://hudi.apache.org/docs/quick-start-guide, but I get a lot of errors like the one below:
   
   Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2fbdf32e, 
see the next exception for details.
   at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
   at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
   ... 143 more
   Caused by: ERROR XSDB6: Another instance of Derby may have already booted 
the database /home/hduser_/metastore_db.
   at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
   at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
   at 
org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
 Source)
   at 
org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
 Source)
   at 
org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
Source)
   at 
org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
   at 
org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
   at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source)
   at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
Source)
   at 
org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
   at 
org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
   at org.apache.derby.impl.store.access.RAMAccessManager$5.run(Unknown 
Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
org.apache.derby.impl.store.access.RAMAccessManager.bootServiceModule(Unknown 
Source)
   at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown 
Source)
   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
Source)
   at 
org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
   at 
org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
   at org.apache.derby.impl.db.BasicDatabase$5.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at org.apache.derby.impl.db.BasicDatabase.bootServiceModule(Unknown 
Source)
   at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
   at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
Source)
   at 
org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown 
Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown
 Source)
   at 
org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown
 Source)
   at 
org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown 
Source)
   at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source)
   at org.apache.derby.impl.jdbc.EmbedConnection$4.run(Unknown Source)

Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


xuzifu666 commented on code in PR #9875:
URL: https://github.com/apache/hudi/pull/9875#discussion_r1362328775


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -157,52 +156,66 @@ private List<String> getPartitionPathWithPathPrefixUsingFilterExpression(String
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> {
+  List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path))
-.filter(status -> status.isDirectory() && 
!status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-.map(status -> Pair.of(Option.empty(), Option.of(status.getPath())));
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, relativePartitionPath, 
urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
+  if (!dirToFileListing.isEmpty()) {
+// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
+// and second entry holds optionally a directory path to be processed 
further.
+engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
+List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
+  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
+  if (fileStatus.isDirectory()) {
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
+  return 
Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), 
fileStatus.getPath())), Option.empty());
+} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
+  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));

Review Comment:
>   ok, in a condition where the day=2023-10-13 partition has 2000 
records (1kb per record), driver memory is 4gb, and the sub-partition 'hour' ranges 
from 1 to 24; then the query select count(1) from table where day='2023-10-13' or 
select * from table where day='2023-10-13', driv

Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766659212

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366)
 
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Hudi Job fails fast in concurrent write even with high retries and long wait time [hudi]

2023-10-17 Thread via GitHub


ad1happy2go commented on issue #9728:
URL: https://github.com/apache/hudi/issues/9728#issuecomment-1766658535

   @SamarthRaval @Jason-liujc As discussed, the retry configuration is 
unrelated to the problem you are facing. The only way to handle such scenarios 
at the moment is to handle retries in your application-level code. 





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


xuzifu666 commented on code in PR #9875:
URL: https://github.com/apache/hudi/pull/9875#discussion_r1362323212


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -157,52 +156,66 @@ private List<String> getPartitionPathWithPathPrefixUsingFilterExpression(String
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> {
+  List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path))
-.filter(status -> status.isDirectory() && 
!status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-        .map(status -> Pair.of(Option.empty(), Option.of(status.getPath())));
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, relativePartitionPath, 
urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
+  if (!dirToFileListing.isEmpty()) {
+// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
+// and second entry holds optionally a directory path to be processed 
further.
+engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
+List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
+  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
+  if (fileStatus.isDirectory()) {
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
+  return 
Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), 
fileStatus.getPath())), Option.empty());
+} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
+  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));

Review Comment:
   ok, in a condition where the day='2023-10-13' partition has 2000 records (1 KB per record), driver memory is 4 GB, and the sub-partition 'hour' ranges from 1 to 24; then for a query like select count(1) from table where day='2023-10-13' or select * from table where day='2023-10-13', the driver

Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


wecharyu commented on code in PR #9875:
URL: https://github.com/apache/hudi/pull/9875#discussion_r1362311711


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -157,52 +156,66 @@ private List<String> getPartitionPathWithPathPrefixUsingFilterExpression(String
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> {
+  List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path))
-.filter(status -> status.isDirectory() && 
!status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-        .map(status -> Pair.of(Option.empty(), Option.of(status.getPath())));
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, relativePartitionPath, 
urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
+  if (!dirToFileListing.isEmpty()) {
+// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
+// and second entry holds optionally a directory path to be processed 
further.
+engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
+List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
+  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
+  if (fileStatus.isDirectory()) {
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
+  return 
Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), 
fileStatus.getPath())), Option.empty());
+} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
+  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));

Review Comment:
   @xuzifu666 "Processing listed partitions" will leave the intermediate paths for the 
listStatus call in the next iteration, which is the same as the community version 
now. 
   
   I have tested the query `select count(1) from hudi_test where 
day='2023-10-17'`, which only lis
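The two-stage listing discussed above can be sketched in plain, single-process Java. This is a minimal sketch with hypothetical names (`PARTITION_MARKER` stands in for Hudi's `.hoodie_partition_metadata`); the real implementation distributes both stages through the engine context instead of running them on one thread. Stage 1 lists the children of every path at the current level; stage 2 classifies each listed entry independently, either keeping it as a partition or queueing it for the next level:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class PartitionDiscoverySketch {
    // Hypothetical stand-ins for Hudi's partition marker file and metafolder.
    static final String PARTITION_MARKER = ".partition_metadata";
    static final String METAFOLDER = ".hoodie";

    static List<Path> discoverPartitions(Path basePath) throws IOException {
        List<Path> partitions = new ArrayList<>();
        List<Path> pathsToList = new ArrayList<>(List.of(basePath));
        while (!pathsToList.isEmpty()) {
            // Stage 1 ("Listing all partitions"): list children of every path at this level.
            List<Path> listed = new ArrayList<>();
            for (Path dir : pathsToList) {
                try (Stream<Path> children = Files.list(dir)) {
                    children.forEach(listed::add);
                }
            }
            pathsToList.clear();
            // Stage 2 ("Processing listed partitions"): classify each entry independently.
            for (Path child : listed) {
                if (!Files.isDirectory(child)
                        || child.getFileName().toString().equals(METAFOLDER)) {
                    continue; // plain files and the metafolder are ignored
                }
                if (Files.exists(child.resolve(PARTITION_MARKER))) {
                    partitions.add(child);   // a real partition: keep it
                } else {
                    pathsToList.add(child);  // an intermediate level: recurse next round
                }
            }
        }
        return partitions;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("hudi_sketch");
        Path hour = Files.createDirectories(base.resolve("day=2023-10-13/hour=1"));
        Files.createFile(hour.resolve(PARTITION_MARKER));
        Files.createDirectories(base.resolve(METAFOLDER));
        System.out.println(discoverPartitions(base));
    }
}
```

The design point under discussion is that per-entry work (the `hasPartitionMetadata` probe and the metafolder filter) happens in stage 2 on already-listed entries, rather than piling every `FileStatus` of a huge first-level partition onto a single driver-side list call.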

[jira] [Created] (HUDI-6951) Use spark3 profile to build hudi-aws-bundle jars for release artifacts

2023-10-17 Thread Akira Ajisaka (Jira)
Akira Ajisaka created HUDI-6951:
---

 Summary: Use spark3 profile to build hudi-aws-bundle jars for 
release artifacts
 Key: HUDI-6951
 URL: https://issues.apache.org/jira/browse/HUDI-6951
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Akira Ajisaka


When hudi-aws-bundle.jar and hudi-spark3.3-bundle_2.12.jar are used at the same 
time, and hudi-aws-bundle.jar is loaded first in the Spark runtime, it can 
fail with a NoSuchMethodError:
{noformat}
py4j.protocol.Py4JJavaError: An error occurred while calling ***.
: java.lang.NoSuchMethodError: 
org.apache.hudi.avro.model.HoodieCleanMetadata.getTotalFilesDeleted()I
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557
{noformat}
The problem is that the hudi-aws-bundle jar currently in the Maven Central repo is built 
against the spark2 profile, so Avro 1.8.2 is used to generate source code from the Avro 
schema file. The generated source code looks like
{noformat}
public Integer getTotalFilesDeleted() {
return this.totalFilesDeleted;
}
{noformat}
On the other hand, hudi-spark3.3-bundle_2.12.jar is built with Avro 1.11.1, and 
the generated source code looks like
{noformat}
public int getTotalFilesDeleted() {
return this.totalFilesDeleted;
}
{noformat}
Since Avro 1.9.0, primitive types are used for generated getters/setters 
(AVRO-2069). Therefore, if hudi-aws-bundle is loaded first at runtime, it 
can fail with the above NoSuchMethodError.

Although this can be fixed by changing the classpath loading order or building 
hudi-aws-bundle on your own, is it possible to provide 
hudi-aws-spark3.3-bundle.jar in Maven Central? Or, is it possible to build the 
hudi-aws-bundle jar using the spark3 profile by default, given most AWS customers 
now use Spark 3.x for their runtime?
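The binary incompatibility described above can be reproduced without Avro at all. The two stand-in classes below are hypothetical, mimicking what Avro 1.8.x (boxed type) and 1.11.x (primitive type) generate for the same schema field; reflection shows their getters have different JVM method descriptors, which is exactly why code compiled against one fails to link against the other:

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for Avro 1.8.x generated code: boxed return type,
// descriptor ()Ljava/lang/Integer;
class CleanMetadataAvro18 {
    private Integer totalFilesDeleted = 0;
    public Integer getTotalFilesDeleted() { return totalFilesDeleted; }
}

// Hypothetical stand-in for Avro 1.11.x generated code (post AVRO-2069):
// primitive return type, descriptor ()I
class CleanMetadataAvro111 {
    private int totalFilesDeleted = 0;
    public int getTotalFilesDeleted() { return totalFilesDeleted; }
}

public class GetterDescriptorDemo {
    // Return the JVM-level return type of the generated getter.
    public static Class<?> getterReturnType(Class<?> cls) {
        try {
            Method m = cls.getMethod("getTotalFilesDeleted");
            return m.getReturnType();
        } catch (NoSuchMethodException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // Different return types mean different method descriptors, so a caller
        // compiled against one class fails with NoSuchMethodError on the other.
        System.out.println(getterReturnType(CleanMetadataAvro18.class));   // class java.lang.Integer
        System.out.println(getterReturnType(CleanMetadataAvro111.class));  // int
    }
}
```

This is why the error surfaces only at link time in the Spark runtime: both jars compile fine on their own, and the loser is decided by classpath order.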



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT]Loss record when complete compaction [hudi]

2023-10-17 Thread via GitHub


ad1happy2go commented on issue #9869:
URL: https://github.com/apache/hudi/issues/9869#issuecomment-1766432963

   @15663671003 Can you please explain in more detail? 
   
   Can you try setting spark.sql.filesourceTableRelationCacheSize to 0 to rule out 
any possibility of a cached relation?





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766431378

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365)
 
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366)
 
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766357395

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364)
 
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365)
 
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366)
 
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1766356922

   
   ## CI report:
   
   * 149dfda8469d598e3098c418ce1e7bf99a4a177f UNKNOWN
   * 66ea14a95621e003cbf81773c78f0ad2147bbbf6 UNKNOWN
   * 918ff90b4bc079e5053fcc8a3b3f0d472d30ca1e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20363)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Assigned] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy reassigned HUDI-6950:


Assignee: xy

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Assignee: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg, 
> oom_stages.jpg
>
>
> currently, a table with multiple levels of partitioning can easily cause an OOM
> e.g.:
> CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} (
> {{id}} string,
> {{name}} string,
> {{dt}} bigint,
> {{day}} STRING COMMENT 'date partition',
> {{hour}} INT COMMENT 'hour partition'
> )using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', 
> 'hoodie.datasource.meta.sync.enable' 'false', 
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY ({{{}day{}}},{{{}hour{}}});
> select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where 
> day='2023-10-17' would list many FileStatus objects to the driver, and the 
> driver would OOM (e.g. a table with hundreds of billions of records in a 
> partition, day='2023-10-17')





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


xuzifu666 commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766255314

   > @wecharyu It is great if you have the review, @xuzifu666 can you supplement 
with more details, especially the spark stages difference.
   
   sure, I have added the stage details in the issue 
https://issues.apache.org/jira/browse/HUDI-6950 @danny0405 





[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-6950:
-
Attachment: oom_stages.jpg

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg, 
> oom_stages.jpg
>
>
> currently, a table with multiple levels of partitioning can easily cause an OOM
> e.g.:
> CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} (
> {{id}} string,
> {{name}} string,
> {{dt}} bigint,
> {{day}} STRING COMMENT 'date partition',
> {{hour}} INT COMMENT 'hour partition'
> )using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', 
> 'hoodie.datasource.meta.sync.enable' 'false', 
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY ({{{}day{}}},{{{}hour{}}});
> select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where 
> day='2023-10-17' would list many FileStatus objects to the driver, and the 
> driver would OOM (e.g. a table with hundreds of billions of records in a 
> partition, day='2023-10-17')





[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-6950:
-
Attachment: fix_stages.jpg

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg, fix_stages.jpg
>
>
> currently, a table with multiple levels of partitioning can easily cause an OOM
> e.g.:
> CREATE TABLE {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} (
> {{id}} string,
> {{name}} string,
> {{dt}} bigint,
> {{day}} STRING COMMENT 'date partition',
> {{hour}} INT COMMENT 'hour partition'
> )using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false', 
> 'hoodie.datasource.meta.sync.enable' 'false', 
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY ({{{}day{}}},{{{}hour{}}});
> select count(1) from {{{}hudi_test{}}}.{{{}tmp_hudi_test_1{}}} where 
> day='2023-10-17' would list many FileStatus objects to the driver, and the 
> driver would OOM (e.g. a table with hundreds of billions of records in a 
> partition, day='2023-10-17')





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766236018

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364)
 
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365)
 
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366)
 
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20367)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766222435

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364)
 
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365)
 
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20366)
 
   * 118b8ea2524be1bdf5c540837a78d83bdee7fa62 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table first partition [hudi]

2023-10-17 Thread via GitHub


danny0405 commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766219389

   @wecharyu It is great if you have the review, @xuzifu666 can you supplement 
with more details, especially the spark stages difference.





Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766157215

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364)
 
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20365)
 
   * 1d8c05e27a6d83320e2eedae074e1aba01146923 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776138#comment-17776138
 ] 

xy commented on HUDI-6950:
--

fix in https://github.com/apache/hudi/pull/9875/files

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Priority: Critical
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg
>
>
> Currently, a query on a table with many partitions can easily cause a driver OOM.
> For example:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
> id string,
> name string,
> dt bigint,
> day STRING COMMENT 'date partition',
> hour INT COMMENT 'hour partition'
> ) USING hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
> 'hoodie.datasource.meta.sync.enable' 'false',
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY (day, hour);
> select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'
> lists a large number of FileStatus objects to the driver, and the driver can
> OOM (for example, a table with hundreds of billions of records in the
> partition day='2023-10-17').



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6950:
-
Labels: pull-request-available  (was: )

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.14.0
>Reporter: xy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: before_fix_dump_filestatus.jpg
>
>
> Currently, a query on a table with many partitions can easily cause a driver OOM.
> For example:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
> id string,
> name string,
> dt bigint,
> day STRING COMMENT 'date partition',
> hour INT COMMENT 'hour partition'
> ) USING hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
> 'hoodie.datasource.meta.sync.enable' 'false',
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY (day, hour);
> select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'
> lists a large number of FileStatus objects to the driver, and the driver can
> OOM (for example, a table with hundreds of billions of records in the
> partition day='2023-10-17').



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6950] Query should process listed partitions avoid driver oom due to large number files in table [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9875:
URL: https://github.com/apache/hudi/pull/9875#issuecomment-1766142135

   
   ## CI report:
   
   * eeb64f5c3c4a8ff572e0637d037cf4b4823db1e0 UNKNOWN
   * b783334d03f247b2e57ee788e3d019b14abf2b66 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20364)
 
   * e0f5b1e46f816ecd22e182ea58ba6454fb478cdf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6950) query should process listed partitions avoid driver oom due to large number files in table

2023-10-17 Thread xy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-6950:
-
Description: 
Currently, a query on a table with many partitions can easily cause a driver OOM.
For example:
CREATE TABLE hudi_test.tmp_hudi_test_1 (
id string,
name string,
dt bigint,
day STRING COMMENT 'date partition',
hour INT COMMENT 'hour partition'
) USING hudi
OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
'hoodie.datasource.meta.sync.enable' 'false',
'hoodie.datasource.hive_sync.enable' 'false')
tblproperties (
'primaryKey' = 'id',
'type' = 'mor',
'preCombineField'='dt',
'hoodie.index.type' = 'BUCKET',
'hoodie.bucket.index.hash.field' = 'id',
'hoodie.bucket.index.num.buckets'=512
)
PARTITIONED BY (day, hour);

select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'
lists a large number of FileStatus objects to the driver, and the driver can OOM
(for example, a table with hundreds of billions of records in the partition
day='2023-10-17').
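
A minimal sketch of the idea behind the fix (the class and method names here are illustrative, not the actual Hudi code path): apply the query's partition predicate before listing files, so the driver only materializes file statuses for the partitions the query actually touches instead of the whole table:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PartitionPruningSketch {

    // Prune partitions with the query predicate BEFORE any file listing,
    // instead of listing every partition's files and filtering afterwards.
    static List<String> prunedListing(List<String> allPartitions,
                                      Predicate<String> partitionFilter) {
        return allPartitions.stream()
                .filter(partitionFilter)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList(
                "day=2023-10-16/hour=23",
                "day=2023-10-17/hour=0",
                "day=2023-10-17/hour=1");
        // Predicate derived from: WHERE day='2023-10-17'
        List<String> pruned = prunedListing(partitions,
                p -> p.startsWith("day=2023-10-17"));
        System.out.println(pruned);  // only the two matching partitions remain
    }
}
```

With pruning in place, the driver's memory footprint scales with the size of the queried partitions rather than the whole table.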

> query should process listed partitions avoid driver oom due to large number 
> files in table
> --
>
> Key: HUDI-6950
> URL: https://issues.apache.org/jira/browse/HUDI-6950
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xy
>Priority: Critical
> Attachments: before_fix_dump_filestatus.jpg
>
>
> Currently, a query on a table with many partitions can easily cause a driver OOM.
> For example:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
> id string,
> name string,
> dt bigint,
> day STRING COMMENT 'date partition',
> hour INT COMMENT 'hour partition'
> ) USING hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
> 'hoodie.datasource.meta.sync.enable' 'false',
> 'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
> 'primaryKey' = 'id',
> 'type' = 'mor',
> 'preCombineField'='dt',
> 'hoodie.index.type' = 'BUCKET',
> 'hoodie.bucket.index.hash.field' = 'id',
> 'hoodie.bucket.index.num.buckets'=512
> )
> PARTITIONED BY (day, hour);
> select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'
> lists a large number of FileStatus objects to the driver, and the driver can
> OOM (for example, a table with hundreds of billions of records in the
> partition day='2023-10-17').



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

