Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859065053

   
   ## CI report:
   
   * 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
   * a6e71f3ef494421c08b895bf9bf0aa9c0395e351 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21546)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859045969

   
   ## CI report:
   
   * a4a30405db038272466adea3ee51c2923fcd944e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21547)
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859037963

   
   ## CI report:
   
   * 53875b1c03bc5d2c761e57c998a3fb1c2e038b01 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21548)
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859037954

   
   ## CI report:
   
   * dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
   * a4a30405db038272466adea3ee51c2923fcd944e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21547)
   
   



Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859037918

   
   ## CI report:
   
   * 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
   * c79d1211484ff1d8751cee0a01e60ca2b8cf84bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21453)
   * a6e71f3ef494421c08b895bf9bf0aa9c0395e351 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21546)
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859032793

   
   ## CI report:
   
   * 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)
   * 53875b1c03bc5d2c761e57c998a3fb1c2e038b01 UNKNOWN
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859032784

   
   ## CI report:
   
   * dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
   * a4a30405db038272466adea3ee51c2923fcd944e UNKNOWN
   
   



Re: [PR] [HUDI-5973] Fixing refreshing of schemas in HoodieStreamer continuous mode [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10261:
URL: https://github.com/apache/hudi/pull/10261#issuecomment-1859032737

   
   ## CI report:
   
   * 21b7f2c89745631fc854f11ca201080762728105 UNKNOWN
   * c79d1211484ff1d8751cee0a01e60ca2b8cf84bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21453)
   * a6e71f3ef494421c08b895bf9bf0aa9c0395e351 UNKNOWN
   
   



Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859029231

   
   ## CI report:
   
   * cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
   * 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21544)
   
   



(hudi) branch master updated: [HUDI-7183] Fix static insert overwrite partitions issue (#10254)

2023-12-16 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 27ac42818bc [HUDI-7183] Fix static insert overwrite partitions issue (#10254)
27ac42818bc is described below

commit 27ac42818bcf3768c2a6742ac0edfcb79c253e52
Author: Wechar Yu 
AuthorDate: Sun Dec 17 11:32:30 2023 +0800

[HUDI-7183] Fix static insert overwrite partitions issue (#10254)
---
 .../SparkInsertOverwriteCommitActionExecutor.java  | 17 ++--
 ...setBulkInsertOverwriteCommitActionExecutor.java | 18 ++--
 .../sql/catalyst/catalog/HoodieCatalogTable.scala  |  7 +-
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  | 83 ++
 .../command/InsertIntoHoodieTableCommand.scala | 32 +--
 .../apache/spark/sql/hudi/TestInsertTable.scala| 98 ++
 6 files changed, 177 insertions(+), 78 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
index d12efab229d..788e1040783 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkInsertOverwriteCommitActionExecutor.java
@@ -36,7 +36,7 @@ import org.apache.hudi.table.action.HoodieWriteMetadata;
 
 import org.apache.spark.Partitioner;
 
-import java.util.Collections;
+import java.util.Arrays;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
@@ -81,14 +81,15 @@ public class SparkInsertOverwriteCommitActionExecutor
 
   @Override
   protected Map<String, List<String>> getPartitionToReplacedFileIds(HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata) {
-    if (writeMetadata.getWriteStatuses().isEmpty()) {
-      String staticOverwritePartition = config.getStringOrDefault(HoodieInternalConfig.STATIC_OVERWRITE_PARTITION_PATHS);
-      if (StringUtils.isNullOrEmpty(staticOverwritePartition)) {
-        return Collections.emptyMap();
-      } else {
-        return Collections.singletonMap(staticOverwritePartition, getAllExistingFileIds(staticOverwritePartition));
-      }
+    String staticOverwritePartition = config.getStringOrDefault(HoodieInternalConfig.STATIC_OVERWRITE_PARTITION_PATHS);
+    if (StringUtils.nonEmpty(staticOverwritePartition)) {
+      // static insert overwrite partitions
+      List<String> partitionPaths = Arrays.asList(staticOverwritePartition.split(","));
+      context.setJobStatus(this.getClass().getSimpleName(), "Getting ExistingFileIds of matching static partitions");
+      return HoodieJavaPairRDD.getJavaPairRDD(context.parallelize(partitionPaths, partitionPaths.size()).mapToPair(
+          partitionPath -> Pair.of(partitionPath, getAllExistingFileIds(partitionPath)))).collectAsMap();
     } else {
+      // dynamic insert overwrite partitions
       return HoodieJavaPairRDD.getJavaPairRDD(writeMetadata.getWriteStatuses().map(status ->
           status.getStat().getPartitionPath()).distinct().mapToPair(partitionPath ->
           Pair.of(partitionPath, getAllExistingFileIds(partitionPath)))).collectAsMap();
     }
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
index c1fd952b106..67ba2027cbd 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/commit/DatasetBulkInsertOverwriteCommitActionExecutor.java
@@ -26,6 +26,7 @@ import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieInternalConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
@@ -33,7 +34,7 @@ import org.apache.hudi.data.HoodieJavaPairRDD;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
-import java.util.Collections;
+import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
@@ -60,14 +61,15 @@ public class DatasetBulkInsertOverwriteCommitActionExecutor extends BaseDatasetB
 
   @Override
   protected Map<String, List<String>> getPartitionToReplacedFileIds(HoodieData<WriteStatus> writeStatuses) {
-    if (writeStatuses.isEmpty()) {
-      String staticOverwritePartition = writeConfig.
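
For context on the diff above — the two statement shapes this fix distinguishes, as a minimal hypothetical Spark example (table, column, and partition names are illustrative, not from the commit):

```java
import org.apache.spark.sql.SparkSession;

public class InsertOverwriteModes {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("insert-overwrite-sketch")
        .master("local[1]")
        .getOrCreate();

    // Static overwrite: the partition to replace is named in the statement itself,
    // so its existing file IDs must be replaced even if the incoming batch writes no rows.
    spark.sql("INSERT OVERWRITE TABLE hudi_tbl PARTITION (dt = '2023-12-16') "
        + "SELECT id, name FROM staging WHERE dt = '2023-12-16'");

    // Dynamic overwrite: the partitions to replace are discovered from the rows
    // actually written, i.e. from the write statuses.
    spark.sql("INSERT OVERWRITE TABLE hudi_tbl SELECT id, name, dt FROM staging");

    spark.stop();
  }
}
```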

Re: [PR] [HUDI-7183] Fix static insert overwrite partitions issue [hudi]

2023-12-16 Thread via GitHub


bvaradar merged PR #10254:
URL: https://github.com/apache/hudi/pull/10254





Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859021797

   
   ## CI report:
   
   * cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
   * 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21544)
   
   



Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859020883

   
   ## CI report:
   
   * cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
   * 7c3ea778cc509ea71d9b837d4b228bb07abf18b2 UNKNOWN
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859020875

   
   ## CI report:
   
   * dd3845421a5be07fc18939cd23537a23662f7515 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
   
   



Re: [PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10342:
URL: https://github.com/apache/hudi/pull/10342#issuecomment-1859014631

   
   ## CI report:
   
   * cb62ad9bed32bf3acc6f8227e5e824cb73e8f0e4 UNKNOWN
   
   



[jira] [Updated] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync

2023-12-16 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7237:
-
Labels: pull-request-available  (was: )

> Minor Improvements to Schema Handling in Delta Sync
> ---
>
> Key: HUDI-7237
> URL: https://issues.apache.org/jira/browse/HUDI-7237
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> There are two minor items that we have run into running DeltaStreamer in production.
> 1. The number of times the schema is fetched is more than it needs to be and can put unnecessary load on schema providers or increase file system reads.
> 2. SchemaProviders that return null target schemas on empty batches cause null schema values in commits, leading to unexpected issues later.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups [hudi]

2023-12-16 Thread via GitHub


the-other-tim-brown opened a new pull request, #10342:
URL: https://github.com/apache/hudi/pull/10342

   ### Change Logs
   
   - Reduces the number of calls to the schema provider by properly using `orElseGet` in one case and by reusing the existing writer config instead of recomputing it (see the sketch after this list)
   - Reuses the existing hadoop configuration instance instead of creating a new one when possible
   - Handles the case where the schema provider returns null and there is an empty batch
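
   As a concrete illustration of the `orElseGet` point above — a self-contained sketch using `java.util.Optional` (Hudi uses its own `Option` type; the names below are illustrative, not the PR's code):

```java
import java.util.Optional;

public class LazySchemaFetch {
  // Stand-in for the expensive schema-provider call we want to avoid repeating.
  static String fetchSchemaFromProvider() {
    System.out.println("schema provider hit");
    return "fetched-schema";
  }

  public static void main(String[] args) {
    Optional<String> knownSchema = Optional.of("cached-schema");

    // orElse evaluates its argument eagerly: the provider is hit even though a value is present.
    String eager = knownSchema.orElse(fetchSchemaFromProvider());

    // orElseGet takes a Supplier, so the provider is called only when the Optional is empty.
    String lazy = knownSchema.orElseGet(LazySchemaFetch::fetchSchemaFromProvider);

    System.out.println(eager + " / " + lazy);
  }
}
```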
   
   ### Impact
   
   - Reduces the number of calls to schema providers, which can reduce load on the provider or the cost of accessing the file system
   - Fixes an edge case where an empty batch could write a commit with a null value as the schema
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859012003

   
   ## CI report:
   
   * 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)
   
   



[jira] [Created] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync

2023-12-16 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-7237:
---

 Summary: Minor Improvements to Schema Handling in Delta Sync
 Key: HUDI-7237
 URL: https://issues.apache.org/jira/browse/HUDI-7237
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Timothy Brown


There are two minor items that we have run into running DeltaStreamer in production.

1. The number of times the schema is fetched is more than it needs to be and can put unnecessary load on schema providers or increase file system reads.

2. SchemaProviders that return null target schemas on empty batches cause null schema values in commits, leading to unexpected issues later.
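
For item 2, one minimal null-guard sketch (an assumption about the shape of the fix, not the PR's actual code):

```java
import org.apache.avro.Schema;

public class SchemaGuards {
  /**
   * Never let a null target schema from an empty batch flow into commit metadata:
   * fall back to the last known schema, which the caller is assumed to track.
   */
  public static Schema resolveTargetSchema(Schema fromProvider, Schema lastKnown) {
    if (fromProvider != null) {
      return fromProvider;
    }
    if (lastKnown == null) {
      throw new IllegalStateException("No schema available for empty batch");
    }
    return lastKnown;
  }
}
```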

 





Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859003258

   
   ## CI report:
   
   * b319a7c193a79252dd18c5df782667d96c47a64a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541)
   * 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21543)
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1859001927

   
   ## CI report:
   
   * b319a7c193a79252dd18c5df782667d96c47a64a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541)
   * 7fa7aea0f00ac1592556c40f6a72817b973fa5b1 UNKNOWN
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859001909

   
   ## CI report:
   
   * 380da64e8a28ae34f4315f6df34367f7f0c74cdd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21537)
   * dd3845421a5be07fc18939cd23537a23662f7515 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21542)
   
   



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1859000751

   
   ## CI report:
   
   * 380da64e8a28ae34f4315f6df34367f7f0c74cdd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21537)
   * dd3845421a5be07fc18939cd23537a23662f7515 UNKNOWN
   
   



Re: [PR] [HUDI-7236] Allow MIT to change partition path when using global index [hudi]

2023-12-16 Thread via GitHub


nsivabalan commented on code in PR #10337:
URL: https://github.com/apache/hudi/pull/10337#discussion_r1428851434


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -182,6 +182,82 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport
 }
   }
 
+
+  /**
+   * Test MIT with global index.
+   * HUDI-7131
+   */
+  test("Test HUDI-7131") {

Review Comment:
   Let's name the test with some context:
   "Test MergeInto with global index"



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,6 +337,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSupport
 })
   }
 
+  test("Test MergeInto with changing partition") {
+withRecordType()(withTempDir { tmp =>
+  withSQLConf("hoodie.index.type" -> "GLOBAL_SIMPLE") {
+val sourceTable = generateTableName
+val targetTable = generateTableName
+spark.sql(
+  s"""
+ | create table $sourceTable
+ | using parquet
+ | partitioned by (partition)
+ | location '${tmp.getCanonicalPath}/$sourceTable'
+ | as
+ | select
+ | 1 as id,
+ | 2 as version,
+ | 'yes' as mergeCond,
+ | '2023-10-02' as partition
+  """.stripMargin
+)
+spark.sql(s"insert into $sourceTable values(2, 2, 'no', '2023-10-02')")
+spark.sql(s"insert into $sourceTable values(3, 1, 'insert', 
'2023-10-01')")
+
+spark.sql(
+  s"""
+ | create table $targetTable (
+ |  id int,
+ |  version int,
+ |  mergeCond string,
+ |  partition string
+ | ) using hudi
+ | partitioned by (partition)
+ | tblproperties (
+ |'primaryKey' = 'id',
+ |'type' = 'cow'
+ | )
+ | location '${tmp.getCanonicalPath}/$targetTable'
+   """.stripMargin)
+
+spark.sql(s"insert into $targetTable values(1, 1, 'insert', 
'2023-10-01')")
+spark.sql(s"insert into $targetTable values(2, 1, 'insert', 
'2023-10-01')")
+
+spark.sql(
+  s"""
+ | merge into $targetTable t using
+ | (select * from $sourceTable) as s
+ | on t.id=s.id
+ | when matched and s.mergeCond = 'yes' then update set *
+ | when not matched then insert *
+   """.stripMargin)
+checkAnswer(s"select id,version,partition from $targetTable order by id")(
+  Seq(1, 2, "2023-10-02"),
+  Seq(2, 1, "2023-10-01"),

Review Comment:
   why s.mergeCond is required in the match clause? 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -243,6 +247,56 @@ private static <R> HoodieData<HoodieRecord<R>> getExistingRecords(
         .getMergedRecords().iterator());
   }
 
+  /**
+   * getExistingRecords will create records with expression payload so we overwrite the config.
+   * Additionally, we don't want to restore this value because the write will fail later on.
+   * We also need the keygenerator so we can figure out the partition path after expression payload
+   * evaluates the merge.
+   */
+  private static BaseKeyGenerator maybeGetKeygenAndUpdatePayload(HoodieWriteConfig config) {
+    if (config.getPayloadClass().equals("org.apache.spark.sql.hudi.command.payload.ExpressionPayload")) {
+      config.setValue(HoodiePayloadConfig.PAYLOAD_CLASS_NAME.key(), HoodiePayloadConfig.PAYLOAD_CLASS_NAME.defaultValue());
+      try {
+        return (BaseKeyGenerator) HoodieAvroKeyGeneratorFactory.createKeyGenerator(config.getProps());
+      } catch (IOException e) {
+        throw new RuntimeException("KeyGenerator must inherit from BaseKeyGenerator to update a records partition path using spark sql merge into", e);
+      }
+    }
+    return null;
+  }
+
+  /**
+   * Special merge handling for MIT.
+   * We need to wait until after merging before we can add meta fields because
+   * ExpressionPayload does not allow rewriting.
+   */
+  private static <R> Option<HoodieRecord<R>> mergeIncomingWithExistingRecordWithExpressionPayload(
+      HoodieRecord<R> incoming,
+      HoodieRecord<R> existing,
+      Schema writeSchema,
+      Schema existingSchema,
+      Schema writeSchemaWithMetaFields,
+      HoodieWriteConfig config,
+      HoodieRecordMerger recordMerger,
+      BaseKeyGenerator keyGenerator) throws IOException {
+    Option<Pair<HoodieRecord, Schema>> mergeResult = recordMerger.merge(existing, existingSchema,

Review Comment:
   So, we don't need the meta fields for this merge, right?
   But for the merge in L323, we need the meta fields prepended.



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,6 +337,65 @@ cl

Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader (0.x branch) [hudi]

2023-12-16 Thread via GitHub


nsivabalan commented on PR #10339:
URL: https://github.com/apache/hudi/pull/10339#issuecomment-1858857129

   @linliu-code : yes, we do have a fix for master here: https://github.com/apache/hudi/pull/10340





Re: [I] ceshi [hudi]

2023-12-16 Thread via GitHub


zoomake closed issue #10341: ceshi 
URL: https://github.com/apache/hudi/issues/10341





Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1858800572

   > Overall it looks solid to me. Only concern is that this code is very critical and complex, do we have enough tests to ensure the correctness?
   
   We have all the tests for each use case in `TestInputFormat` and `TestIncrementalInputSplits`. They are currently Flink UTs; somehow we need to refactor the tests into `hudi-common` (an engine-agnostic data set validation).





Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428788123


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g. the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends at the latest;
+ *   The max completion time used for fs view file slice version filtering.
+ * 
+ *
+ * Criteria for different query ranges (each row: Query Range -> File Handles Decoding / Instant Filtering Predicate):
+ *
+ *   [earliest, _] -> the latest snapshot files from table metadata / no predicate
+ *
+ *   [earliest, endTime] -> the latest snapshot files from table metadata / '_hoodie_commit_time' in setA, where setA is the set of all instants completed before or on 'endTime'
+ *
+ *   [_, _] -> the latest completed instant metadata / '_hoodie_commit_time' = i_n, where i_n is the latest completed instant
+ *
+ *   [_, endTime] -> i) find the last completed instant i_n before or on 'endTime'; ii) read the latest snapshot from table metadata if i_n is archived, or the commit metadata if it is still active / '_hoodie_commit_time' = i_n
+ *
+ *   [startTime, _] -> i) find the instant set setA of all instants completed after or on 'startTime'; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active / '_hoodie_commit_time' in setA
+ *
+ *   [startTime, endTime] -> i) find the instant set setA of all instants completed in the given time range; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active / '_hoodie_commit_time' in setA
+ *
+ * A range type is required for analyzing the query so that the query range boundary inclusiveness has clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option<String> startTime;
+  private final Option<String> endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,

Review Comment:
   It could be, and that would induce an empty dataset.
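
   To summarize the criteria table above — a hedged sketch of how the six query-range shapes map onto the presence of the bounds (the enum and method names here are illustrative, not the PR's API):

```java
import java.util.Optional;

public class RangeClassifier {
  enum Kind { FULL_HISTORY, UP_TO_END, LATEST_ONLY, AT_END, FROM_START, BOUNDED }

  // One case per row of the javadoc table in IncrementalQueryAnalyzer.
  static Kind classify(Optional<String> start, Optional<String> end, boolean startsFromEarliest) {
    if (startsFromEarliest) {
      return end.isPresent() ? Kind.UP_TO_END : Kind.FULL_HISTORY; // [earliest, endTime] / [earliest, _]
    }
    if (!start.isPresent()) {
      return end.isPresent() ? Kind.AT_END : Kind.LATEST_ONLY;     // [_, endTime] / [_, _]
    }
    return end.isPresent() ? Kind.BOUNDED : Kind.FROM_START;       // [startTime, endTime] / [startTime, _]
  }

  public static void main(String[] args) {
    System.out.println(classify(Optional.empty(), Optional.empty(), false));                 // LATEST_ONLY
    System.out.println(classify(Optional.of("20231216000000000"), Optional.empty(), false)); // FROM_START
  }
}
```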




Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787998


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option<String> getCompletionTime(String startTime) {
    *
    * By default, assumes there is at most 1 day time of duration for an instant to accelerate the queries.
    *
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd     The query range end completion time.
+   * @param rangeType    The range type.
    *
-   * @return The instant time set.
+   * @return The sorted instant time list.
    */
-  public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime) {
+  public List<String> getStartTimes(
+      HoodieTimeline readTimeline,
+      Option<String> rangeStart,
+      Option<String> rangeEnd,
+      InstantRange.RangeType rangeType) {
     // assumes any instant/transaction lasts at most 1 day to optimize the query efficiency.
-    return getStartTimeSet(startCompletionTime, endCompletionTime, s -> HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+    return getStartTimes(readTimeline, rangeStart, rangeEnd, rangeType, s -> HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
   }
 
   /**
    * Queries the instant start time with given completion time range.
    *
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime     The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start time boundary
-   *                              with the minimum completion time {@code startCompletionTime}.
+   * @param rangeStart              The query range start completion time.
+   * @param rangeEnd                The query range end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest start time boundary
+   *                                with the minimum completion time.
    *
-   * @return The instant time set.
+   * @return The sorted instant time list.
    */
-  public Set<String> getStartTimeSet(String startCompletionTime, String endCompletionTime, Function<String, String> earliestStartTimeFunc) {
-    String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  @VisibleForTesting
+  public List<String> getStartTimes(
+      String rangeStart,
+      String rangeEnd,
+      Function<String, String> earliestInstantTimeFunc) {
+    return getStartTimes(metaClient.getCommitsTimeline().filterCompletedInstants(), Option.ofNullable(rangeStart), Option.ofNullable(rangeEnd),
+        InstantRange.RangeType.CLOSE_CLOSE, earliestInstantTimeFunc);
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param readTimeline            The read timeline.
+   * @param rangeStart              The query range start completion time.
+   * @param rangeEnd                The query range end completion time.
+   * @param rangeType               The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest start time boundary
+   *                                with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  public List<String> getStartTimes(

Review Comment:
   The completion time is user specified; the main purpose of this clazz is transferring the completion time range into an instant time range.
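
   To make the one-day heuristic above concrete — a self-contained sketch (the real logic lives in HoodieInstantTimeGenerator.instantTimeMinusMillis; the timestamp pattern here is an assumption):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CompletionTimeShift {
  // Hudi instant times are formatted timestamps; yyyyMMddHHmmss is assumed here.
  private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  // An instant completing at time t is assumed to have started no earlier than
  // t minus one day, so the range start can be shifted back a day to bound the scan.
  static String instantTimeMinusOneDay(String completionTime) {
    return LocalDateTime.parse(completionTime, FMT).minusDays(1).format(FMT);
  }

  public static void main(String[] args) {
    System.out.println(instantTimeMinusOneDay("20231216100000")); // 20231215100000
  }
}
```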






Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787710


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
(quoted context: the same IncrementalQueryAnalyzer license header, imports, and javadoc shown in full in the first review comment above)

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787379


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
(quoted context: the same IncrementalQueryAnalyzer license header, imports, and javadoc shown in full in the first review comment above)

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787494


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
(quoted context: the same IncrementalQueryAnalyzer license header, imports, and javadoc shown in full in the first review comment above)

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787116


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * <p>The analyzer can supply info about the incremental queries including:
+ * <ul>
+ *   <li>The archived instant candidates;</li>
+ *   <li>The active instant candidates;</li>
+ *   <li>The instant filtering predicate, e.g. the instant range;</li>
+ *   <li>Whether the query starts from the earliest;</li>
+ *   <li>Whether the query ends at the latest;</li>
+ *   <li>The max completion time used for fs view file slice version filtering.</li>
+ * </ul>
+ *
+ * <p>Criteria for different query ranges:
+ *
+ * <table>
+ *   <tr>
+ *     <th>Query Range</th>
+ *     <th>File Handles Decoding</th>
+ *     <th>Instant Filtering Predicate</th>
+ *   </tr>
+ *   <tr>
+ *     <td>[earliest, _]</td>
+ *     <td>The latest snapshot files from table metadata</td>
+ *     <td>_</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[earliest, endTime]</td>
+ *     <td>The latest snapshot files from table metadata</td>
+ *     <td>'_hoodie_commit_time' in setA, where setA is the collection of all the instants completed before or on 'endTime'</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[_, _]</td>
+ *     <td>The latest completed instant metadata</td>
+ *     <td>'_hoodie_commit_time' = i_n, where i_n is the latest completed instant</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[_, endTime]</td>
+ *     <td>i) find the last completed instant i_n before or on 'endTime'; ii) read the latest snapshot from table metadata if i_n is archived, or the commit metadata if it is still active</td>
+ *     <td>'_hoodie_commit_time' = i_n</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[startTime, _]</td>
+ *     <td>i) find the instant set setA, the collection of all the instants completed after or on 'startTime'; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active</td>
+ *     <td>'_hoodie_commit_time' in setA</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[startTime, endTime]</td>
+ *     <td>i) find the instant set setA, the collection of all the instants completed in the given time range; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active</td>
+ *     <td>'_hoodie_commit_time' in setA</td>
+ *   </tr>
+ * </table>
+ *
+ * <p>A range type is required for analyzing the query so that the query range boundary inclusiveness has clear semantics.
+ *
+ * <p>IMPORTANT: the reader may optionally choose to fall back to reading the latest snapshot if files are missing when decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option<String> startTime;
+  private final Option<String> endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 
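
   The range-type note in the Javadoc above is easiest to see with a tiny
   sketch: the same (startTime, endTime) pair admits different instants
   depending on boundary inclusiveness. The enum constants below are invented
   for illustration; the real InstantRange.RangeType values are not shown in
   this excerpt.

      public class RangeKindSketch {
        enum RangeKind { OPEN_CLOSED, CLOSED_CLOSED }

        // Hudi instant times are fixed-width timestamps, so lexicographic
        // comparison matches chronological order.
        static boolean inRange(String instant, String start, String end, RangeKind kind) {
          boolean afterStart = (kind == RangeKind.CLOSED_CLOSED)
              ? instant.compareTo(start) >= 0  // start boundary included
              : instant.compareTo(start) > 0;  // start boundary excluded
          return afterStart && instant.compareTo(end) <= 0;  // end inclusive either way
        }
      }

   With start = "001", the instant "001" itself is admitted only under
   CLOSED_CLOSED, which is exactly the ambiguity an explicit range type removes.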

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786843


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -166,6 +169,28 @@ public boolean isInRange(String instant) {
 }
   }
 
+  /**
+   * Composition of multiple instant ranges in disjunctive form.

Review Comment:
   Yeah, `conjunctive` and `disjunctive` are terminology from relational algebra (SQL); `disjunctive` usually means `OR`.
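
   A minimal sketch of what "disjunctive form" means here: an instant is
   accepted when it falls in ANY member range (logical OR), whereas a
   conjunctive composition would require ALL of them (logical AND). The only
   assumption is that each member exposes a boolean check like the
   isInRange(String) in the hunk above.

      import java.util.List;
      import java.util.function.Predicate;

      public class CompositeRangeSketch {
        // Disjunctive composition: OR over the member ranges.
        static boolean inDisjunctiveRange(String instant, List<Predicate<String>> members) {
          return members.stream().anyMatch(m -> m.test(instant));
        }

        // Conjunctive composition, for contrast: AND over the member ranges.
        static boolean inConjunctiveRange(String instant, List<Predicate<String>> members) {
          return members.stream().allMatch(m -> m.test(instant));
        }
      }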



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786517


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * <p>The analyzer can supply info about the incremental queries including:
+ * <ul>
+ *   <li>The archived instant candidates;</li>
+ *   <li>The active instant candidates;</li>
+ *   <li>The instant filtering predicate, e.g. the instant range;</li>
+ *   <li>Whether the query starts from the earliest;</li>
+ *   <li>Whether the query ends at the latest;</li>
+ *   <li>The max completion time used for fs view file slice version filtering.</li>
+ * </ul>
+ *
+ * <p>Criteria for different query ranges:
+ *
+ * <table>
+ *   <tr>
+ *     <th>Query Range</th>
+ *     <th>File Handles Decoding</th>
+ *     <th>Instant Filtering Predicate</th>
+ *   </tr>
+ *   <tr>
+ *     <td>[earliest, _]</td>
+ *     <td>The latest snapshot files from table metadata</td>
+ *     <td>_</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[earliest, endTime]</td>
+ *     <td>The latest snapshot files from table metadata</td>
+ *     <td>'_hoodie_commit_time' in setA, where setA is the collection of all the instants completed before or on 'endTime'</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[_, _]</td>
+ *     <td>The latest completed instant metadata</td>
+ *     <td>'_hoodie_commit_time' = i_n, where i_n is the latest completed instant</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[_, endTime]</td>
+ *     <td>i) find the last completed instant i_n before or on 'endTime'; ii) read the latest snapshot from table metadata if i_n is archived, or the commit metadata if it is still active</td>
+ *     <td>'_hoodie_commit_time' = i_n</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[startTime, _]</td>
+ *     <td>i) find the instant set setA, the collection of all the instants completed after or on 'startTime'; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active</td>
+ *     <td>'_hoodie_commit_time' in setA</td>
+ *   </tr>
+ *   <tr>
+ *     <td>[startTime, endTime]</td>
+ *     <td>i) find the instant set setA, the collection of all the instants completed in the given time range; ii) read the latest snapshot from table metadata if setA has archived instants, or the commit metadata if all the instants are still active</td>
+ *     <td>'_hoodie_commit_time' in setA</td>
+ *   </tr>
+ * </table>
+ *
+ * <p>A range type is required for analyzing the query so that the query range boundary inclusiveness has clear semantics.
+ *
+ * <p>IMPORTANT: the reader may optionally choose to fall back to reading the latest snapshot if files are missing when decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {

Review Comment:
   Kind of, we can rename it to a better one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Spark SQL insert to write hudI table. Metadata sync failed. [hudi]

2023-12-16 Thread via GitHub


AnchorAnim commented on issue #4903:
URL: https://github.com/apache/hudi/issues/4903#issuecomment-1858789575

   May I ask how you ended up solving this problem?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]

2023-12-16 Thread via GitHub


hudi-bot commented on PR #10340:
URL: https://github.com/apache/hudi/pull/10340#issuecomment-1858776223

   
   ## CI report:
   
   * b319a7c193a79252dd18c5df782667d96c47a64a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21541)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org