[GitHub] [hudi] ad1happy2go commented on issue #8253: [SUPPORT]HoodieJavaWriteClientExample Process finished with exit code 137 (interrupted by signal 9: SIGKILL) with jol-core 0.16

2023-07-04 Thread via GitHub


ad1happy2go commented on issue #8253:
URL: https://github.com/apache/hudi/issues/8253#issuecomment-1621063805

   @Mulavar Sorry for the delay on this, but I am able to successfully run the 
HoodieJavaWriteClientExample with this JDK version. It looks to be a 
laptop-specific issue only, so I am closing the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1621056203

   
   ## CI report:
   
   * ef7585ba8d32d772500f31f95f3c04bfcac046e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18326)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9122:
URL: https://github.com/apache/hudi/pull/9122#issuecomment-1621023436

   
   ## CI report:
   
   * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18324)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1621023467

   
   ## CI report:
   
   * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9122:
URL: https://github.com/apache/hudi/pull/9122#discussion_r1252517410


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -144,18 +144,7 @@ public BaseHoodieTableFileIndex(HoodieEngineContext 
engineContext,
 this.engineContext = engineContext;
 this.fileStatusCache = fileStatusCache;
 
-// The `shouldListLazily` variable controls how we initialize the 
TableFileIndex:
-//  - non-lazy/eager listing (shouldListLazily=false):  all partitions and 
file slices will be loaded eagerly during initialization.
-//  - lazy listing (shouldListLazily=true): partitions listing will be 
done lazily with the knowledge from query predicate on partition
-//columns. And file slices fetching only happens for partitions 
satisfying the given filter.
-//
-// In SparkSQL, `shouldListLazily` is controlled by option 
`REFRESH_PARTITION_AND_FILES_IN_INITIALIZATION`.
-// In lazy listing case, if no predicate on partition is provided, all 
partitions will still be loaded.
-if (shouldListLazily) {
-  this.tableMetadata = createMetadataTable(engineContext, metadataConfig, 
basePath);

Review Comment:
   Ignore, it is created in `doRefresh`
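The review exchange above is about deferring creation of `tableMetadata` from the constructor to `doRefresh`. A minimal, hypothetical sketch of that lazy-initialization pattern (illustrative names only, not Hudi's actual API):

```java
import java.util.function.Supplier;

// Sketch of the lazy-initialization pattern under discussion: the metadata
// handle is no longer built in the constructor; doRefresh() owns creation,
// and the first access triggers it on demand.
public class LazyInitSketch {
  // Stand-in for the table metadata handle (hypothetical type).
  static class TableMetadata {
    final String basePath;
    TableMetadata(String basePath) { this.basePath = basePath; }
  }

  private final Supplier<TableMetadata> factory;
  private TableMetadata tableMetadata; // created lazily, not in the constructor
  int creations = 0;

  LazyInitSketch(String basePath) {
    this.factory = () -> { creations++; return new TableMetadata(basePath); };
  }

  // doRefresh() owns the creation, mirroring the reviewer's note that the
  // handle is created in doRefresh rather than at construction time.
  void doRefresh() {
    this.tableMetadata = factory.get();
  }

  TableMetadata getTableMetadata() {
    if (tableMetadata == null) {
      doRefresh(); // lazy path: first access triggers creation
    }
    return tableMetadata;
  }

  public static void main(String[] args) {
    LazyInitSketch index = new LazyInitSketch("/tmp/base");
    System.out.println("created at construction: " + index.creations);
    index.getTableMetadata();
    System.out.println("created after first access: " + index.creations);
  }
}
```

The construction stays cheap; the cost of building the metadata handle is only paid when a refresh (or first access) actually needs it.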






[GitHub] [hudi] danny0405 commented on a diff in pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9122:
URL: https://github.com/apache/hudi/pull/9122#discussion_r1252517112


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -144,18 +144,7 @@ public BaseHoodieTableFileIndex(HoodieEngineContext 
engineContext,
 this.engineContext = engineContext;
 this.fileStatusCache = fileStatusCache;
 
-// The `shouldListLazily` variable controls how we initialize the 
TableFileIndex:
-//  - non-lazy/eager listing (shouldListLazily=false):  all partitions and 
file slices will be loaded eagerly during initialization.
-//  - lazy listing (shouldListLazily=true): partitions listing will be 
done lazily with the knowledge from query predicate on partition
-//columns. And file slices fetching only happens for partitions 
satisfying the given filter.
-//
-// In SparkSQL, `shouldListLazily` is controlled by option 
`REFRESH_PARTITION_AND_FILES_IN_INITIALIZATION`.
-// In lazy listing case, if no predicate on partition is provided, all 
partitions will still be loaded.
-if (shouldListLazily) {
-  this.tableMetadata = createMetadataTable(engineContext, metadataConfig, 
basePath);

Review Comment:
   The initialization of `tableMetadata` is removed?






[jira] [Closed] (HUDI-6476) Improve the performance of getAllPartitionPaths

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6476.

Resolution: Fixed

Fixed via master branch: 72f047715fe8f2ad9ff19a31728fbfb761fbe0d9

> Improve the performance of getAllPartitionPaths
> ---
>
> Key: HUDI-6476
> URL: https://issues.apache.org/jira/browse/HUDI-6476
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: hudi-utilities
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: After improvement.png, Before improvement.png
>
>
> Currently Hudi lists the status of all files in the Hudi table directory, which 
> can be avoided to improve the performance of getAllPartitionPaths, especially 
> for non-partitioned tables with many files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6476) Improve the performance of getAllPartitionPaths

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6476:
-
Fix Version/s: 0.14.0

> Improve the performance of getAllPartitionPaths
> ---
>
> Key: HUDI-6476
> URL: https://issues.apache.org/jira/browse/HUDI-6476
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: hudi-utilities
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: After improvement.png, Before improvement.png
>
>
> Currently Hudi lists the status of all files in the Hudi table directory, which 
> can be avoided to improve the performance of getAllPartitionPaths, especially 
> for non-partitioned tables with many files.





[hudi] branch master updated: [HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)

2023-07-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 72f047715fe [HUDI-6476] Improve the performance of 
getAllPartitionPaths (#9121)
72f047715fe is described below

commit 72f047715fe8f2ad9ff19a31728fbfb761fbe0d9
Author: Wechar Yu 
AuthorDate: Wed Jul 5 12:14:24 2023 +0800

[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)

Currently Hudi lists the status of all files in the Hudi table directory, which 
can be avoided to improve the performance of #getAllPartitionPaths, especially 
for non-partitioned tables with many files. What we change in this patch:

* reduce a stage in getPartitionPathWithPathPrefix()
* only check directories to find the PartitionMetadata
* avoid listStatus of .hoodie/.hoodie_partition_metadata
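The listing strategy above can be sketched as a breadth-first traversal that stops descending as soon as a directory carries a partition-metadata marker, so files inside partitions are never listed. This is a minimal standalone sketch using `java.nio.file` (a stand-in, not Hudi's `FileSystemBackedTableMetadata`):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of partition discovery that avoids listing file statuses: a directory
// holding the partition-metadata marker is recorded as a partition and not
// descended into; the .hoodie metafolder is skipped entirely.
public class PartitionDiscoverySketch {
  static final String MARKER = ".hoodie_partition_metadata";
  static final String METAFOLDER = ".hoodie";

  static List<String> discoverPartitions(Path basePath) throws IOException {
    List<String> partitions = new ArrayList<>();
    Deque<Path> queue = new ArrayDeque<>();
    queue.add(basePath);
    while (!queue.isEmpty()) {
      Path dir = queue.poll();
      if (Files.exists(dir.resolve(MARKER))) {
        // Directory is a partition: record its relative path, do not descend.
        partitions.add(basePath.relativize(dir).toString());
        continue;
      }
      try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
        for (Path child : children) {
          // Only enqueue sub-directories, skipping the .hoodie metafolder.
          if (Files.isDirectory(child)
              && !child.getFileName().toString().equals(METAFOLDER)) {
            queue.add(child);
          }
        }
      }
    }
    Collections.sort(partitions);
    return partitions;
  }

  public static void main(String[] args) throws IOException {
    // Toy layout: base/.hoodie plus two partitions dt=2021-12-09/hh={10,11}.
    Path base = Files.createTempDirectory("hudi-sketch");
    Files.createDirectories(base.resolve(METAFOLDER));
    for (String hh : new String[]{"10", "11"}) {
      Path p = base.resolve("dt=2021-12-09").resolve("hh=" + hh);
      Files.createDirectories(p);
      Files.createFile(p.resolve(MARKER));
    }
    System.out.println(discoverPartitions(base));
  }
}
```

The win is the same as in the patch: for a partition directory containing many files, the marker check replaces a full `listStatus` of the directory contents.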
---
 .../metadata/FileSystemBackedTableMetadata.java| 52 +-
 1 file changed, 22 insertions(+), 30 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
index 69c237d6684..6a6f46a65ef 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
@@ -47,6 +47,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.concurrent.CopyOnWriteArrayList;
 import java.util.stream.Collectors;
+import java.util.stream.Stream;
 
 /**
  * Implementation of {@link HoodieTableMetadata} based file-system-backed 
table metadata.
@@ -106,42 +107,33 @@ public class FileSystemBackedTableMetadata implements 
HoodieTableMetadata {
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel
+  // List all directories in parallel:
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
+  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
+  // and second entry holds optionally a directory path to be processed 
further.
+  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-return Arrays.stream(fileSystem.listStatus(path));
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
+  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), path)), Option.empty()));
+}
+return Arrays.stream(fileSystem.listStatus(path, p -> {
+  try {
+return fileSystem.isDirectory(p) && 
!p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME);
+  } catch (IOException e) {
+// noop
+  }
+  return false;
+})).map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
   }, listingParallelism);
   pathsToList.clear();
 
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
-  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
-  if (!dirToFileListing.isEmpty()) {
-// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
-// and second entry holds optionally a directory path to be processed 
further.
-engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
-List, Option>> result = 
engineContext.map(dirToFileListing, fileStatus -> {
-  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
-  if (fileStatus.isDirectory()) {
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
-  return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), fileStatus.getPath())), Option.empty());
-} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
-  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
-}
-  } else if 

[GitHub] [hudi] danny0405 merged pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


danny0405 merged PR #9121:
URL: https://github.com/apache/hudi/pull/9121





[GitHub] [hudi] danny0405 commented on a diff in pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9121:
URL: https://github.com/apache/hudi/pull/9121#discussion_r1252515100


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -106,42 +107,33 @@ private List 
getPartitionPathWithPathPrefix(String relativePathPrefix) t
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel
+  // List all directories in parallel:
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
+  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
+  // and second entry holds optionally a directory path to be processed 
further.
+  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-return Arrays.stream(fileSystem.listStatus(path));
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
+  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), path)), Option.empty()));
+}
+return Arrays.stream(fileSystem.listStatus(path, p -> {
+  try {
+return fileSystem.isDirectory(p) && 
!p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME);
+  } catch (IOException e) {
+// noop
+  }
+  return false;
+})).map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
   }, listingParallelism);
   pathsToList.clear();
 
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
-  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
-  if (!dirToFileListing.isEmpty()) {
-// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
-// and second entry holds optionally a directory path to be processed 
further.
-engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
-List, Option>> result = 
engineContext.map(dirToFileListing, fileStatus -> {
-  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
-  if (fileStatus.isDirectory()) {
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
-  return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), fileStatus.getPath())), Option.empty());
-} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
-  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
-}
-  } else if 
(fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
 {
-String partitionName = FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), fileStatus.getPath().getParent());
-return Pair.of(Option.of(partitionName), Option.empty());
-  }
-  return Pair.of(Option.empty(), Option.empty());
-}, fileListingParallelism);
-
-partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent()).map(entry -> entry.getKey().get())
-.collect(Collectors.toList()));
+  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent()).map(entry -> entry.getKey().get())
+  .collect(Collectors.toList()));

Review Comment:
   good point, the code looks much simpler!






[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-04 Thread via GitHub


flashJd commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620990951

   I'm confused: why is `insert overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12'` not a dynamic partition write?
   The semantics should be controllable by config.
   We should clarify the concepts of static partition and dynamic partition:
   1) https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite
   Iceberg dynamic and static partition overwrite semantics
   2) https://docs.databricks.com/delta/selective-overwrite.html#language-sql
   Delta Lake dynamic partition overwrite semantics
   3) https://hudi.apache.org/cn/docs/quick-start-guide/#insert-overwrite
   -- insert overwrite partitioned table with dynamic partition
   insert overwrite table hudi_cow_pt_tbl select 10, 'a10', 1100, '2021-12-09', '10';
   
   -- insert overwrite partitioned table with static partition
   insert overwrite hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100;
   
   @nsivabalan @yihua @XuQianJin-Stars @KnightChess 
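The distinction debated in this thread can be shown with a toy model (illustrative only, not Hudi code): dynamic partition overwrite replaces only the partitions touched by the incoming rows, while static partition overwrite replaces exactly the one partition named in the `partition(...)` clause.

```java
import java.util.*;

// Toy model of INSERT OVERWRITE semantics: a "table" is a map from partition
// value to its rows.
public class OverwriteSemanticsSketch {

  // Dynamic partition overwrite: only partitions present in the incoming data
  // are replaced; all other partitions are left untouched.
  static Map<String, List<String>> dynamicOverwrite(Map<String, List<String>> table,
                                                    Map<String, List<String>> incoming) {
    Map<String, List<String>> out = new HashMap<>(table);
    out.putAll(incoming);
    return out;
  }

  // Static partition overwrite: the user names one partition explicitly
  // (e.g. partition(dt='2021-12-09', hh='12')) and only that partition is replaced.
  static Map<String, List<String>> staticOverwrite(Map<String, List<String>> table,
                                                   String partition,
                                                   List<String> rows) {
    Map<String, List<String>> out = new HashMap<>(table);
    out.put(partition, rows);
    return out;
  }

  public static void main(String[] args) {
    Map<String, List<String>> table = new HashMap<>();
    table.put("hh=10", List.of("a10"));
    table.put("hh=11", List.of("a11"));

    // Dynamic: incoming rows only touch hh=10, so hh=11 survives.
    Map<String, List<String>> incoming = Map.of("hh=10", List.of("a13"));
    System.out.println(dynamicOverwrite(table, incoming));

    // Static: only the named partition hh=12 is (re)written.
    System.out.println(staticOverwrite(table, "hh=12", List.of("a13")));
  }
}
```

Under either semantic, partitions not addressed by the write survive; the config question raised above is which semantic a bare `insert overwrite ... select ...` should pick by default.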





[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620980724

   
   ## CI report:
   
   * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
 
   * 48608a9eafa20f9fde6d414a4b4de50a2bcf6050 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18329)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620976354

   
   ## CI report:
   
   * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
 
   * 48608a9eafa20f9fde6d414a4b4de50a2bcf6050 UNKNOWN
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9106:
URL: https://github.com/apache/hudi/pull/9106#discussion_r1252501322


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -91,8 +108,9 @@ public static HoodieWriteConfig createMetadataWriteConfig(
 .withCleanConfig(HoodieCleanConfig.newBuilder()
 .withAsyncClean(DEFAULT_METADATA_ASYNC_CLEAN)
 .withAutoClean(false)
-.withCleanerParallelism(parallelism)
-.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+.withCleanerParallelism(defaultParallelism)
+.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS)
+.retainFileVersions(2)

Review Comment:
   Even if Uber has been running this for 6+ months, it does not mean the config 
works well for OSS, because while migrating the Uber patches, many fixes and 
other nuances were introduced. I would suggest we move this change to the next 
release to keep the existing MDT workflow stable.






[GitHub] [hudi] danny0405 commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #8837:
URL: https://github.com/apache/hudi/pull/8837#discussion_r1252500090


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -851,26 +919,49 @@ public void update(HoodieRestoreMetadata restoreMetadata, 
String instantTime) {
*/
   @Override
   public void update(HoodieRollbackMetadata rollbackMetadata, String 
instantTime) {
-if (enabled && metadata != null) {
-  // Is this rollback of an instant that has been synced to the metadata 
table?
-  String rollbackInstant = rollbackMetadata.getCommitsRollback().get(0);
-  boolean wasSynced = 
metadataMetaClient.getActiveTimeline().containsInstant(new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, rollbackInstant));
-  if (!wasSynced) {
-// A compaction may have taken place on metadata table which would 
have included this instant being rolled back.
-// Revisit this logic to relax the compaction fencing : 
https://issues.apache.org/jira/browse/HUDI-2458
-Option latestCompaction = metadata.getLatestCompactionTime();
-if (latestCompaction.isPresent()) {
-  wasSynced = HoodieTimeline.compareTimestamps(rollbackInstant, 
HoodieTimeline.LESSER_THAN_OR_EQUALS, latestCompaction.get());
-}
+// The commit which is being rolled back on the dataset
+final String commitInstantTime = 
rollbackMetadata.getCommitsRollback().get(0);
+// Find the deltacommits since the last compaction
+Option> deltaCommitsInfo =
+
CompactionUtils.getDeltaCommitsSinceLatestCompaction(metadataMetaClient.getActiveTimeline());
+if (!deltaCommitsInfo.isPresent()) {
+  LOG.info(String.format("Ignoring rollback of instant %s at %s since 
there are no deltacommits on MDT", commitInstantTime, instantTime));
+  return;
+}
+
+// This could be a compaction or deltacommit instant (See 
CompactionUtils.getDeltaCommitsSinceLatestCompaction)
+HoodieInstant compactionInstant = deltaCommitsInfo.get().getValue();
+HoodieTimeline deltacommitsSinceCompaction = 
deltaCommitsInfo.get().getKey();
+
+// The deltacommit that will be rolled back
+HoodieInstant deltaCommitInstant = new HoodieInstant(false, 
HoodieTimeline.DELTA_COMMIT_ACTION, commitInstantTime);
+
+// The commit being rolled back should not be older than the latest 
compaction on the MDT. Compaction on MDT only occurs when all actions
+// are completed on the dataset. Hence, this case implies a rollback of 
completed commit which should actually be handled using restore.
+if (compactionInstant.getAction().equals(HoodieTimeline.COMMIT_ACTION)) {
+  final String compactionInstantTime = compactionInstant.getTimestamp();
+  if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(commitInstantTime, 
compactionInstantTime)) {
+throw new HoodieMetadataException(String.format("Commit being rolled 
back %s is older than the latest compaction %s. "
++ "There are %d deltacommits after this compaction: %s", 
commitInstantTime, compactionInstantTime,
+deltacommitsSinceCompaction.countInstants(), 
deltacommitsSinceCompaction.getInstants()));
   }
+}
 
-  Map> records =
-  HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, 
metadataMetaClient.getActiveTimeline(),
-  rollbackMetadata, getRecordsGenerationParams(), instantTime,
-  metadata.getSyncedInstantTime(), wasSynced);
-  commit(instantTime, records, false);
-  closeInternal();
+if (deltaCommitsInfo.get().getKey().containsInstant(deltaCommitInstant)) {
+  LOG.info("Rolling back MDT deltacommit " + commitInstantTime);
+  if (!getWriteClient().rollback(commitInstantTime, instantTime)) {
+throw new HoodieMetadataException("Failed to rollback deltacommit at " 
+ commitInstantTime);
+  }
+} else {
+  LOG.info(String.format("Ignoring rollback of instant %s at %s since 
there are no corresponding deltacommits on MDT",
+  commitInstantTime, instantTime));
 }
+
+// Rollback of MOR table may end up adding a new log file. So we need to 
check for added files and add them to MDT
+processAndCommit(instantTime, () -> 
HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, 
metadataMetaClient.getActiveTimeline(),
+rollbackMetadata, getRecordsGenerationParams(), instantTime,
+metadata.getSyncedInstantTime(), true), false);

Review Comment:
   Discussed offline: we need to track the inflight log files for cleaning 
anyway, but we have no good way to fix that currently; it needs to be thought 
through ~
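The compaction-fencing guard in the quoted hunk above relies on Hudi instant times being ordered lexicographically, so a plain string comparison decides whether the commit being rolled back predates the latest MDT compaction. A hedged standalone sketch (hypothetical helper, not Hudi's actual API):

```java
// Sketch of the guard: reject a rollback whose commit instant is at or before
// the latest compaction instant on the metadata table, since such a case
// should be handled with a restore instead.
public class RollbackGuardSketch {

  static void validateRollback(String commitInstantTime, String latestCompactionTime) {
    // Mirrors HoodieTimeline.LESSER_THAN_OR_EQUALS: instant timestamps
    // (e.g. yyyyMMddHHmmss) compare correctly as plain strings.
    if (commitInstantTime.compareTo(latestCompactionTime) <= 0) {
      throw new IllegalStateException(String.format(
          "Commit being rolled back %s is older than the latest compaction %s; "
              + "this should be handled with a restore instead.",
          commitInstantTime, latestCompactionTime));
    }
  }

  public static void main(String[] args) {
    // A commit after the latest compaction may be rolled back...
    validateRollback("20230704120500", "20230704120000");
    // ...while a commit at or before it is rejected.
    try {
      validateRollback("20230704115900", "20230704120000");
      throw new AssertionError("expected rejection");
    } catch (IllegalStateException expected) {
      System.out.println("rejected: " + expected.getMessage());
    }
  }
}
```

This is only the fencing check; the open question in the comment above (tracking inflight log files added by MOR rollbacks) is not addressed by it.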




[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620970058

   
   ## CI report:
   
   * 984c3d691c3e7915fb1333ee823a641098774270 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
 
   * ef7585ba8d32d772500f31f95f3c04bfcac046e7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18326)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on issue #9119: [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13

2023-07-04 Thread via GitHub


danny0405 commented on issue #9119:
URL: https://github.com/apache/hudi/issues/9119#issuecomment-1620968741

   Sorry for the instability; we will be more conservative about code review 
and merging in the future.





[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252497920


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -112,6 +113,36 @@ trait ProvidesHoodieConfig extends Logging {
 }
   }
 
+  private def deducePayloadClassNameLegacy(operation: String, tableType: 
String, insertMode: InsertMode): String = {
+if (operation == UPSERT_OPERATION_OPT_VAL &&
+  tableType == COW_TABLE_TYPE_OPT_VAL && insertMode == InsertMode.STRICT) {
+  // Validate duplicate key for COW, for MOR it will do the merge with the 
DefaultHoodieRecordPayload
+  // on reading.
+  // TODO use HoodieSparkValidateDuplicateKeyRecordMerger when 
SparkRecordMerger is default
+  classOf[ValidateDuplicateKeyPayload].getCanonicalName
+} else if (operation == INSERT_OPERATION_OPT_VAL && tableType == 
COW_TABLE_TYPE_OPT_VAL &&
+  insertMode == InsertMode.STRICT){
+  // Validate duplicate key for inserts to COW table when using strict 
insert mode.
+  classOf[ValidateDuplicateKeyPayload].getCanonicalName
+} else {
+  classOf[OverwriteWithLatestAvroPayload].getCanonicalName
+}

Review Comment:
   By default, should we use `DefaultHoodieRecordPayload` instead ?






[GitHub] [hudi] danny0405 commented on a diff in pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


danny0405 commented on code in PR #9123:
URL: https://github.com/apache/hudi/pull/9123#discussion_r1252496143


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1094,6 +1094,11 @@ object HoodieSparkSqlWriter {
 if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
   mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, 
mergedParams(PRECOMBINE_FIELD.key()))
 }
+if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && 
mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
+  && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get 
!= FAIL_INSERT_DUP_POLICY) {
+  // enable merge allow duplicates when operation type is insert
+  
mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(),
 "true")

Review Comment:
   I feel that, by default, we should never dedup for the INSERT operation. That
keeps the behavior in line with a regular RDBMS.






[jira] [Updated] (HUDI-6475) Optimize TableNotFoundException message

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6475:
-
Fix Version/s: 0.14.0

> Optimize TableNotFoundException message
> ---
>
> Key: HUDI-6475
> URL: https://issues.apache.org/jira/browse/HUDI-6475
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6475) Optimize TableNotFoundException message

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6475.

Resolution: Fixed

Fixed via master branch: 2322ac9d22784df2ccebcbdf898286c16fe0c211

> Optimize TableNotFoundException message
> ---
>
> Key: HUDI-6475
> URL: https://issues.apache.org/jira/browse/HUDI-6475
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[hudi] branch master updated: [HUDI-6475] Optimize TableNotFoundException message (#9120)

2023-07-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2322ac9d227 [HUDI-6475] Optimize TableNotFoundException message (#9120)
2322ac9d227 is described below

commit 2322ac9d22784df2ccebcbdf898286c16fe0c211
Author: huangxiaoping <1754789...@qq.com>
AuthorDate: Wed Jul 5 11:18:04 2023 +0800

[HUDI-6475] Optimize TableNotFoundException message (#9120)
---
 .../src/main/java/org/apache/hudi/DataSourceUtils.java| 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
index c9c10fd7c7e..47a45479c09 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
@@ -55,9 +55,11 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
+import java.util.stream.Collectors;
 
 import static 
org.apache.hudi.common.util.CommitUtils.getCheckpointValueAsString;
 
@@ -81,7 +83,7 @@ public class DataSourceUtils {
   }
 }
 
-throw new TableNotFoundException("Unable to find a hudi table for the user 
provided paths.");
+throw new 
TableNotFoundException(Arrays.stream(userProvidedPaths).map(Path::toString).collect(Collectors.joining(",")));
   }
 
   /**



[GitHub] [hudi] danny0405 merged pull request #9120: [HUDI-6475] Optimize TableNotFoundException message

2023-07-04 Thread via GitHub


danny0405 merged PR #9120:
URL: https://github.com/apache/hudi/pull/9120





[GitHub] [hudi] danny0405 commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


danny0405 commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620954248

   Thanks for the contribution. It would be great if we could add details that
help the reviewers get the context more quickly.





[jira] [Closed] (HUDI-6329) Introduce UpdateStrategy for Flink to handle conflict between clustering/resize with update

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6329.

Resolution: Fixed

Fixed via master branch: e8b1ddd708bc2ba99144f92d7533c7200f12509f

> Introduce UpdateStrategy for Flink to handle conflict between 
> clustering/resize with update
> ---
>
> Key: HUDI-6329
> URL: https://issues.apache.org/jira/browse/HUDI-6329
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: flink, index
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-6329) Introduce UpdateStrategy for Flink to handle conflict between clustering/resize with update

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6329:
-
Fix Version/s: 0.14.0

> Introduce UpdateStrategy for Flink to handle conflict between 
> clustering/resize with update
> ---
>
> Key: HUDI-6329
> URL: https://issues.apache.org/jira/browse/HUDI-6329
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: flink, index
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[hudi] branch master updated: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index (#9087)

2023-07-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e8b1ddd708b [HUDI-6329] Adjust the partitioner automatically for flink 
consistent hashing index (#9087)
e8b1ddd708b is described below

commit e8b1ddd708bc2ba99144f92d7533c7200f12509f
Author: Jing Zhang 
AuthorDate: Wed Jul 5 11:09:25 2023 +0800

[HUDI-6329] Adjust the partitioner automatically for flink consistent 
hashing index (#9087)

* partitioner would detect new completed resize plan in #snapshotState
* disable scheduling resize plan for insert write pipelines with consistent 
bucket index
---
 ...sistentHashingBucketClusteringPlanStrategy.java |   4 +-
 .../action/cluster/strategy/UpdateStrategy.java|   4 +-
 .../util/ConsistentHashingUpdateStrategyUtils.java | 107 +++
 ...arkConsistentBucketDuplicateUpdateStrategy.java |  71 +-
 .../apache/hudi/configuration/OptionsResolver.java |  26 +++-
 .../org/apache/hudi/sink/StreamWriteFunction.java  |  34 +++--
 .../sink/bucket/BucketStreamWriteFunction.java |   2 +-
 .../sink/bucket/BucketStreamWriteOperator.java |   5 +-
 .../bucket/ConsistentBucketAssignFunction.java |  30 -
 .../ConsistentBucketStreamWriteFunction.java   |  83 
 .../FlinkConsistentBucketUpdateStrategy.java   | 150 +
 .../java/org/apache/hudi/sink/utils/Pipelines.java |   2 +-
 .../java/org/apache/hudi/util/ClusteringUtil.java  |   5 +-
 .../org/apache/hudi/util/FlinkWriteClients.java|   8 +-
 .../org/apache/hudi/sink/TestWriteMergeOnRead.java |  40 ++
 .../bucket/ITTestConsistentBucketStreamWrite.java  |  23 +++-
 .../utils/BucketStreamWriteFunctionWrapper.java|  18 ++-
 ...ConsistentBucketStreamWriteFunctionWrapper.java |  81 +++
 .../apache/hudi/sink/utils/ScalaCollector.java}|  32 +++--
 .../sink/utils/StreamWriteFunctionWrapper.java |  22 ---
 .../test/java/org/apache/hudi/utils/TestData.java  |   7 +-
 21 files changed, 611 insertions(+), 143 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java
index 59f9fcb81d1..49ab5f181ad 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/BaseConsistentHashingBucketClusteringPlanStrategy.java
@@ -85,7 +85,7 @@ public abstract class 
BaseConsistentHashingBucketClusteringPlanStrategy 
p.getLeft().getPartitionPath().equals(partition));
 if (isPartitionInClustering) {
-  LOG.info("Partition: " + partition + " is already in clustering, skip");
+  LOG.info("Partition {} is already in clustering, skip.", partition);
   return Stream.empty();
 }
 
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java
index 4463f7887bb..1c61db4b572 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/UpdateStrategy.java
@@ -32,8 +32,8 @@ import java.util.Set;
 public abstract class UpdateStrategy implements Serializable {
 
   protected final transient HoodieEngineContext engineContext;
-  protected final HoodieTable table;
-  protected final Set fileGroupsInPendingClustering;
+  protected HoodieTable table;
+  protected Set fileGroupsInPendingClustering;
 
   public UpdateStrategy(HoodieEngineContext engineContext, HoodieTable table, 
Set fileGroupsInPendingClustering) {
 this.engineContext = engineContext;
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java
new file mode 100644
index 000..f8351d2fa93
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/util/ConsistentHashingUpdateStrategyUtils.java
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you 

[GitHub] [hudi] danny0405 merged pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index

2023-07-04 Thread via GitHub


danny0405 merged PR #9087:
URL: https://github.com/apache/hudi/pull/9087





[GitHub] [hudi] danny0405 commented on pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index

2023-07-04 Thread via GitHub


danny0405 commented on PR #9087:
URL: https://github.com/apache/hudi/pull/9087#issuecomment-1620951822

   Tests have passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18287=results





[jira] [Closed] (HUDI-6423) Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6423.

Resolution: Fixed

Fixed via master branch: 07164406c44b4092eee810710a242d092c97bd58

> Incremental cleaning should consider inflight compaction instant
> 
>
> Key: HUDI-6423
> URL: https://issues.apache.org/jira/browse/HUDI-6423
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zhuanshenbsj1
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[hudi] branch master updated: [HUDI-6423] Incremental cleaning should consider inflight compaction instant (#9038)

2023-07-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 07164406c44 [HUDI-6423] Incremental cleaning should consider inflight 
compaction instant  (#9038)
07164406c44 is described below

commit 07164406c44b4092eee810710a242d092c97bd58
Author: zhuanshenbsj1 <34104400+zhuanshenb...@users.noreply.github.com>
AuthorDate: Wed Jul 5 11:05:57 2023 +0800

[HUDI-6423] Incremental cleaning should consider inflight compaction 
instant  (#9038)

* The CleanPlanner#getEarliestCommitToRetain should consider pending
compaction instants. If a pending compaction is missed under incremental
cleaning mode, some files may never be cleaned once the cleaner moves to a
different partition:

  par1   | - par2 ->
dc.1 compaction.2 dc.3 | dc.4

Assume we have 3 delta commits and 1 pending compaction commit on the
timeline. If `EarliestCommitToRetain` was recorded as dc.3, then when dc.4
(or a subsequent instant) triggers cleaning, the cleaner only checks the
timeline from dc.3 onwards, and compaction.2 is skipped forever if no
subsequent mutations are made to partition par1.
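The scenario above can be modeled with a short sketch. This is illustrative Python, not Hudi's actual `CleanPlanner`/`CleanerUtils` API; the `Instant` type, its fields, and the timestamps are assumptions made for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instant:
    timestamp: str   # monotonically increasing, e.g. "001"
    action: str      # "deltacommit" | "compaction"
    completed: bool

def earliest_commit_to_retain(instants, commits_retained):
    """Earliest instant whose data files must be kept by the cleaner."""
    completed = sorted((i for i in instants if i.completed),
                       key=lambda i: i.timestamp)
    # Naive incremental-clean candidate: keep the last N completed commits.
    candidate = completed[-commits_retained:][0].timestamp if completed else None
    # The fix: never retain past the earliest *pending* compaction, otherwise
    # its partition would be skipped by incremental cleaning forever.
    pending = [i.timestamp for i in instants
               if i.action == "compaction" and not i.completed]
    if pending and (candidate is None or min(pending) < candidate):
        candidate = min(pending)
    return candidate

# Timeline from the example: dc.1, pending compaction.2, dc.3, dc.4
timeline = [
    Instant("001", "deltacommit", True),
    Instant("002", "compaction", False),
    Instant("003", "deltacommit", True),
    Instant("004", "deltacommit", True),
]
```

With `commits_retained=1` the naive candidate would be dc.4, but the sketch clamps it back to the pending compaction.2, which is the behavior the fix describes.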

-

Co-authored-by: Danny Chan 
---
 .../action/clean/CleanPlanActionExecutor.java  |   1 +
 .../hudi/table/action/clean/CleanPlanner.java  |   2 +-
 .../java/org/apache/hudi/table/TestCleaner.java| 183 -
 .../table/timeline/HoodieDefaultTimeline.java  |   7 +
 4 files changed, 148 insertions(+), 45 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
index ba7c71b1356..b494df42b49 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
@@ -111,6 +111,7 @@ public class CleanPlanActionExecutor extends 
BaseActionExecutor implements 
Serializable {
*/
   public Option getEarliestCommitToRetain() {
 return CleanerUtils.getEarliestCommitToRetain(
-hoodieTable.getMetaClient().getActiveTimeline().getCommitsTimeline(),
+
hoodieTable.getMetaClient().getActiveTimeline().getCommitsAndCompactionTimeline(),
 config.getCleanerPolicy(),
 config.getCleanerCommitsRetained(),
 Instant.now(),
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
index d1e77613691..17a12dcc7ff 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
@@ -25,6 +25,7 @@ import org.apache.hudi.avro.model.HoodieCleanerPlan;
 import org.apache.hudi.avro.model.HoodieRequestedReplaceMetadata;
 import org.apache.hudi.avro.model.HoodieRollbackMetadata;
 import org.apache.hudi.client.HoodieTimelineArchiver;
+import org.apache.hudi.client.SparkRDDReadClient;
 import org.apache.hudi.client.SparkRDDWriteClient;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
@@ -260,6 +261,97 @@ public class TestCleaner extends HoodieCleanerTestBase {
 }
   }
 
+  /**
+   * Test earliest commit to retain should be earlier than first pending 
compaction in incremental cleaning scenarios.
+   *
+   * @throws IOException
+   */
+  @Test
+  public void testEarliestInstantToRetainForPendingCompaction() throws 
IOException {
+HoodieWriteConfig writeConfig = getConfigBuilder().withPath(basePath)
+.withFileSystemViewConfig(new FileSystemViewStorageConfig.Builder()
+.withEnableBackupForRemoteFileSystemView(false)
+.build())
+.withCleanConfig(HoodieCleanConfig.newBuilder()
+.withAutoClean(false)
+
.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+.retainCommits(1)
+.build())
+.withCompactionConfig(HoodieCompactionConfig.newBuilder()
+.withInlineCompaction(false)
+.withMaxNumDeltaCommitsBeforeCompaction(1)
+.compactionSmallFileSize(1024 * 1024 * 1024)
+.build())
+.withArchivalConfig(HoodieArchivalConfig.newBuilder()
+.withAutoArchive(false)
+.archiveCommitsWith(2,3)
+.build())
+.withEmbeddedTimelineServerEnabled(false).build();
+
+

[GitHub] [hudi] danny0405 merged pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


danny0405 merged PR #9038:
URL: https://github.com/apache/hudi/pull/9038





[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9122:
URL: https://github.com/apache/hudi/pull/9122#issuecomment-1620942995

   
   ## CI report:
   
   * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18324)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1620943023

   
   ## CI report:
   
   * 7708ff75ba467e2156b6396ee2886ec645b7b44f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18325)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620942945

   
   ## CI report:
   
   * 984c3d691c3e7915fb1333ee823a641098774270 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
 
   * ef7585ba8d32d772500f31f95f3c04bfcac046e7 UNKNOWN
   
   





[GitHub] [hudi] Zouxxyy closed pull request #9051: [HUDI-6436] Make the function of AlterHoodieTableChangeColumnCommand …

2023-07-04 Thread via GitHub


Zouxxyy closed pull request #9051: [HUDI-6436] Make the function of 
AlterHoodieTableChangeColumnCommand …
URL: https://github.com/apache/hudi/pull/9051





[GitHub] [hudi] hudi-bot commented on pull request #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9122:
URL: https://github.com/apache/hudi/pull/9122#issuecomment-1620937689

   
   ## CI report:
   
   * 2f44e3cd97dbc108faabcdd5da0d805b1680e211 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9123:
URL: https://github.com/apache/hudi/pull/9123#issuecomment-1620937715

   
   ## CI report:
   
   * 7708ff75ba467e2156b6396ee2886ec645b7b44f UNKNOWN
   
   





[jira] [Created] (HUDI-6479) Update release docs and quick start guide around INSERT_INTO default behavior change

2023-07-04 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6479:
-

 Summary: Update release docs and quick start guide around 
INSERT_INTO default behavior change 
 Key: HUDI-6479
 URL: https://issues.apache.org/jira/browse/HUDI-6479
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: sivabalan narayanan


With [this|https://github.com/apache/hudi/pull/9123] patch, we are also
switching the default behavior of INSERT_INTO to use "insert" as the
underlying operation. Until 0.13.1 the default behavior was "upsert": if you
ingested the same batch of records in commit1 and commit2, Hudi performed an
upsert and a snapshot read returned only the latest value. With this patch the
default changes to "insert", as the name INSERT_INTO signifies, so ingesting
the same batch of records in commit1 and commit2 will result in duplicate
records on snapshot read. If users override the respective configs, we will
honor them, but the default behavior, where none of the respective configs are
overridden explicitly, changes.

 

 

 





[jira] [Updated] (HUDI-6478) Simplify INSERT_INTO configs

2023-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6478:
-
Labels: pull-request-available  (was: )

> Simplify INSERT_INTO configs
> 
>
> Key: HUDI-6478
> URL: https://issues.apache.org/jira/browse/HUDI-6478
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> We have 2 to 3 different configs in the mix for the INSERT_INTO command. 
> Let's try to simplify them.
>  
> hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and 
> datasource.operation.type.
>  
> Rough notes:
>  
> hoodie.sql.bulk.insert.enable: true | false.
>  
> hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
> STRICT: we can't re-ingest the same record again; it will throw if 
> duplicates are ingested again.
> NON_STRICT: no such constraint, but it has to be set along with bulk_insert 
> (if bulk insert is enabled); if not, an exception will be thrown.
> UPSERT: the default insert mode (until a week back, when we switched to make 
> bulk_insert the default for INSERT_INTO); takes care of de-dup and uses 
> OverwriteWithLatestAvroPayload (which means we can update an existing 
> record across batches).
> 
> datasource.operation.type: insert, bulk_insert, upsert
> 
> drop.dups: drop new incoming records if they already exist.
>  
> Proposal:
>  
>  * We will introduce a new config named "hoodie.sql.write.operation" which 
> will have 3 values ("insert", "bulk_insert" and "upsert"). Default value will 
> be "insert" for INSERT_INTO.
>  ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable".
>  * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if operation 
> type is "Insert" for both spark-sql and spark-ds. This will maintain 
> duplicates but still help w/ small file management with "insert"s.
>  * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" 
> whose valid values are "ignore, fail and drop". Make "ignore" as default. 
> "fail" will mimic "STRICT" mode we support as of now. Even spark-ds users can 
> use the fail/STRICT behavior if need be.
>  ** Deprecate hoodie.datasource.insert.drop.dups.
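The three proposed dedup policies ("ignore", "fail", "drop") can be sketched as follows. This is a hypothetical Python model of the semantics described in the proposal, not the actual implementation; the function name and record shape are invented for illustration:

```python
def apply_insert_dedup_policy(existing_keys, incoming, policy="ignore"):
    """Model of the proposed hoodie.datasource.insert.dedupe.policy values:
    "ignore" -> keep incoming duplicates (plain insert, the proposed default),
    "drop"   -> silently drop records whose key already exists,
    "fail"   -> raise, mimicking today's STRICT insert mode."""
    existing = set(existing_keys)
    if policy == "ignore":
        return incoming
    if policy == "drop":
        return [r for r in incoming if r["key"] not in existing]
    if policy == "fail":
        dups = [r["key"] for r in incoming if r["key"] in existing]
        if dups:
            raise ValueError(f"Duplicate keys on insert: {dups}")
        return incoming
    raise ValueError(f"Unknown policy: {policy}")
```

Under this model, re-ingesting an existing key is a no-op change of visibility only for "drop", an error for "fail", and produces duplicates for "ignore".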





[GitHub] [hudi] nsivabalan opened a new pull request, #9123: [HUDI-6478] Simplifying INSERT_INTO configs for spark-sql

2023-07-04 Thread via GitHub


nsivabalan opened a new pull request, #9123:
URL: https://github.com/apache/hudi/pull/9123

   ### Change Logs
   
   With the intent to simplify the different config options for INSERT_INTO in 
spark-sql, we are doing an overhaul. We have 3 to 4 configs around INSERT_INTO, 
such as the operation type, insert mode, drop-duplicates, and 
enable-bulk-insert configs. Here is what the simplification brings in.
   
   ```
   - We will introduce a new config named "hoodie.sql.write.operation" which 
will have 3 values ("insert", "bulk_insert" and "upsert"). Default value will 
be "insert" for INSERT_INTO.
- Deprecate hoodie.sql.insert.mode and 
"hoodie.sql.bulk.insert.enable".
- Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if 
operation type is "Insert" for both spark-sql and spark-ds. This will maintain 
duplicates but still help w/ small file management with "insert"s.
   - Introduce a new config named "hoodie.datasource.insert.dedupe.policy" 
whose valid values are "ignore, fail and drop". Make "ignore" as default. 
"fail" will mimic "STRICT" mode we support as of now. 
- Deprecate hoodie.datasource.insert.drop.dups.
   ```
   
   When both old and new configs are set, the new config takes effect. 
   When only new configs are set, the new config takes effect. 
   When neither is set, the new configs and their defaults take effect. 
   When only old configs are set, the old configs take effect. Please note 
that we are deprecating these old configs and will remove them completely in 
two releases, so we recommend users migrate to the new configs.
   
   Note: old refers to "hoodie.sql.insert.mode" and new config refers to 
"hoodie.sql.write.operation".
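   The precedence rules above can be sketched as a small resolver. This is an illustrative Python sketch; the config keys come from this PR, but the resolution logic and the old-mode mapping are assumptions made for the example:

```python
NEW_KEY = "hoodie.sql.write.operation"   # new config introduced by this PR
OLD_KEY = "hoodie.sql.insert.mode"       # deprecated config

# Assumed mapping from the old insert modes to write operations,
# for illustration only.
OLD_TO_OPERATION = {"upsert": "upsert", "strict": "insert", "non_strict": "insert"}

def resolve_write_operation(opts):
    """Resolve the effective write operation for INSERT INTO."""
    if NEW_KEY in opts:                  # new config wins whenever it is set
        return opts[NEW_KEY]
    if OLD_KEY in opts:                  # old config honored if only it is set
        return OLD_TO_OPERATION[opts[OLD_KEY].lower()]
    return "insert"                      # new default when neither is set
```

   For example, setting both keys resolves to whatever the new key says, matching the "new config will take effect" rule above.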
   
   Behavior change: 
   With this patch, we are also switching the default behavior of INSERT_INTO 
to use "insert" as the underlying operation. Until 0.13.1 the default behavior 
was "upsert": if you ingested the same batch of records in commit1 and 
commit2, Hudi performed an upsert and a snapshot read returned only the latest 
value. With this patch the default changes to "insert", as the name 
INSERT_INTO signifies, so ingesting the same batch of records in commit1 and 
commit2 will result in duplicate records on snapshot read. If users override 
the respective configs, we will honor them, but the default behavior, where 
none of the respective configs are overridden explicitly, changes.
   
   ### Impact
   
   Usability will be improved for spark-sql users, as we have deprecated a few 
confusing configs and aligned with spark datasource writes. This also brings 
in the behavior change described above: the default operation for INSERT_INTO 
switches from "upsert" (the behavior until 0.13.1) to "insert", so ingesting 
the same batch of records in commit1 and commit2 will now result in duplicate 
records on snapshot read instead of only the latest values. If users override 
the respective configs, we honor them; only the default behavior changes.
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   We will have to call out the behavior change as part of our release docs and 
also update our quick start guide around the same. 
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6478) Simplify INSERT_INTO configs

2023-07-04 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6478:
-

 Summary: Simplify INSERT_INTO configs
 Key: HUDI-6478
 URL: https://issues.apache.org/jira/browse/HUDI-6478
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: sivabalan narayanan


We have two to three different configs in the mix for the INSERT_INTO command. 
Let's try to simplify them.
 
hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and 
datasource.operation.type.
 
Rough notes:
 
hoodie.sql.bulk.insert.enable: true | false.
 
hoodie.sql.insert.mode: STRICT| NON_STRICT | UPSERT
STRICT: we can't re-ingest the same record again; will throw if duplicates are 
found in the batch being ingested.
NON_STRICT: no such constraint, but it has to be set along with bulk_insert (if 
that is enabled); if not, an exception will be thrown.
UPSERT: the default insert mode (until a week back, when we switched to make 
bulk_insert the default for INSERT_INTO). Will take care of de-dup and will use 
OverwriteWithLatestAvroPayload (which means that we can update an existing 
record across batches).
 
datasource.operation.type: insert, bulk_insert, upsert
 
drop.dups: Drop new incoming records if they already exist.
 
Proposal:
 
 * We will introduce a new config named "hoodie.sql.write.operation" which will 
have 3 values ("insert", "bulk_insert" and "upsert"). Default value will be 
"insert" for INSERT_INTO.
 ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable".
 * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the 
operation type is "insert" for both spark-sql and spark-ds. This will maintain 
duplicates but still help with small file management for "insert"s.
 * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose 
valid values are "ignore", "fail" and "drop". Make "ignore" the default. "fail" 
will mimic the "STRICT" mode we support as of now. Even spark-ds users can use 
the fail/STRICT behavior if need be.
 ** Deprecate hoodie.datasource.insert.drop.dups.
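
As a sketch, the proposed configs could be exercised like this (the config 
names are proposals from this ticket, not released options):

```sql
-- Proposed write operation config; valid values: insert | bulk_insert | upsert.
SET hoodie.sql.write.operation = insert;
-- Proposed dedupe policy; valid values: ignore | drop | fail ("ignore" default).
SET hoodie.datasource.insert.dedupe.policy = fail;

-- With policy "fail" (mimicking today's STRICT mode), re-ingesting an existing
-- key would throw instead of writing a duplicate.
INSERT INTO hudi_tbl VALUES (1, 'a1', 1000);
```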



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6477) Lazy fetching partition path & file slice when refresh in HoodieFileIndex

2023-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6477:
-
Labels: pull-request-available  (was: )

> Lazy fetching partition path & file slice when refresh in HoodieFileIndex
> -
>
> Key: HUDI-6477
> URL: https://issues.apache.org/jira/browse/HUDI-6477
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zouxxyy
>Priority: Major
>  Labels: pull-request-available
>






[GitHub] [hudi] Zouxxyy opened a new pull request, #9122: [HUDI-6477] Lazy fetching partition path & file slice when refresh in…

2023-07-04 Thread via GitHub


Zouxxyy opened a new pull request, #9122:
URL: https://github.com/apache/hudi/pull/9122

   … HoodieFileIndex
   
   ### Change Logs
   
   Currently there is a lazy listing mechanism in `hoodieFileIndex`, but it only 
takes effect during initialization. We can make it take effect on refresh as 
well. At present, almost all Spark commands in Hudi do a refresh, such as the 
DDL alter table operation, where we don't need to list files at all.
   
   ### Impact
   
   Lazy fetching partition path & file slice when refresh in HoodieFileIndex
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-6477) Lazy fetching partition path & file slice when refresh in HoodieFileIndex

2023-07-04 Thread zouxxyy (Jira)
zouxxyy created HUDI-6477:
-

 Summary: Lazy fetching partition path & file slice when refresh in 
HoodieFileIndex
 Key: HUDI-6477
 URL: https://issues.apache.org/jira/browse/HUDI-6477
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: zouxxyy








[GitHub] [hudi] boneanxs commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-04 Thread via GitHub


boneanxs commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620905284

   You can still use dynamic partition, in this way:
   
   ```sql
   insert overwrite hudi_cow_pt_tbl partition(dt, hh) select 13, 'a13', 1100,  
'2021-12-09', '12'
   ```
   
   The main point is whether we consider `insert overwrite hudi_cow_pt_tbl 
select 13, 'a13', 1100, '2021-12-09', '12'` to be a dynamic partition write. I 
think @leesf 's view makes sense: 
https://github.com/apache/hudi/pull/7365#issuecomment-1343707001
   
   > @nsivabalan hi, here are my two cents: insert overwrite xxx values(xx,xxx) 
has very clear semantics, it means overwrite the entire table, insert overwrite 
xx partition(xx) values(xx,xxx) means insert overwrite partitions, but hudi 
handles overwrite partitions for overwrite table, which is a definite bug and i 
do not think we need to introduce a new operation for it.
   
   Also, this change keeps the behavior consistent with Spark SQL: `insert 
overwrite hudi_cow_pt_tbl select 13, 'a13', 1100, '2021-12-09', '12'` will 
overwrite the whole table.





[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620904971

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 1d32092354e9065499631ed860a09a9c918c088d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18323)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9006:
URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620904718

   
   ## CI report:
   
   * b385ea4a4d4b7986ba27f5df352686652dc53c36 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18322)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-04 Thread via GitHub


flashJd commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620903128

   As we need the capability to insert overwrite the whole partitioned table, 
why not use a config to enable it and keep the semantics forward compatible, 
while not losing the dynamic partition capability?





[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-04 Thread via GitHub


flashJd commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620899951

   > @flashJd I noticed this issue before. Yes, this is a behavior change for 
`INSERT_OVERWRITE` without partition columns after #7365, but I think it's the 
right modification? if users don't specify partition columns, we'll consider it 
wants to overwrite all table?
   > 
   > Spark sql also does the same way. i.e. `insert overwrite table_name 
values( #specify partition values)` will overwrite whole table.
   
   1) The capability to insert overwrite a partitioned table with dynamic 
partitions is lost; we can only use the grammar `insert overwrite 
hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='12') select 13, 'a13', 1100` 
now.
   2) Insert overwrite semantics are not forward compatible.





[GitHub] [hudi] Zouxxyy commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


Zouxxyy commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620893285

   > It's great if we can add a simple test case.
   
   done





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620864476

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18321)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on a diff in pull request #9066: [HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs

2023-07-04 Thread via GitHub


yihua commented on code in PR #9066:
URL: https://github.com/apache/hudi/pull/9066#discussion_r1252417782


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadSnapshotReader.java:
##
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieRecordSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils;
+import org.apache.hudi.io.storage.HoodieFileReader;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.JobConf;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.COMPACTION_LAZY_BLOCK_READ_ENABLED_PROP;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_COMPACTION_LAZY_BLOCK_READ_ENABLED;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_MAX_DFS_STREAM_BUFFER_SIZE;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.DEFAULT_SPILLABLE_MAP_BASE_PATH;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.MAX_DFS_STREAM_BUFFER_SIZE_PROP;
+import static 
org.apache.hudi.hadoop.config.HoodieRealtimeConfig.SPILLABLE_MAP_BASE_PATH_PROP;
+import static 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getBaseFileReader;
+import static 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes;
+import static 
org.apache.hudi.internal.schema.InternalSchema.getEmptyInternalSchema;
+
+public class HoodieMergeOnReadSnapshotReader extends 
AbstractRealtimeRecordReader implements Iterator, AutoCloseable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(HoodieMergeOnReadSnapshotReader.class);
+
+  private final String tableBasePath;
+  private final List logFilePaths;
+  private final String latestInstantTime;
+  private final Schema readerSchema;
+  private final JobConf jobConf;
+  private final HoodieMergedLogRecordScanner logRecordScanner;
+  private final HoodieFileReader baseFileReader;
+  private final Map logRecordsByKey;
+  private final Iterator recordsIterator;
+  private final ExternalSpillableMap mergedRecordsByKey;
+
+  public HoodieMergeOnReadSnapshotReader(String tableBasePath, String 
baseFilePath,
+ List logFilePaths,
+ String latestInstantTime,
+ Schema readerSchema,
+ JobConf jobConf, long start, long 
length, String[] hosts) throws IOException {
+super(getRealtimeSplit(tableBasePath, baseFilePath, logFilePaths, 
latestInstantTime, start, length, hosts), jobConf);
+this.tableBasePath = tableBasePath;
+this.logFilePaths = logFilePaths;
+this.latestInstantTime = latestInstantTime;
+this.readerSchema = readerSchema;
+this.jobConf = jobConf;
+HoodieTimer timer = new HoodieTimer().startTimer();
+this.logRecordScanner = getMergedLogRecordScanner();
+LOG.debug("Time taken to scan log records: {}", 

[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620837762

   
   ## CI report:
   
   * 984c3d691c3e7915fb1333ee823a641098774270 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620837706

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278)
 
   * 1d32092354e9065499631ed860a09a9c918c088d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18323)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1620833117

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278)
 
   * 1d32092354e9065499631ed860a09a9c918c088d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.

2023-07-04 Thread via GitHub


nsivabalan commented on code in PR #9106:
URL: https://github.com/apache/hudi/pull/9106#discussion_r1252401000


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -116,11 +134,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
 // Below config is only used if isLogCompactionEnabled is set.
 
.withLogCompactionBlocksThreshold(writeConfig.getMetadataLogCompactBlocksThreshold())
 .build())
-.withParallelism(parallelism, parallelism)
-.withDeleteParallelism(parallelism)
-.withRollbackParallelism(parallelism)
-.withFinalizeWriteParallelism(parallelism)
-.withAllowMultiWriteOnSameInstant(true)
+
.withStorageConfig(HoodieStorageConfig.newBuilder().hfileMaxFileSize(maxHFileSizeBytes)
+
.logFileMaxSize(maxLogFileSizeBytes).logFileDataBlockMaxSize(maxLogBlockSizeBytes).build())
+.withRollbackParallelism(defaultParallelism)
+.withFinalizeWriteParallelism(defaultParallelism)

Review Comment:
   Did you remove `.withAllowMultiWriteOnSameInstant(true)` intentionally? 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:
##
@@ -91,8 +108,9 @@ public static HoodieWriteConfig createMetadataWriteConfig(
 .withCleanConfig(HoodieCleanConfig.newBuilder()
 .withAsyncClean(DEFAULT_METADATA_ASYNC_CLEAN)
 .withAutoClean(false)
-.withCleanerParallelism(parallelism)
-.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+.withCleanerParallelism(defaultParallelism)
+.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS)
+.retainFileVersions(2)

Review Comment:
   I understand it could be a larger change, but file versions make sense in 
general. If Uber has been running with file versions for 6+ months, we should 
do a round of testing on our end, and can possibly proceed.
But incremental cleaning may not kick in, so for large MDTs, I am 
wondering whether there will be any latency hit.



##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -341,7 +341,11 @@ private void ensurePartitionsLoadedCorrectly(List 
partitionList) {
 long beginTs = System.currentTimeMillis();
 // Not loaded yet
 try {
-  LOG.info("Building file system view for partitions " + partitionSet);
+  if (partitionSet.size() < 100) {
+LOG.info("Building file system view for partitions: " + 
partitionSet);

Review Comment:
   Yes, maybe we should reconsider the frequency of logging here, e.g., log 
every 100 partitions or something. Not sure we will gain much by logging this 
for every partition. 



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -537,7 +538,8 @@ public HoodieTableMetaClient getMetadataMetaClient() {
   }
 
   public Map stats() {
-return metrics.map(m -> m.getStats(true, metadataMetaClient, 
this)).orElse(new HashMap<>());
+Set allMetadataPartitionPaths = 
Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
+return metrics.map(m -> m.getStats(true, metadataMetaClient, this, 
allMetadataPartitionPaths)).orElse(new HashMap<>());

Review Comment:
   `HoodieMetadataMetrics.getStats(boolean detailed, HoodieTableMetaClient 
metaClient, HoodieTableMetadata metadata)` reloads the timeline. 
   Can we move the reload to outside of the caller so that we don't reload for 
every MDT partition's stats?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -176,7 +176,7 @@ private void initMetadataReader() {
 }
 
 try {
-  this.metadata = new HoodieBackedTableMetadata(engineContext, 
dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath());
+  this.metadata = new HoodieBackedTableMetadata(engineContext, 
dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath(), true);

Review Comment:
   The rationale is that the metadata writer itself is short-lived, just for 
committing one instant, so we should be good to enable re-use here? 
   Do we even expect to see any improvement here, since this is meant just for 
one write to MDT? 






[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9006:
URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620806187

   
   ## CI report:
   
   * 775343a4b7c9d72e3476ddee84078883af27f01e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18153)
 
   * b385ea4a4d4b7986ba27f5df352686652dc53c36 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18322)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9006: [HUDI-6404] Implement ParquetToolsExecutionStrategy for clustering

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9006:
URL: https://github.com/apache/hudi/pull/9006#issuecomment-1620801120

   
   ## CI report:
   
   * 775343a4b7c9d72e3476ddee84078883af27f01e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18153)
 
   * b385ea4a4d4b7986ba27f5df352686652dc53c36 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9105:
URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620798243

   
   ## CI report:
   
   * fad064d3590670a75b8f68c5eca91e059d235241 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18317)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620798158

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   * 43c37c8a48763d8fdf71937fab4ccb900b313385 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18315)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9121:
URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620766916

   
   ## CI report:
   
   * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18314)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-04 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1620722497

   
   ## CI report:
   
   * e6568126aab0b098ccaac59e137e902d7a1070c3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18309)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620714045

   
   ## CI report:
   
   * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620665909

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320)
 
   * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18321)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620658010

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319)
 
   * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320)
 
   * cb6fd2a6af75b79129b86a56f02a4566e2fe4e4f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620652006

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319)
 
   * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620622515

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316)
 
   * 035aa770c2fdeb9dcd9e91097f41904d39bca70f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18319)
 
   * 7898678550ef22db9e564d5a4bef2b7845e6b5e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620617639

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * 7f04db759666f31a92888564d16216943674ac5b Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312)
 
   * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316)
 
   * 035aa770c2fdeb9dcd9e91097f41904d39bca70f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9064:
URL: https://github.com/apache/hudi/pull/9064#issuecomment-1620617516

   
   ## CI report:
   
   * b8418b74febf4551c0f79c7ebe71cf24916124e6 UNKNOWN
   * 3e0876320ac294a7da6c81a8b26630ed518606cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18307)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] GallonREX commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-07-04 Thread via GitHub


GallonREX commented on issue #7925:
URL: https://github.com/apache/hudi/issues/7925#issuecomment-1620580148

   This is an automatic reply. Thank you for your email; I have received it and will respond as soon as possible.





[GitHub] [hudi] ad1happy2go commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-07-04 Thread via GitHub


ad1happy2go commented on issue #7925:
URL: https://github.com/apache/hudi/issues/7925#issuecomment-1620579931

   @GallonREX The error you are getting, `Cannot resolve conflicts for 
overlapping writes`, normally occurs when you try to update the same file 
group concurrently. This does not depend on the version: even 0.12 should 
fail if multiple writers try to write to the same file group.
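   For context, this error is raised by Hudi's optimistic concurrency control when two writers' commits touch the same file group. Multi-writer setups are expected to enable that mode together with a lock provider; a hedged illustration follows (property names per Hudi's concurrency control docs; the ZooKeeper host, port, lock key, and base path below are placeholders, not values from this issue):

   ```properties
   # Enable optimistic concurrency control for multi-writer scenarios
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   # Lazy cleaning of failed writes is required with multiple writers
   hoodie.cleaner.policy.failed.writes=LAZY
   # A lock provider coordinates the writers (ZooKeeper shown as one option)
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.url=zk-host
   hoodie.write.lock.zookeeper.port=2181
   hoodie.write.lock.zookeeper.lock_key=my_table
   hoodie.write.lock.zookeeper.base_path=/hudi/locks
   ```

   Note that even with these settings, two writers updating the same file group in overlapping commits will still fail conflict resolution by design; the configs only make the failure detection safe, they do not merge the overlapping writes.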





[GitHub] [hudi] hudi-bot commented on pull request #8796: [HUDI-6129] Support rate limit for Spark streaming source

2023-07-04 Thread via GitHub


hudi-bot commented on PR #8796:
URL: https://github.com/apache/hudi/pull/8796#issuecomment-1620576704

   
   ## CI report:
   
   * 6c568f15e26e072d07cdb5de7e7a39fa2b9fbc6f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18308)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9120: [HUDI-6475] Optimize TableNotFoundException message

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9120:
URL: https://github.com/apache/hudi/pull/9120#issuecomment-1620571410

   
   ## CI report:
   
   * ac6f163af4a9ab33b78a9304b25babc7caa90714 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18306)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9118: [HUDI-2141] Support flink write metrics

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1620520120

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * 6127808e39fcbf9e2acae98666887a455e0e926e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18304)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620477542

   
   ## CI report:
   
   * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298)
 
   * 984c3d691c3e7915fb1333ee823a641098774270 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18318)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620476985

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311)
 
   * 43c37c8a48763d8fdf71937fab4ccb900b313385 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18315)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1620477463

   
   ## CI report:
   
   * 72e9fc345a516c34387ba34d5fde2f8ea631b404 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18303)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9105:
URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620477346

   
   ## CI report:
   
   * fad064d3590670a75b8f68c5eca91e059d235241 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18317)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620477273

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * 7f04db759666f31a92888564d16216943674ac5b Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312)
 
   * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18316)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620467909

   
   ## CI report:
   
   * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298)
 
   * 984c3d691c3e7915fb1333ee823a641098774270 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9105: [HUDI-6459] Add Rollback and multi-writer tests for Record Level Index

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9105:
URL: https://github.com/apache/hudi/pull/9105#issuecomment-1620467764

   
   ## CI report:
   
   * fad064d3590670a75b8f68c5eca91e059d235241 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620467693

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217)
 
   * 7f04db759666f31a92888564d16216943674ac5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312)
 
   * c37cc8fa71f68c1088ac1d06fbe34635776f1e14 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620467397

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310)
 
   * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311)
 
   * 43c37c8a48763d8fdf71937fab4ccb900b313385 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9121:
URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620456769

   
   ## CI report:
   
   * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18314)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620456654

   
   ## CI report:
   
   * 2a046240c1e7c0a18f9b57c0845298ea65b72951 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18269)
 
   * 5b52b7900c734adba70ac16da20bdc23f21b01d0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18313)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620456458

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217)
 
   * 7f04db759666f31a92888564d16216943674ac5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18312)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] BBency commented on issue #9094: Async Clustering failing with errors for MOR table

2023-07-04 Thread via GitHub


BBency commented on issue #9094:
URL: https://github.com/apache/hudi/issues/9094#issuecomment-1620436262

   Approach 1:
   
![image](https://github.com/apache/hudi/assets/118782050/ddd0627a-3909-4237-bbca-89965860ebb0)
   
   Approach 2:
   
![image](https://github.com/apache/hudi/assets/118782050/9cb6bdde-4ba2-4dc7-82dd-1bc674943da1)
   





[GitHub] [hudi] Alowator commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT

2023-07-04 Thread via GitHub


Alowator commented on PR #9112:
URL: https://github.com/apache/hudi/pull/9112#issuecomment-1620428039

   If there are no suggestions or questions, this could be merged.





[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


zhuanshenbsj1 commented on code in PR #9038:
URL: https://github.com/apache/hudi/pull/9038#discussion_r1252072216


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -111,14 +111,15 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) {
 LOG.info("Nothing to clean here. It is already clean");
 return HoodieCleanerPlan.newBuilder().setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()).build();
   }
-  LOG.info("Total Partitions to clean : " + partitionsToClean.size() + ", with policy " + config.getCleanerPolicy());
+  LOG.info("Earliest commit to retain for clean : " + (earliestInstant.isPresent() ? earliestInstant.get().getTimestamp() : "null"));
+  LOG.info("Total partitions to clean : " + partitionsToClean.size() + ", with policy " + config.getCleanerPolicy());
   int cleanerParallelism = Math.min(partitionsToClean.size(), config.getCleanerParallelism());
   LOG.info("Using cleanerParallelism: " + cleanerParallelism);
 
   context.setJobStatus(this.getClass().getSimpleName(), "Generating list of file slices to be cleaned: " + config.getTableName());
 
   Map<String, Pair<Boolean, List<CleanFileInfo>>> cleanOpsWithPartitionMeta = context
-  .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), cleanerParallelism)
+  .map(partitionsToClean, partitionPathToClean -> Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean, earliestInstant)), cleanerParallelism)

Review Comment:
   Before this change, earliestCommitToRetain was calculated twice.
   
   
![image](https://github.com/apache/hudi/assets/34104400/38b9d3bb-53bd-46c6-af9f-ebc40fce1605)
   
   Because the two calculations are not atomic, the partition-level result and 
the outer function's result can be inconsistent, so partition-level cleaning 
may exceed the outer earliestCommitToRetain.
   
   This may cause snapshot reads on the active timeline to return incorrect 
results.
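   A minimal standalone sketch of the race described above (illustrative only, not Hudi code): when each partition planner re-reads the retention boundary, a commit landing between the two reads makes the results diverge, whereas reading once and passing the same instant down cannot diverge.

   ```java
   import java.util.Optional;
   import java.util.concurrent.atomic.AtomicReference;

   public class ComputeOnceSketch {
       // Stands in for a timeline whose head can advance between two reads,
       // e.g. because another writer completes a commit in the meantime.
       static final AtomicReference<String> timelineHead = new AtomicReference<>("001");

       static Optional<String> earliestCommitToRetain() {
           return Optional.of(timelineHead.get());
       }

       public static void main(String[] args) {
           // Racy pattern: the outer plan and a partition planner each recompute.
           Optional<String> outer = earliestCommitToRetain();
           timelineHead.set("002"); // a concurrent commit lands here
           Optional<String> perPartition = earliestCommitToRetain();
           if (outer.equals(perPartition)) {
               throw new AssertionError("expected the two reads to diverge");
           }

           // Patched pattern: read once, hand the same instant to every planner.
           Optional<String> once = earliestCommitToRetain();
           Optional<String> handedDown = once;
           if (!once.equals(handedDown)) {
               throw new AssertionError("a single read cannot diverge");
           }
           System.out.println("outer=" + outer.get() + " perPartition=" + perPartition.get());
       }
   }
   ```

   Running it prints `outer=001 perPartition=002`, making the divergence of the racy pattern visible.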






[GitHub] [hudi] hudi-bot commented on pull request #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9121:
URL: https://github.com/apache/hudi/pull/9121#issuecomment-1620397104

   
   ## CI report:
   
   * 8555b51e9fa8f7ec9096df39d11e81d8b5177015 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620396976

   
   ## CI report:
   
   * 2a046240c1e7c0a18f9b57c0845298ea65b72951 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18269)
 
   * 5b52b7900c734adba70ac16da20bdc23f21b01d0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9095: Test ci

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9095:
URL: https://github.com/apache/hudi/pull/9095#issuecomment-1620396794

   
   ## CI report:
   
   * 99475ffc62972ee49905fca98ea70f2096cfb135 UNKNOWN
   * ec568a0c309690a1b0931249aae1e4aab9eddc9b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18217)
 
   * 7f04db759666f31a92888564d16216943674ac5b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620396492

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310)
 
   * 5b354dd07b4381c270e17001a1010141bf7086e8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18311)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1620380794

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   *  Unknown: [CANCELED](TBD) 
   * 59d464a6e1f7a69ba0d0ab331ad01e3ed66f8e62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18310)
 
   * 5b354dd07b4381c270e17001a1010141bf7086e8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-04 Thread via GitHub


codope commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1620378064

   > Hi @jonvex Can you elaborate a little more why to revert the changes?
   
   @danny0405 This reverts part of #8875, i.e., the behavior change that made 
spark-sql INSERT INTO use bulk insert. With this revert, it will be back to 
upsert. We plan to add some new configs and deprecate the existing SQL insert 
mode config. I've fixed all the test failures; we can land this once the CI is 
green.





[jira] [Updated] (HUDI-6476) Improve the performance of getAllPartitionPaths

2023-07-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6476:
-
Labels: pull-request-available  (was: )

> Improve the performance of getAllPartitionPaths
> ---
>
> Key: HUDI-6476
> URL: https://issues.apache.org/jira/browse/HUDI-6476
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: hudi-utilities
>Reporter: Wechar
>Priority: Major
>  Labels: pull-request-available
> Attachments: After improvement.png, Before improvement.png
>
>
> Currently Hudi lists the status of every file under the Hudi table 
> directory, which can be avoided to improve the performance of 
> getAllPartitionPaths, especially for non-partitioned tables with many files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] wecharyu opened a new pull request, #9121: [HUDI-6476] Improve the performance of getAllPartitionPaths

2023-07-04 Thread via GitHub


wecharyu opened a new pull request, #9121:
URL: https://github.com/apache/hudi/pull/9121

   ### Change Logs
   
   Currently Hudi lists the status of every file under the Hudi table 
directory, which can be avoided to improve the performance of 
`getAllPartitionPaths`, especially for non-partitioned tables with many files. 
What this patch changes:
   
   - reduce a stage in `getPartitionPathWithPathPrefix()`
   - only check directories to find the partition metadata file
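   The second bullet, checking only directories for the partition marker, can 
be sketched as follows. This is a hypothetical Python illustration of the 
idea, not Hudi's actual Java implementation; it relies on the 
`.hoodie_partition_metadata` marker file that Hudi writes into every partition 
directory:
   
   ```python
   # Hypothetical sketch of directory-only partition discovery -- an
   # illustration of the idea, not Hudi's actual Java implementation.
   # Hudi writes a ".hoodie_partition_metadata" marker file into every
   # partition directory; we descend only into sub-directories and stop
   # as soon as a directory contains that marker.
   import os
   
   PARTITION_METADATA_FILE = ".hoodie_partition_metadata"
   
   def get_all_partition_paths(base_path):
       """Return partition paths relative to base_path."""
       partitions = []
       stack = [base_path]
       while stack:
           current = stack.pop()
           entries = os.listdir(current)
           if PARTITION_METADATA_FILE in entries:
               # This directory is a partition: record it and do not
               # descend further (data files below it are never visited).
               partitions.append(os.path.relpath(current, base_path))
               continue
           for name in entries:
               child = os.path.join(current, name)
               # Only directories are pushed onto the stack; the internal
               # ".hoodie" metadata directory is skipped.
               if os.path.isdir(child) and not name.startswith(".hoodie"):
                   stack.append(child)
       return sorted(partitions)
   ```
   
   Compared with listing the full status of every file, the descent stops at 
each partition directory, so data files inside partitions are never 
enumerated further.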
   
   ### Impact
   
   Performance improvement.
   
   ### Risk level (write none, low medium or high below)
   
   None.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-04 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1620367087

   
   ## CI report:
   
   * df41145f4bfa32fbd1f705cd6d04b74a93a0747a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18298)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   




