[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9416:
URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674291201

   
   ## CI report:
   
   * 642c6dd967978781d41b74138f89fae26192056b Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leosanqing commented on a diff in pull request #9297: Generate test jars for hudi-utilities and hudi-hive-sync modules

2023-08-10 Thread via GitHub


leosanqing commented on code in PR #9297:
URL: https://github.com/apache/hudi/pull/9297#discussion_r1290962462


##
hudi-sync/hudi-hive-sync/pom.xml:
##
@@ -200,6 +200,9 @@
 
   
 
+
+  false
+

Review Comment:
   > Weird, I can not reproduce it, maybe it is because of your local mvn 
repository env.
   
   Hello, I also encountered this problem when I used this command to compile 
the project:
   `mvn clean install -DskipTests -Dscala-2.12 -Dspark3.2 
-Dmaven.test.skip=true -Dcheckstyle.skip=true -Dflink1.16 -Drat.skip=true`
   
   `[ERROR] Failed to execute goal on project hudi-utilities_2.12: Could not 
resolve dependencies for project 
org.apache.hudi:hudi-utilities_2.12:jar:0.15.0-SNAPSHOT: 
org.apache.hudi:hudi-hive-sync:jar:tests:0.15.0-SNAPSHOT was not found in 
https://packages.confluent.io/maven/ during a previous attempt. This failure 
was cached in the local repository and resolution is not reattempted until the 
update interval of confluent has elapsed or updates are forced -> [Help 1]
   `
   
   I don't know how to generate this test jar.
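   A likely cause, as an assumption worth verifying: `-Dmaven.test.skip=true` skips *compiling* test classes entirely, so no test jar can be produced for `hudi-hive-sync` (whereas `-DskipTests` compiles tests but does not run them). Test jars are conventionally attached by the `maven-jar-plugin` `test-jar` goal; the exact plugin block in `hudi-hive-sync/pom.xml` may differ, but it is roughly:

   ```xml
   <!-- Sketch of the standard configuration that publishes a module's test
        classes as an additional artifact with <classifier>tests</classifier>. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-jar-plugin</artifactId>
     <executions>
       <execution>
         <goals>
           <goal>test-jar</goal>
         </goals>
       </execution>
     </executions>
   </plugin>
   ```

   Rebuilding with `-DskipTests` only (dropping `-Dmaven.test.skip=true`) and adding `-U` to force Maven past the cached resolution failure should install `hudi-hive-sync-*-tests.jar` into the local repository.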
   
   






[GitHub] [hudi] SteNicholas commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down

2023-08-10 Thread via GitHub


SteNicholas commented on code in PR #8437:
URL: https://github.com/apache/hudi/pull/8437#discussion_r1290959520


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java:
##
@@ -80,12 +104,42 @@ public static ClosableIterator 
getParquetRecordIterator(
   batchSize,
   path,
   splitStart,
-  splitLength));
+  splitLength,
+  filterPredicate,
+  recordFilter));
   if (castProjection.isPresent()) {
 return new SchemaEvolvedRecordIterator(itr, castProjection.get());
   } else {
 return itr;
   }
 }
   }
+
+  private static FilterPredicate getFilterPredicate(Configuration 
configuration) {
+try {
+  return SerializationUtil.readObjectFromConfAsBase64(FILTER_PREDICATE, 
configuration);
+} catch (IOException e) {

Review Comment:
   @danny0405, the filters can be passed through Hadoop configuration entries 
keyed by `FILTER_PREDICATE`, that is 
`parquet.private.read.filter.predicate`, for `HoodieTableSource#getParquetConf` 
and used by either of the available readers, `VectorizedParquetRecordReader` or 
`ParquetRecordReader`. Meanwhile, `UNBOUND_RECORD_FILTER`, which is 
`parquet.read.filter`, is used for the native parquet read filter configuration.  
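   The mechanism described here, shipping a serialized filter through a configuration key as base64, can be sketched with plain JDK serialization. This is a simplified stand-in, not Hudi's or Parquet's actual code: a `HashMap` plays the role of the Hadoop `Configuration`, a `String` stands in for a `FilterPredicate` tree, and Parquet's real `SerializationUtil` additionally gzip-compresses the payload:

   ```java
   import java.io.*;
   import java.util.Base64;
   import java.util.HashMap;
   import java.util.Map;

   public class FilterPredicateConfDemo {
     // Stand-in for the Hadoop Configuration built by getParquetConf.
     public static final Map<String, String> conf = new HashMap<>();
     public static final String FILTER_PREDICATE = "parquet.private.read.filter.predicate";

     // Serialize any Serializable object and store it under the key as base64.
     public static void writeObjectToConfAsBase64(String key, Serializable obj) {
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
         oos.writeObject(obj);
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
       conf.put(key, Base64.getEncoder().encodeToString(bos.toByteArray()));
     }

     // Decode and deserialize; null means no filter was configured, in which
     // case the reader falls back to an unfiltered scan.
     @SuppressWarnings("unchecked")
     public static <T> T readObjectFromConfAsBase64(String key) {
       String encoded = conf.get(key);
       if (encoded == null) {
         return null;
       }
       byte[] bytes = Base64.getDecoder().decode(encoded);
       try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
         return (T) ois.readObject();
       } catch (IOException | ClassNotFoundException e) {
         throw new IllegalStateException(e);
       }
     }

     public static void main(String[] args) {
       // A real predicate would be a FilterPredicate tree, not a String.
       writeObjectToConfAsBase64(FILTER_PREDICATE, "age > 30");
       String roundTripped = readObjectFromConfAsBase64(FILTER_PREDICATE);
       System.out.println(roundTripped);
     }
   }
   ```

   The base64 round trip is what lets an object-valued filter survive the string-only key/value interface of a Hadoop configuration on its way from the planner to the parquet readers.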






[GitHub] [hudi] Zouxxyy commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive

2023-08-10 Thread via GitHub


Zouxxyy commented on PR #9416:
URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674276179

   @suryaprasanna @yihua @prashantwason can you help with a review~





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674256334

   
   ## CI report:
   
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264)
 
   * 482f63ffe2df3fbaf0176a175b530082e0f31154 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19265)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674250531

   
   ## CI report:
   
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262)
 
   * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264)
 
   * 482f63ffe2df3fbaf0176a175b530082e0f31154 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674220535

   
   ## CI report:
   
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262)
 
   * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674216082

   
   ## CI report:
   
   * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662)
 
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262)
 
   * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Commented] (HUDI-6684) Follow up/ fix missing records from bloom filter partition in MDT

2023-08-10 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753046#comment-17753046
 ] 

Sagar Sumit commented on HUDI-6684:
---

Let's think about when this could happen, but if it is missing then why not 
simply add it?

> Follow up/ fix missing records from bloom filter partition in MDT
> -
>
> Key: HUDI-6684
> URL: https://issues.apache.org/jira/browse/HUDI-6684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> As of now, if a bloom filter for a file is missing from bloom filter 
> partition in MDT, we ignore it. 
> HoodieTableMetadataUtil
> {code:java}
>   // If reading the bloom filter failed then do not add a record for this file
>   if (bloomFilterBuffer == null) {
> LOG.error("Failed to read bloom filter from " + addedFilePath);
> return Stream.empty().iterator();
>   }
> } {code}
> We should think about in what scenarios this is possible and how exactly we 
> can handle such situations. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] codope commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


codope commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1290894426


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -848,64 +851,49 @@ public static HoodieData 
convertFilesToBloomFilterRecords(HoodieEn
   
Map> partitionToAppendedFiles,
   
MetadataRecordsGenerationParams recordsGenerationParams,
   
String instantTime) {
-HoodieData allRecordsRDD = engineContext.emptyHoodieData();
-
-List>> partitionToDeletedFilesList = 
partitionToDeletedFiles.entrySet()
-.stream().map(e -> Pair.of(e.getKey(), 
e.getValue())).collect(Collectors.toList());
-int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), 
recordsGenerationParams.getBloomIndexParallelism()), 1);
-HoodieData>> partitionToDeletedFilesRDD = 
engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-HoodieData deletedFilesRecordsRDD = 
partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-  final String partitionName = partitionToDeletedFilesPair.getLeft();
-  final List deletedFileList = 
partitionToDeletedFilesPair.getRight();
-  return deletedFileList.stream().flatMap(deletedFile -> {
-if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-  return Stream.empty();
-}
-
-final String partition = getPartitionIdentifier(partitionName);
-return 
Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, 
ByteBuffer.allocate(0), true));
-  }).iterator();
-});
-allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+// Total number of files which are added or deleted
+final int totalFiles = 
partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+// Create the tuple (partition, filename, isDeleted) to handle both 
deletes and appends
+final List> partitionFileFlagTupleList = 
new ArrayList<>(totalFiles);
+partitionToDeletedFiles.entrySet().stream()
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new 
Tuple3<>(entry.getKey(), deletedFile, true)))
+.collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+partitionToAppendedFiles.entrySet().stream()
+.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> 
new Tuple3<>(entry.getKey(), addedFile, false)))
+.collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+
+// Create records MDT
+int parallelism = Math.max(Math.min(partitionFileFlagTupleList.size(), 
recordsGenerationParams.getBloomIndexParallelism()), 1);
+return engineContext.parallelize(partitionFileFlagTupleList, 
parallelism).flatMap(partitionFileFlagTuple -> {
+  final String partitionName = partitionFileFlagTuple._1();
+  final String filename = partitionFileFlagTuple._2();
+  final boolean isDeleted = partitionFileFlagTuple._3();
+  if (!FSUtils.isBaseFile(new Path(filename))) {
+LOG.warn(String.format("Ignoring file %s as it is not a base file", 
filename));
+return Stream.empty().iterator();
+  }
 
-List>> partitionToAppendedFilesList = 
partitionToAppendedFiles.entrySet()
-.stream().map(entry -> Pair.of(entry.getKey(), 
entry.getValue())).collect(Collectors.toList());
-parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), 
recordsGenerationParams.getBloomIndexParallelism()), 1);
-HoodieData>> partitionToAppendedFilesRDD = 
engineContext.parallelize(partitionToAppendedFilesList, parallelism);
+  // Read the bloom filter from the base file if the file is being added
+  ByteBuffer bloomFilterBuffer = ByteBuffer.allocate(0);
+  if (!isDeleted) {
+final String pathWithPartition = partitionName + "/" + filename;
+final Path addedFilePath = new 
Path(recordsGenerationParams.getDataMetaClient().getBasePath(), 
pathWithPartition);
+bloomFilterBuffer = 
readBloomFilter(recordsGenerationParams.getDataMetaClient().getHadoopConf(), 
addedFilePath);
+
+// If reading the bloom filter failed then do not add a record for 
this file
+if (bloomFilterBuffer == null) {
+  LOG.error("Failed to read bloom filter from " + addedFilePath);
+  return Stream.empty().iterator();

Review Comment:
   why not simply add to bloom?




[jira] [Closed] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions

2023-08-10 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain closed HUDI-6677.
-
Resolution: Not A Problem

> Make HoodieRecordIndexInfo schema compatible with older versions
> 
>
> Key: HUDI-6677
> URL: https://issues.apache.org/jira/browse/HUDI-6677
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently the metadata payload schema for record index can cause schema 
> evolution issues for existing hudi tables. The Jira aims to fix these issues. 
> There are two schema evolution issues -:
> 1. The field name has changed from partition to partitionName.
> 2. Also we have added a new field fileId in between a nested schema.





[GitHub] [hudi] lokeshj1703 closed pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo schema compatible with older versions

2023-08-10 Thread via GitHub


lokeshj1703 closed pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo 
schema compatible with older versions
URL: https://github.com/apache/hudi/pull/9415





[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674190791

   
   ## CI report:
   
   * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19256)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674190580

   
   ## CI report:
   
   * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662)
 
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9416:
URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674190763

   
   ## CI report:
   
   * 3792e6de4fbf7642011c3d723f8e514f89c991ae Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19248)
 
   * 642c6dd967978781d41b74138f89fae26192056b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9416:
URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674185958

   
   ## CI report:
   
   * 3792e6de4fbf7642011c3d723f8e514f89c991ae Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19248)
 
   * 642c6dd967978781d41b74138f89fae26192056b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674185739

   
   ## CI report:
   
   * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662)
 
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674181982

   
   ## CI report:
   
   * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662)
 
   * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


nsivabalan commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674179663

   @codope : addressed all feedback. 
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


nsivabalan commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863005


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -848,64 +851,49 @@ public static HoodieData 
convertFilesToBloomFilterRecords(HoodieEn
   
Map> partitionToAppendedFiles,
   
MetadataRecordsGenerationParams recordsGenerationParams,
   
String instantTime) {
-HoodieData allRecordsRDD = engineContext.emptyHoodieData();
-
-List>> partitionToDeletedFilesList = 
partitionToDeletedFiles.entrySet()
-.stream().map(e -> Pair.of(e.getKey(), 
e.getValue())).collect(Collectors.toList());
-int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), 
recordsGenerationParams.getBloomIndexParallelism()), 1);
-HoodieData>> partitionToDeletedFilesRDD = 
engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-HoodieData deletedFilesRecordsRDD = 
partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-  final String partitionName = partitionToDeletedFilesPair.getLeft();
-  final List deletedFileList = 
partitionToDeletedFilesPair.getRight();
-  return deletedFileList.stream().flatMap(deletedFile -> {
-if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-  return Stream.empty();
-}
-
-final String partition = getPartitionIdentifier(partitionName);
-return 
Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, 
ByteBuffer.allocate(0), true));
-  }).iterator();
-});
-allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+// Total number of files which are added or deleted
+final int totalFiles = 
partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+// Create the tuple (partition, filename, isDeleted) to handle both 
deletes and appends
+final List> partitionFileFlagTupleList = 
new ArrayList<>(totalFiles);
+partitionToDeletedFiles.entrySet().stream()
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new 
Tuple3<>(entry.getKey(), deletedFile, true)))
+.collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+partitionToAppendedFiles.entrySet().stream()
+.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> 
new Tuple3<>(entry.getKey(), addedFile, false)))
+.collect(Collectors.toCollection(() -> partitionFileFlagTupleList));

Review Comment:
   there are some minor differences between col stats and bloom filter w.r.t. log file 
handling. So, maybe we can leave it as is. 
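   For reference, the flattening pattern the patch applies, collapsing the deleted-files map and appended-files map into one list of (partition, file, isDeleted) triples and then clamping the parallelism, can be sketched in plain Java. A small record stands in for Scala's `Tuple3`, and the map shapes mirror `partitionToDeletedFiles` and `partitionToAppendedFiles`:

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   public class BloomRecordPlanDemo {
     // Stand-in for the Tuple3<String, String, Boolean> used in the patch.
     record FileFlag(String partition, String file, boolean deleted) {}

     // Flatten both maps into one work list, tagging each file with a delete flag,
     // so a single parallelized pass can handle deletes and appends together.
     static List<FileFlag> buildPlan(Map<String, List<String>> deleted,
                                     Map<String, Map<String, Long>> appended) {
       List<FileFlag> plan = new ArrayList<>();
       deleted.forEach((p, files) -> files.forEach(f -> plan.add(new FileFlag(p, f, true))));
       appended.forEach((p, files) -> files.keySet().forEach(f -> plan.add(new FileFlag(p, f, false))));
       return plan;
     }

     // Same clamp as the patch: never below 1, never above the configured
     // bloom index parallelism.
     static int parallelism(int tasks, int configured) {
       return Math.max(Math.min(tasks, configured), 1);
     }

     public static void main(String[] args) {
       Map<String, List<String>> deleted = Map.of("2023/08/10", List.of("f1.parquet"));
       Map<String, Map<String, Long>> appended = Map.of("2023/08/10", Map.of("f2.parquet", 1024L));
       List<FileFlag> plan = buildPlan(deleted, appended);
       System.out.println(plan.size() + " tasks, parallelism " + parallelism(plan.size(), 200));
     }
   }
   ```

   Compared with the old code, one `parallelize(...).flatMap(...)` over this combined list replaces two separate RDDs and a union, which is where the speedup on large datasets comes from.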






[jira] [Created] (HUDI-6684) Follow up/ fix missing records from bloom filter partition in MDT

2023-08-10 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6684:
-

 Summary: Follow up/ fix missing records from bloom filter 
partition in MDT
 Key: HUDI-6684
 URL: https://issues.apache.org/jira/browse/HUDI-6684
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


As of now, if a bloom filter for a file is missing from bloom filter partition 
in MDT, we ignore it. 

HoodieTableMetadataUtil
{code:java}
  // If reading the bloom filter failed then do not add a record for this file
  if (bloomFilterBuffer == null) {
LOG.error("Failed to read bloom filter from " + addedFilePath);
return Stream.empty().iterator();
  }
} {code}
We should think about in what scenarios this is possible and how exactly we can 
handle such situations. 

 





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


nsivabalan commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863860


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -848,64 +851,49 @@ public static HoodieData 
convertFilesToBloomFilterRecords(HoodieEn
   
Map> partitionToAppendedFiles,
   
MetadataRecordsGenerationParams recordsGenerationParams,
   
String instantTime) {
-HoodieData allRecordsRDD = engineContext.emptyHoodieData();
-
-List>> partitionToDeletedFilesList = 
partitionToDeletedFiles.entrySet()
-.stream().map(e -> Pair.of(e.getKey(), 
e.getValue())).collect(Collectors.toList());
-int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), 
recordsGenerationParams.getBloomIndexParallelism()), 1);
-HoodieData>> partitionToDeletedFilesRDD = 
engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-HoodieData deletedFilesRecordsRDD = 
partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-  final String partitionName = partitionToDeletedFilesPair.getLeft();
-  final List deletedFileList = 
partitionToDeletedFilesPair.getRight();
-  return deletedFileList.stream().flatMap(deletedFile -> {
-if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-  return Stream.empty();
-}
-
-final String partition = getPartitionIdentifier(partitionName);
-return 
Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, 
ByteBuffer.allocate(0), true));
-  }).iterator();
-});
-allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+// Total number of files which are added or deleted
+final int totalFiles = 
partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+// Create the tuple (partition, filename, isDeleted) to handle both 
deletes and appends
+final List<Tuple3<String, String, Boolean>> partitionFileFlagTupleList = new ArrayList<>(totalFiles);
+partitionToDeletedFiles.entrySet().stream()
+    .flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true)))
+    .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+partitionToAppendedFiles.entrySet().stream()
+    .flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false)))
+    .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+
+// Create records MDT
+int parallelism = Math.max(Math.min(partitionFileFlagTupleList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
+return engineContext.parallelize(partitionFileFlagTupleList, parallelism).flatMap(partitionFileFlagTuple -> {
+  final String partitionName = partitionFileFlagTuple._1();
+  final String filename = partitionFileFlagTuple._2();
+  final boolean isDeleted = partitionFileFlagTuple._3();
+  if (!FSUtils.isBaseFile(new Path(filename))) {
+    LOG.warn(String.format("Ignoring file %s as it is not a base file", filename));
+    return Stream.empty().iterator();
+  }
 
-List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
-    .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
-parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism);
+  // Read the bloom filter from the base file if the file is being added
+  ByteBuffer bloomFilterBuffer = ByteBuffer.allocate(0);
+  if (!isDeleted) {
+    final String pathWithPartition = partitionName + "/" + filename;
+    final Path addedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition);
+    bloomFilterBuffer = readBloomFilter(recordsGenerationParams.getDataMetaClient().getHadoopConf(), addedFilePath);
+
+    // If reading the bloom filter failed then do not add a record for this file
+    if (bloomFilterBuffer == null) {
+      LOG.error("Failed to read bloom filter from " + addedFilePath);
+      return Stream.empty().iterator();

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-6684
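   The (partition, filename, isDeleted) flattening used in this hunk can be sketched stand-alone. `FileFlag` below is a minimal stand-in for the `Tuple3` in the patch (not Hudi's class), and the parallelism cap is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FlattenFilesDemo {
  // Minimal stand-in for the Tuple3 used in the patch (not Hudi's class).
  static class FileFlag {
    final String partition; final String file; final boolean deleted;
    FileFlag(String partition, String file, boolean deleted) {
      this.partition = partition; this.file = file; this.deleted = deleted;
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> deleted = Map.of("p1", List.of("a.parquet"));
    Map<String, Map<String, Long>> appended = Map.of("p1", Map.of("b.parquet", 10L));

    // Flatten deletes and appends into one list so a single parallelize()
    // call can cover both with uniform parallelism.
    List<FileFlag> tuples = new ArrayList<>();
    deleted.forEach((p, files) -> files.forEach(f -> tuples.add(new FileFlag(p, f, true))));
    appended.forEach((p, files) -> files.keySet().forEach(f -> tuples.add(new FileFlag(p, f, false))));

    int indexParallelism = 200; // stands in for the configured bloom index parallelism
    int parallelism = Math.max(Math.min(tuples.size(), indexParallelism), 1);
    System.out.println(tuples.size() + " " + parallelism);  // prints "2 2"
  }
}
```

   The single flattened list is what lets the patch replace two parallelize passes (deletes, then appends) with one.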



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


nsivabalan commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863005


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -848,64 +851,49 @@ public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEn
                                                                        Map<String, Map<String, Long>> partitionToAppendedFiles,
                                                                        MetadataRecordsGenerationParams recordsGenerationParams,
                                                                        String instantTime) {
-    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
-
-    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionName = partitionToDeletedFilesPair.getLeft();
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-          return Stream.empty();
-        }
-
-        final String partition = getPartitionIdentifier(partitionName);
-        return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
-      }).iterator();
-    });
-    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+    // Total number of files which are added or deleted
+    final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
+        + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+    // Create the tuple (partition, filename, isDeleted) to handle both deletes and appends
+    final List<Tuple3<String, String, Boolean>> partitionFileFlagTupleList = new ArrayList<>(totalFiles);
+    partitionToDeletedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+    partitionToAppendedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));

Review Comment:
   there are some minor differences b/w col stats and bloom filter wrt log file handling. So, maybe we can leave it as is. 






[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-10 Thread via GitHub


nsivabalan commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1290861309


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -915,65 +903,60 @@ public static HoodieData<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEn
                                                                        Map<String, List<String>> partitionToDeletedFiles,
                                                                        Map<String, Map<String, Long>> partitionToAppendedFiles,
                                                                        MetadataRecordsGenerationParams recordsGenerationParams) {
-    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+    // Find the columns to index
     HoodieTableMetaClient dataTableMetaClient = recordsGenerationParams.getDataMetaClient();
-
     final List<String> columnsToIndex =
         getColumnsToIndex(recordsGenerationParams,
             Lazy.lazily(() -> tryResolveSchemaForTable(dataTableMetaClient)));
-
     if (columnsToIndex.isEmpty()) {
       // In case there are no columns to index, bail
       return engineContext.emptyHoodieData();
     }
 
-    final List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream()
-        .map(e -> Pair.of(e.getKey(), e.getValue()))
-        .collect(Collectors.toList());
-
-    int deletedFilesTargetParallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1);
-    final HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD =
-        engineContext.parallelize(partitionToDeletedFilesList, deletedFilesTargetParallelism);
-
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionPath = partitionToDeletedFilesPair.getLeft();
-      final String partitionId = getPartitionIdentifier(partitionPath);
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        final String filePathWithPartition = partitionPath + "/" + deletedFile;
-        return getColumnStatsRecords(partitionId, filePathWithPartition, dataTableMetaClient, columnsToIndex, true);
-      }).iterator();
-    });
-
-    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
-
-    final List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet().stream()
-        .map(entry -> Pair.of(entry.getKey(), entry.getValue()))
-        .collect(Collectors.toList());
-
-    int appendedFilesTargetParallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1);
-    final HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD =
-        engineContext.parallelize(partitionToAppendedFilesList, appendedFilesTargetParallelism);
-
-    HoodieData<HoodieRecord> appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesPair -> {
-      final String partitionPath = partitionToAppendedFilesPair.getLeft();
-      final String partitionId = getPartitionIdentifier(partitionPath);
-      final Map<String, Long> appendedFileMap = partitionToAppendedFilesPair.getRight();
+    LOG.info(String.format("Indexing %d columns for column stats index", columnsToIndex.size()));
+
+    // Total number of files which are added or deleted
+    final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
+        + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+    // Create the tuple (partition, filename, isDeleted) to handle both deletes and appends
+    final List<Tuple3<String, String, Boolean>> partitionFileFlagTupleList = new ArrayList<>(totalFiles);

Review Comment:
   we do N * M, where N = number of columns to index and M = number of (partition, filename, isDeleted) tuples. So we don't need it here. You can check this method:
   getColumnStatsRecords(partitionId, filePathWithPartition, dataTableMetaClient, columnsToIndex, isDeleted).iterator();
   






[hudi] branch master updated: [HUDI-6670] Fix timeline check in metadata table validator (#9405)

2023-08-10 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 638e52d90ed [HUDI-6670] Fix timeline check in metadata table validator 
(#9405)
638e52d90ed is described below

commit 638e52d90eda2d7c1e78a87f08427e5e3bf0a46c
Author: Y Ethan Guo 
AuthorDate: Thu Aug 10 20:29:36 2023 -0700

[HUDI-6670] Fix timeline check in metadata table validator (#9405)
---
 .../org/apache/hudi/utilities/HoodieMetadataTableValidator.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
index d79957c735f..29e59df6935 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
@@ -491,10 +491,10 @@ public class HoodieMetadataTableValidator implements 
Serializable {
   .setConf(jsc.hadoopConfiguration()).setBasePath(new 
Path(cfg.basePath, HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH).toString())
   .setLoadActiveTimelineOnLoad(true)
   .build();
-  int finishedInstants = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().countInstants();
+  int finishedInstants = 
mdtMetaClient.getCommitsTimeline().filterCompletedInstants().countInstants();
   if (finishedInstants == 0) {
-if 
(metaClient.getActiveTimeline().filterCompletedInstants().countInstants() == 0) 
{
-  LOG.info("There is no completed instant both in metadata table and 
corresponding data table.");
+if 
(metaClient.getCommitsTimeline().filterCompletedInstants().countInstants() == 
0) {
+  LOG.info("There is no completed commit in both metadata table and 
corresponding data table.");
   return false;
 } else {
   throw new HoodieValidationException("There is no completed instant 
for metadata table.");



[GitHub] [hudi] yihua merged pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator

2023-08-10 Thread via GitHub


yihua merged PR #9405:
URL: https://github.com/apache/hudi/pull/9405





[GitHub] [hudi] yihua commented on pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator

2023-08-10 Thread via GitHub


yihua commented on PR #9405:
URL: https://github.com/apache/hudi/pull/9405#issuecomment-1674170095

   Azure CI timeout is irrelevant.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive

2023-08-10 Thread via GitHub


Zouxxyy commented on code in PR #9416:
URL: https://github.com/apache/hudi/pull/9416#discussion_r1290859610


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -452,107 +431,137 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException {
           ? CompactionUtils.getOldestInstantToRetainForCompaction(
               table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax())
           : Option.empty();
+  oldestInstantToRetainCandidates.add(oldestInstantToRetainForCompaction);
 
-  // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned,
+  // 3. The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned,
   // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers.
   // Meanwhile, when inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline
   // to check whether the file slice generated in pending clustering after archive isn't committed.
   Option<HoodieInstant> oldestInstantToRetainForClustering =
       ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(), table.getMetaClient());
+  oldestInstantToRetainCandidates.add(oldestInstantToRetainForClustering);
+
+  // 4. If metadata table is enabled, do not archive instants which are more recent than the last compaction on the
+  // metadata table.
+  if (table.getMetaClient().getTableConfig().isMetadataTableAvailable()) {
+    try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(), config.getBasePath())) {
+      Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime();
+      if (!latestCompactionTime.isPresent()) {
+        LOG.info("Not archiving as there is no compaction yet on the metadata table");
+        return Collections.emptyList();
+      } else {
+        LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get());
+        oldestInstantToRetainCandidates.add(Option.of(new HoodieInstant(
+            HoodieInstant.State.COMPLETED, COMPACTION_ACTION, latestCompactionTime.get())));
+      }
+    } catch (Exception e) {
+      throw new HoodieException("Error limiting instant archival based on metadata table", e);
+    }
+  }
+
+  // 5. If this is a metadata table, do not archive the commits that live in data set
+  // active timeline. This is required by metadata table,
+  // see HoodieTableMetadataUtil#processRollbackMetadata for details.
+  if (table.isMetadataTable()) {
+    HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder()
+        .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath()))
+        .setConf(metaClient.getHadoopConf())
+        .build();
+    Option<HoodieInstant> qualifiedEarliestInstant =
+        TimelineUtils.getEarliestInstantForMetadataArchival(
+            dataMetaClient.getActiveTimeline(), config.shouldArchiveBeyondSavepoint());
+
+    // Do not archive the instants after the earliest commit (COMMIT, DELTA_COMMIT, and
+    // REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive
+    // beyond savepoint) and the earliest inflight instant (all actions).
+    // This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata
+    // for details.
+    // Todo: Remove #7580

Review Comment:
   After this PR, #7580 is no longer useful; consider removing or simplifying it
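
   The candidate-list approach in this hunk amounts to honoring the earliest of several optional retain bounds. A stand-alone sketch with illustrative timestamps, using plain `Optional` rather than Hudi's `Option`:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class OldestInstantDemo {
  public static void main(String[] args) {
    // Hypothetical candidate retain bounds; an empty Optional means the
    // corresponding constraint (cleaning, compaction, clustering, ...) does not apply.
    List<Optional<String>> candidates = Arrays.asList(
        Optional.of("20230810120000"),
        Optional.empty(),
        Optional.of("20230809090000"));

    // The archiver must honor the strictest bound, i.e. the earliest instant time.
    Optional<String> oldestToRetain = candidates.stream()
        .filter(Optional::isPresent)
        .map(Optional::get)
        .min(Comparator.naturalOrder());

    System.out.println(oldestToRetain.get());  // prints 20230809090000
  }
}
```

   Collecting every bound into one list before taking the minimum is what lets the patch add new constraints (steps 4 and 5) without changing the selection logic.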






[GitHub] [hudi] yihua commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


yihua commented on code in PR #9421:
URL: https://github.com/apache/hudi/pull/9421#discussion_r1290853583


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##
@@ -217,7 +217,7 @@ public static Pair> 
filterAndGenerateChe
   row = collectedRows.select(queryInfo.getOrderColumn(), 
queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy(
   col(queryInfo.getOrderColumn()).desc(), 
col(queryInfo.getKeyColumn()).desc()).first();
 }
-LOG.info("Processed batch size: " + row.getLong(2) + " bytes");
+LOG.info("Processed batch size: " + 
row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes");

Review Comment:
   Got it
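
   The fix swaps a positional lookup (`row.getLong(2)`) for a by-name lookup. A minimal stand-in for Spark's `Row.fieldIndex` (no Spark dependency; schema and values are illustrative):

```java
import java.util.List;

public class FieldIndexDemo {
  // Mirrors row.get(row.fieldIndex(name)): resolve the position from the
  // schema instead of hardcoding an index that breaks when the select order changes.
  static long byName(List<String> schema, List<Object> row, String name) {
    return (Long) row.get(schema.indexOf(name));
  }

  public static void main(String[] args) {
    List<String> schema = List.of("orderCol", "keyCol", "cumulativeBytes");
    List<Object> row = List.of(1L, 2L, 42L);
    System.out.println(byName(schema, row, "cumulativeBytes"));  // prints 42
  }
}
```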






[GitHub] [hudi] JingFengWang opened a new issue, #9424: 'read.utc-timezone=false' has no effect on writes

2023-08-10 Thread via GitHub


JingFengWang opened a new issue, #9424:
URL: https://github.com/apache/hudi/issues/9424

   **_Tips before filing an issue_**
   With hudi 0.14.0 hudi-flink-bundle, COW/MOR tables write timestamp data in the UTC time zone even when read.utc-timezone=false is set.
   The timestamp time zone conversion in AvroToRowDataConverters and RowDataToAvroConverters is hardcoded to UTC.
   
   **Describe the problem you faced**
   1. hudi-flink1.13-bundle-0.14.0-rc1 does not support configuring the time zone when writing timestamps
   2. The read.utc-timezone attribute only takes effect when the data is read
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. ./bin/sql-client.sh embedded -j hudi-flink1.13-bundle-0.14.0-rc1.jar shell
   2. When read.utc-timezone=true is set, writing and querying timestamp data works as expected
   3. When read.utc-timezone=false is set and data is written, the queried time is off by -8 hours
   ```sql
   Flink SQL> select LOCALTIMESTAMP as tm, timestamph from 
hudi_mor_all_datatype_2 where inth=44;
   ++-+-+
   | op |  tm |  timestamph |
   ++-+-+
   | +I | 2023-08-11 10:36:38.793 | 2023-08-11 03:10:17.267 |
   ++-+-+
   ```
   
   **Expected behavior**
   
   hudi-flink1.13-bundle should support writing timestamps in non-UTC time zones in a configurable way
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.2.0
   
   * Flink version: 1.13.2
   
   * Hive version : 1.11.1
   
   * Hadoop version : 3.x
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   **Related code location**
   ```java
   public class AvroToRowDataConverters {
   // ...
 private static AvroToRowDataConverter createTimestampConverter(int 
precision) {
   // ...
   return avroObject -> {
 final Instant instant;
 if (avroObject instanceof Long) {
   instant = Instant.EPOCH.plus((Long) avroObject, chronoUnit);
 } else if (avroObject instanceof Instant) {
   instant = (Instant) avroObject;
 } else {
   JodaConverter jodaConverter = JodaConverter.getConverter();
   if (jodaConverter != null) {
 // joda time has only millisecond precision
 instant = 
Instant.ofEpochMilli(jodaConverter.convertTimestamp(avroObject));
   } else {
 throw new IllegalArgumentException(
 "Unexpected object type for TIMESTAMP logical type. Received: 
" + avroObject);
   }
 }
 // TODO:Hardcoded to UTC here
 return TimestampData.fromInstant(instant);
   };
 }
   // ...
   }
   
   public class RowDataToAvroConverters {
   // ...
 public static RowDataToAvroConverter createConverter(LogicalType type) {
 // ...
 case TIMESTAMP_WITHOUT_TIME_ZONE:
   final int precision = DataTypeUtils.precision(type);
   if (precision <= 3) {
 converter =
 new RowDataToAvroConverter() {
   private static final long serialVersionUID = 1L;
   
   @Override
   public Object convert(Schema schema, Object object) {
 // TODO:Hardcoded to UTC here
 return ((TimestampData) object).toInstant().toEpochMilli();
   }
 };
   } else if (precision <= 6) {
 converter =
 new RowDataToAvroConverter() {
   private static final long serialVersionUID = 1L;
   
   @Override
   public Object convert(Schema schema, Object object) {
 // TODO:Hardcoded to UTC here
 Instant instant = ((TimestampData) object).toInstant();
 return  
Math.addExact(Math.multiplyExact(instant.getEpochSecond(), 1000_000), 
instant.getNano() / 1000);
   }
 };
   } else {
 throw new UnsupportedOperationException("Unsupported timestamp 
precision: " + precision);
   }
   break;
 // ...
 }
   // ...
   }
   ```
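
   For context on the reported -8h shift: interpreting the same wall-clock timestamp in UTC versus a local zone changes the stored epoch value by the zone offset. A minimal sketch (zone and timestamp are illustrative):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampZoneDemo {
  public static void main(String[] args) {
    // A wall-clock timestamp as it appears in the table (no zone attached).
    LocalDateTime wallClock = LocalDateTime.of(2023, 8, 11, 10, 36, 38);

    // Interpreting it in UTC vs. in a local zone (Asia/Shanghai, UTC+8)
    // yields epoch values 8 hours apart -- the shift seen in the issue.
    long utcMillis = wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
    long localMillis = wallClock.atZone(ZoneId.of("Asia/Shanghai")).toInstant().toEpochMilli();

    System.out.println((utcMillis - localMillis) / 3_600_000);  // prints 8
  }
}
```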





[GitHub] [hudi] danny0405 commented on issue #9344: [SUPPORT] Getting error when writing to different HUDI tables in different threads in same job

2023-08-10 Thread via GitHub


danny0405 commented on issue #9344:
URL: https://github.com/apache/hudi/issues/9344#issuecomment-1674158246

   I'm assuming you are using the MDT; did you check the existence of the missing file:
   
   ```xml
... 1 more
   Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://***/hudi_parallel_process/assets/asset_group/c9a7b1d3-c065-4902-a605-0fc114f33b2c-0_0-370-76132_20230801080422725.parquet
   ```





[hudi] branch master updated: [MINOR] Unify class name of Spark Procedure (#9414)

2023-08-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d288d97fb40 [MINOR] Unify class name of Spark Procedure (#9414)
d288d97fb40 is described below

commit d288d97fb4031e71afce6ee3cfe7c286f3204e76
Author: Kunni 
AuthorDate: Fri Aug 11 10:57:48 2023 +0800

[MINOR] Unify class name of Spark Procedure (#9414)
---
 .../{CopyToTempView.scala => CopyToTempViewProcedure.scala}   | 8 
 .../spark/sql/hudi/command/procedures/HoodieProcedures.scala  | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala
similarity index 95%
rename from 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala
rename to 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala
index 89c00dac6e4..a23eea1363e 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala
@@ -24,7 +24,7 @@ import org.apache.spark.sql.types.{DataTypes, Metadata, 
StructField, StructType}
 
 import java.util.function.Supplier
 
-class CopyToTempView extends BaseProcedure with ProcedureBuilder with Logging {
+class CopyToTempViewProcedure extends BaseProcedure with ProcedureBuilder with 
Logging {
 
   private val PARAMETERS = Array[ProcedureParameter](
 ProcedureParameter.required(0, "table", DataTypes.StringType),
@@ -102,13 +102,13 @@ class CopyToTempView extends BaseProcedure with 
ProcedureBuilder with Logging {
 Seq(Row(0))
   }
 
-  override def build = new CopyToTempView()
+  override def build = new CopyToTempViewProcedure()
 }
 
-object CopyToTempView {
+object CopyToTempViewProcedure {
   val NAME = "copy_to_temp_view"
 
   def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] {
-override def get() = new CopyToTempView()
+override def get() = new CopyToTempViewProcedure()
   }
 }
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
index d54c9811925..ad63ddbb29e 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala
@@ -84,7 +84,7 @@ object HoodieProcedures {
   ,(ValidateHoodieSyncProcedure.NAME, ValidateHoodieSyncProcedure.builder)
   ,(ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder)
   ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder)
-  ,(CopyToTempView.NAME, CopyToTempView.builder)
+  ,(CopyToTempViewProcedure.NAME, CopyToTempViewProcedure.builder)
   ,(ShowCommitExtraMetadataProcedure.NAME, 
ShowCommitExtraMetadataProcedure.builder)
   ,(ShowTablePropertiesProcedure.NAME, 
ShowTablePropertiesProcedure.builder)
   ,(HelpProcedure.NAME, HelpProcedure.builder)



[GitHub] [hudi] danny0405 merged pull request #9414: [MINOR] Unify class name of Spark Procedure

2023-08-10 Thread via GitHub


danny0405 merged PR #9414:
URL: https://github.com/apache/hudi/pull/9414





[GitHub] [hudi] danny0405 closed issue #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit

2023-08-10 Thread via GitHub


danny0405 closed issue #9420: [SUPPORT] - Fixing the info log to fetch column 
value by name instead of index in function 
filterAndGenerateCheckpointBasedOnSourceLimit
URL: https://github.com/apache/hudi/issues/9420





[GitHub] [hudi] danny0405 commented on issue #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit

2023-08-10 Thread via GitHub


danny0405 commented on issue #9420:
URL: https://github.com/apache/hudi/issues/9420#issuecomment-1674155705

   Fixed in https://github.com/apache/hudi/pull/9421.





[hudi] branch master updated (e6d1e419c99 -> 6a8f00a1820)

2023-08-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e6d1e419c99 [MINOR] Increase CI timeout for UT FT other modules to 4 
hours (#9423)
 add 6a8f00a1820 [HUDI-6680] Fixing the info log to fetch column value by 
name instead of index (#9421)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [hudi] danny0405 merged pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


danny0405 merged PR #9421:
URL: https://github.com/apache/hudi/pull/9421





[GitHub] [hudi] danny0405 commented on a diff in pull request #9413: [HUDI-6675] Fix Clean action will delete the whole table

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #9413:
URL: https://github.com/apache/hudi/pull/9413#discussion_r1290848015


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java:
##
@@ -147,7 +148,9 @@ List<HoodieCleanStat> clean(HoodieEngineContext context, HoodieCleanerPlan clean
     List<String> partitionsToBeDeleted = cleanerPlan.getPartitionsToBeDeleted() != null ? cleanerPlan.getPartitionsToBeDeleted() : new ArrayList<>();
     partitionsToBeDeleted.forEach(entry -> {
       try {
-        deleteFileAndGetResult(table.getMetaClient().getFs(), table.getMetaClient().getBasePath() + "/" + entry);
+        if (!StringUtils.isNullOrEmpty(entry)) {
+          deleteFileAndGetResult(table.getMetaClient().getFs(), table.getMetaClient().getBasePath() + "/" + entry);

Review Comment:
   Kind of think the `cleanerPlan.getPartitionsToBeDeleted()` should be fixed; can we write a test case for it?






[jira] [Created] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-10 Thread Danny Chen (Jira)
Danny Chen created HUDI-6683:


 Summary: Added kafka key as part of hudi metadata columns for Json 
& Avro KafkaSource
 Key: HUDI-6683
 URL: https://issues.apache.org/jira/browse/HUDI-6683
 Project: Apache Hudi
  Issue Type: New Feature
  Components: deltastreamer
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)

2023-08-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e6d1e419c99 [MINOR] Increase CI timeout for UT FT other modules to 4 
hours (#9423)
e6d1e419c99 is described below

commit e6d1e419c99f8226c831b6ccbcd22b07510f0fbc
Author: Sagar Sumit 
AuthorDate: Fri Aug 11 08:12:38 2023 +0530

[MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)
---
 azure-pipelines-20230430.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml
index 75c231b74dc..2da5ab0d4f9 100644
--- a/azure-pipelines-20230430.yml
+++ b/azure-pipelines-20230430.yml
@@ -188,7 +188,7 @@ stages:
 displayName: Top 100 long-running testcases
   - job: UT_FT_4
 displayName: UT FT other modules
-timeoutInMinutes: '180'
+timeoutInMinutes: '240'
 steps:
   - task: Maven@4
 displayName: maven install



[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #9412:
URL: https://github.com/apache/hudi/pull/9412#discussion_r1290844141


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala:
##
@@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase {
 }
   }
 
+  test("Test create table like") {
+if (HoodieSparkUtils.gteqSpark3_1) {
+  // 1. Test create table from an existing HUDI table
+  withTempDir { tmp =>

Review Comment:
   We should avoid possible misuse, or make it clear in the document.






[GitHub] [hudi] nsivabalan merged pull request #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours

2023-08-10 Thread via GitHub


nsivabalan merged PR #9423:
URL: https://github.com/apache/hudi/pull/9423





[GitHub] [hudi] danny0405 commented on pull request #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours

2023-08-10 Thread via GitHub


danny0405 commented on PR #9423:
URL: https://github.com/apache/hudi/pull/9423#issuecomment-1674148873

   4 hours is quite long, not sure we should do this.





[hudi] branch master updated: [MINOR] asyncService log prompt incomplete (#9407)

2023-08-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 28d43f3a4f9 [MINOR] asyncService log prompt incomplete (#9407)
28d43f3a4f9 is described below

commit 28d43f3a4f92c4996712cdb5abc13e0b2b7897e8
Author: empcl <1515827...@qq.com>
AuthorDate: Fri Aug 11 10:38:10 2023 +0800

[MINOR] asyncService log prompt incomplete (#9407)
---
 .../src/main/java/org/apache/hudi/async/HoodieAsyncService.java   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
index 4c1dddf265e..f022e710456 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
@@ -196,11 +196,11 @@ public abstract class HoodieAsyncService implements 
Serializable {
   }
 
   /**
-   * Enqueues new pending clustering instant.
+   * Enqueues new pending table service instant.
* @param instant {@link HoodieInstant} to enqueue.
*/
   public void enqueuePendingAsyncServiceInstant(HoodieInstant instant) {
-LOG.info("Enqueuing new pending clustering instant: " + 
instant.getTimestamp());
+LOG.info("Enqueuing new pending table service instant: " + 
instant.getTimestamp());
 pendingInstants.add(instant);
   }
 



[GitHub] [hudi] danny0405 merged pull request #9407: asyncService log prompt incomplete

2023-08-10 Thread via GitHub


danny0405 merged PR #9407:
URL: https://github.com/apache/hudi/pull/9407





[jira] [Created] (HUDI-6682) Redistribute Azure CI test modules to reduce overall time for UT FT other module

2023-08-10 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-6682:
-

 Summary: Redistribute Azure CI test modules to reduce overall time 
for UT FT other module
 Key: HUDI-6682
 URL: https://issues.apache.org/jira/browse/HUDI-6682
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] codope opened a new pull request, #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours

2023-08-10 Thread via GitHub


codope opened a new pull request, #9423:
URL: https://github.com/apache/hudi/pull/9423

   ### Change Logs
   
   UT FT other modules are consistently taking more than 3 hours. HUDI-6682 tracks
better redistribution of tests.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] danny0405 commented on issue #9384: [SUPPORT] TransactionParticipant not getting created

2023-08-10 Thread via GitHub


danny0405 commented on issue #9384:
URL: https://github.com/apache/hudi/issues/9384#issuecomment-1674140682

   Not quite sure, but the jar you used seems to require TLS authentication.





[GitHub] [hudi] voonhous commented on issue #8843: Memory leak caused by hudi if got exception when constructing record reader

2023-08-10 Thread via GitHub


voonhous commented on issue #8843:
URL: https://github.com/apache/hudi/issues/8843#issuecomment-1674139125

   Refer to stack trace here:
   
   https://github.com/apache/hudi/pull/8839#issuecomment-1674138771





[hudi] branch master updated: [HUDI-6679] Fix initialization of metadata table partitions upon failure (#9419)

2023-08-10 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8ccd7da2936 [HUDI-6679] Fix initialization of metadata table 
partitions upon failure (#9419)
8ccd7da2936 is described below

commit 8ccd7da293620ee94fb08035c04ddc595651332f
Author: Y Ethan Guo 
AuthorDate: Thu Aug 10 19:17:07 2023 -0700

[HUDI-6679] Fix initialization of metadata table partitions upon failure 
(#9419)
---
 .../hudi/client/BaseHoodieTableServiceClient.java  |   8 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |   7 +-
 .../functional/TestHoodieBackedMetadata.java   | 123 -
 3 files changed, 128 insertions(+), 10 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index e55fb045e1e..7e78bddd875 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -57,7 +57,6 @@ import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieLogCompactException;
 import org.apache.hudi.exception.HoodieRollbackException;
-import org.apache.hudi.metadata.HoodieTableMetadata;
 import org.apache.hudi.metadata.HoodieTableMetadataWriter;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.action.HoodieWriteMetadata;
@@ -88,6 +87,7 @@ import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
+import static org.apache.hudi.metadata.HoodieTableMetadata.isMetadataTable;
 import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit;
 
 /**
@@ -932,8 +932,10 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
 LinkedHashMap> 
reverseSortedRollbackInstants = instantsToRollback.entrySet()
 .stream().sorted((i1, i2) -> i2.getKey().compareTo(i1.getKey()))
 .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, 
e2) -> e1, LinkedHashMap::new));
+boolean isMetadataTable = isMetadataTable(basePath);
 for (Map.Entry> entry : 
reverseSortedRollbackInstants.entrySet()) {
-  if (HoodieTimeline.compareTimestamps(entry.getKey(), 
HoodieTimeline.LESSER_THAN_OR_EQUALS,
+  if (!isMetadataTable
+  && HoodieTimeline.compareTimestamps(entry.getKey(), 
HoodieTimeline.LESSER_THAN_OR_EQUALS,
   HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS)) {
 // do we need to handle failed rollback of a bootstrap
 rollbackFailedBootstrap();
@@ -954,7 +956,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   // from the async indexer (`HoodieIndexer`).
   // TODO(HUDI-5733): This should be cleaned up once the proper fix of 
rollbacks in the
   //  metadata table is landed.
-  if 
(HoodieTableMetadata.isMetadataTable(metaClient.getBasePathV2().toString())) {
+  if (isMetadataTable(metaClient.getBasePathV2().toString())) {
 return 
inflightInstantsStream.map(HoodieInstant::getTimestamp).filter(entry -> {
   if (curInstantTime.isPresent()) {
 return !entry.equals(curInstantTime.get());
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 4f965e587cb..74d8ae16176 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -112,7 +112,6 @@ import static 
org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deseri
 import static 
org.apache.hudi.metadata.HoodieTableMetadata.METADATA_TABLE_NAME_SUFFIX;
 import static 
org.apache.hudi.metadata.HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP;
 import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.createRollbackTimestamp;
-import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.getInflightAndCompletedMetadataPartitions;
 import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.getInflightMetadataPartitions;
 
 /**
@@ -257,10 +256,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   // check if any of the enabl

[GitHub] [hudi] voonhous commented on pull request #8839: [HUDI-6287] Fix Memory Leak in RealtimeCompactedRecordReader

2023-08-10 Thread via GitHub


voonhous commented on PR #8839:
URL: https://github.com/apache/hudi/pull/8839#issuecomment-1674138771

   ```text
   2023-08-11T00:17:44.546+0800 WARN
20230810_161734_00541_uhtxz.1.104.0-48-1048 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory   I/O error constructing 
remote block reader.
   java.net.SocketException: Connection reset
   at 
java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
   at 
java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
   at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
   at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
   at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
   at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
   at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
   at 
java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
   at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
   at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:407)
   at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
   at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
   at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
   at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:649)
   at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:580)
   at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:762)
   at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:834)
   at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:686)
   at 
java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
   at 
java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
   at 
org.apache.parquet.io.DelegatingSeekableInputStream.read(DelegatingSeekableInputStream.java:61)
   at 
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
   at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:548)
   at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:528)
   at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:522)
   at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:470)
   at 
org.apache.hudi.common.table.TableSchemaResolver.readSchemaFromParquetBaseFile(TableSchemaResolver.java:349)
   at 
org.apache.hudi.common.table.TableSchemaResolver.readSchemaFromBaseFile(TableSchemaResolver.java:549)
   at 
org.apache.hudi.common.table.TableSchemaResolver.fetchSchemaFromFiles(TableSchemaResolver.java:541)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableParquetSchemaFromDataFile(TableSchemaResolver.java:266)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:119)
   at 
org.apache.hudi.common.table.TableSchemaResolver.hasOperationField(TableSchemaResolver.java:472)
   at org.apache.hudi.util.Lazy.get(Lazy.java:53)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableSchemaFromLatestCommitMetadata(TableSchemaResolver.java:223)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaInternal(TableSchemaResolver.java:191)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchema(TableSchemaResolver.java:140)
   at 
org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchema(TableSchemaResolver.java:129)
   at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:144)
   at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:96)
   at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:64)
   at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
   at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:47)
   at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:81)
   at 
io.trino.plugin.hudi.HudiRecordCursors.createRecordReader(HudiRecordCursors.java:109)
   at 
io.trino.plugin.hudi.HudiRecordCursors.la

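The failure mode in this stack trace — a record reader whose constructor throws after an underlying stream has already been opened — can be reduced to a minimal sketch. `LeakDemo` and `BlockReader` below are hypothetical stand-ins, not Hudi's actual classes; the fix pattern is simply to release whatever was opened before rethrowing:

```java
public class LeakDemo {
    static int openStreams = 0;

    // Stand-in for an underlying resource such as an HDFS block stream.
    static class BlockReader implements AutoCloseable {
        BlockReader() { openStreams++; }
        @Override public void close() { openStreams--; }
    }

    // Leaky pattern: the stream opened first is orphaned when later
    // initialization (e.g. schema resolution) throws.
    static BlockReader createLeaky(boolean failInit) {
        BlockReader reader = new BlockReader();
        if (failInit) {
            throw new RuntimeException("schema resolution failed");
        }
        return reader;
    }

    // Fixed pattern: close the partially-constructed reader before rethrowing.
    static BlockReader createSafe(boolean failInit) {
        BlockReader reader = new BlockReader();
        try {
            if (failInit) {
                throw new RuntimeException("schema resolution failed");
            }
            return reader;
        } catch (RuntimeException e) {
            reader.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        try { createLeaky(true); } catch (RuntimeException ignored) { }
        System.out.println("after leaky: " + openStreams);  // after leaky: 1
        openStreams = 0;
        try { createSafe(true); } catch (RuntimeException ignored) { }
        System.out.println("after safe: " + openStreams);   // after safe: 0
    }
}
```

Each failed construction on the leaky path orphans one open stream, which is how repeated exceptions accumulate into the leak reported here.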
[GitHub] [hudi] codope merged pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


codope merged PR #9419:
URL: https://github.com/apache/hudi/pull/9419





[GitHub] [hudi] codope commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


codope commented on PR #9419:
URL: https://github.com/apache/hudi/pull/9419#issuecomment-1674139030

   I am landing it to save CI cycles. There are no failures in UT FT other 
modules. It's just timing out.





[GitHub] [hudi] boneanxs commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-10 Thread via GitHub


boneanxs commented on code in PR #9412:
URL: https://github.com/apache/hudi/pull/9412#discussion_r1290833633


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala:
##
@@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase {
 }
   }
 
+  test("Test create table like") {
+if (HoodieSparkUtils.gteqSpark3_1) {
+  // 1. Test create table from an existing HUDI table
+  withTempDir { tmp =>

Review Comment:
   Spark2 will use Spark's own `CreateTableLikeCommand`; we can't throw an error
here since we can't distinguish whether the user wants to create a Hudi table or
not.
   
   ```scala
   * The syntax of using this command in SQL is(it doesn't support pass 
targetTable's provider):
* {{{
*   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
*   LIKE [other_db_name.]existing_table_name [locationSpec]
* }}}
   ```
   
   Spark2 is being deprecated; maybe only supporting Spark3+ is enough?






[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674135920

   
   ## CI report:
   
   * f43453d4e334097d34f4606137247d217fdd253c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19254)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674127768

   
   ## CI report:
   
   * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255)
 
   * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19256)
 
   
   
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo schema compatible with older versions

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #9415:
URL: https://github.com/apache/hudi/pull/9415#discussion_r1290822177


##
hudi-common/src/main/avro/HoodieMetadata.avsc:
##
@@ -369,7 +369,7 @@
"name": "HoodieRecordIndexInfo",
 "fields": [
 {
-"name": "partitionName",
+"name": "partition",
 "type": [

Review Comment:
   Can we write a compatibility test for this class?
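A compatibility test for this rename would essentially assert Avro's field-resolution rule (in real code one could use Avro's `SchemaCompatibility.checkReaderWriterCompatibility`). Below is a deliberately simplified, Avro-free sketch of that rule — `Field` and `canRead` are hypothetical helpers, and real Avro resolution also considers field defaults:

```java
import java.util.List;
import java.util.Set;

public class SchemaCompatSketch {
    // Minimal stand-in for an Avro record field: a name plus optional aliases.
    static class Field {
        final String name;
        final Set<String> aliases;
        Field(String name, Set<String> aliases) { this.name = name; this.aliases = aliases; }
    }

    // A reader field resolves against writer data if some writer field matches
    // it by name, or by one of the reader field's aliases.
    static boolean canRead(List<Field> readerFields, List<Field> writerFields) {
        for (Field rf : readerFields) {
            boolean matched = writerFields.stream().anyMatch(wf ->
                    wf.name.equals(rf.name) || rf.aliases.contains(wf.name));
            if (!matched) {
                return false;  // no writer field resolves to this reader field
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Field> oldWriter = List.of(new Field("partitionName", Set.of()));
        // A bare rename cannot resolve records written under the old field name...
        List<Field> renamed = List.of(new Field("partition", Set.of()));
        System.out.println(canRead(renamed, oldWriter));  // false
        // ...unless the old name is carried along as an alias.
        List<Field> aliased = List.of(new Field("partition", Set.of("partitionName")));
        System.out.println(canRead(aliased, oldWriter));  // true
    }
}
```

A real test would build the old and new `HoodieRecordIndexInfo` schemas and assert reader/writer compatibility in both directions.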






[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #9412:
URL: https://github.com/apache/hudi/pull/9412#discussion_r1290821705


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala:
##
@@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase {
 }
   }
 
+  test("Test create table like") {
+if (HoodieSparkUtils.gteqSpark3_1) {
+  // 1. Test create table from an existing HUDI table
+  withTempDir { tmp =>

Review Comment:
   So spark2 will throw exception ?






[GitHub] [hudi] danny0405 commented on a diff in pull request #8542: [HUDI-6123] Auto adjust lock configs only for single writer

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #8542:
URL: https://github.com/apache/hudi/pull/8542#discussion_r1290819723


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2483,8 +2483,15 @@ public boolean areReleaseResourceEnabled() {
   /**
* Returns whether the explicit guard of lock is required.
*/
-  public boolean needsLockGuard() {
-return isMetadataTableEnabled() || 
getWriteConcurrencyMode().supportsOptimisticConcurrencyControl();
+  public boolean isLockRequired() {
+return !isDefaultLockProvider() || 
getWriteConcurrencyMode().supportsOptimisticConcurrencyControl();

Review Comment:
   Yeah, I was expecting the user to set up optimistic concurrency control
explicitly.
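The predicate under discussion can be sketched standalone; `isLockRequired` below mirrors the logic in the diff above, with hypothetical stand-ins for the config accessors:

```java
public class LockGuardSketch {
    enum ConcurrencyMode {
        SINGLE_WRITER, OPTIMISTIC_CONCURRENCY_CONTROL;
        boolean supportsOptimisticConcurrencyControl() {
            return this == OPTIMISTIC_CONCURRENCY_CONTROL;
        }
    }

    // A lock guard is needed when the user configured a non-default (external)
    // lock provider explicitly, or opted into OCC for multi-writer.
    static boolean isLockRequired(boolean isDefaultLockProvider, ConcurrencyMode mode) {
        return !isDefaultLockProvider || mode.supportsOptimisticConcurrencyControl();
    }

    public static void main(String[] args) {
        // Single writer on the default provider: no lock guard needed.
        System.out.println(isLockRequired(true, ConcurrencyMode.SINGLE_WRITER));   // false
        // Explicit external lock provider: guard even for a single writer.
        System.out.println(isLockRequired(false, ConcurrencyMode.SINGLE_WRITER));  // true
        // OCC always requires the guard.
        System.out.println(isLockRequired(true, ConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL)); // true
    }
}
```

The notable behavior change versus `needsLockGuard` is that enabling the metadata table alone no longer forces the guard for a single writer.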






[GitHub] [hudi] danny0405 commented on a diff in pull request #8111: [HUDI-5887] Should not mark the concurrency mode as OCC by default when MDT is enabled

2023-08-10 Thread via GitHub


danny0405 commented on code in PR #8111:
URL: https://github.com/apache/hudi/pull/8111#discussion_r1290818989


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java:
##
@@ -140,8 +140,10 @@ public void 
testAutoConcurrencyConfigAdjustmentWithTableServices(HoodieTableType
 put(ASYNC_CLEAN.key(), "false");
 put(HoodieWriteConfig.AUTO_ADJUST_LOCK_CONFIGS.key(), "true");
   }
-}), true, true, true, 
WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL,
-HoodieFailedWritesCleaningPolicy.LAZY, inProcessLockProviderClassName);
+}), true, true, true,

Review Comment:
   If the metadata table is enabled, the lock should take effect, because the 
default lock provider class is `ZookeeperBasedLockProvider`, so at least the 
in-process lock should work.
   
   See `HoodieWriteConfig.isLockRequired`.









[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674105473

   
   ## CI report:
   
   * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255)
 
   * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa UNKNOWN
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9419:
URL: https://github.com/apache/hudi/pull/9419#issuecomment-1674072360

   
   ## CI report:
   
   * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19252)
 
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9421:
URL: https://github.com/apache/hudi/pull/9421#issuecomment-1674067153

   
   ## CI report:
   
   * 7cd01addabe76c50feb22f32c652a30be4902643 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19253)
 
   
   
   





[GitHub] [hudi] empcl closed pull request #9417: Database not found exception when resolving Spark synchronization hive

2023-08-10 Thread via GitHub


empcl closed pull request #9417: Database not found exception when resolving 
Spark synchronization hive
URL: https://github.com/apache/hudi/pull/9417





[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674038853

   
   ## CI report:
   
   * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255)
 
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674038750

   
   ## CI report:
   
   * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
 
   * f43453d4e334097d34f4606137247d217fdd253c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19254)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9422:
URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674033378

   
   ## CI report:
   
   * 4b7280a248b923a107a71d7a741b971f140731e4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9409:
URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674033270

   
   ## CI report:
   
   * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230)
 
   * f43453d4e334097d34f4606137247d217fdd253c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] jonvex commented on a diff in pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-10 Thread via GitHub


jonvex commented on code in PR #9409:
URL: https://github.com/apache/hudi/pull/9409#discussion_r1290738278


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -148,7 +148,7 @@ case class HoodieFileIndex(spark: SparkSession,
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: 
Seq[Expression]): Seq[PartitionDirectory] = {
 val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, 
partitionFilters).map {
   case (partitionOpt, fileSlices) =>
-if (shouldBroadcast) {
+if (shouldEmbedFileSlices) {

Review Comment:
   No it should not






[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6681:
--
Status: Patch Available  (was: In Progress)

> Ensure MOR Column Stats Index skips reading filegroups correctly
> 
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
>  Issue Type: Test
>  Components: metadata, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6681:
--
Status: In Progress  (was: Open)

> Ensure MOR Column Stats Index skips reading filegroups correctly
> 
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
>  Issue Type: Test
>  Components: metadata, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables





[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6681:
-
Labels: pull-request-available  (was: )

> Ensure MOR Column Stats Index skips reading filegroups correctly
> 
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
>  Issue Type: Test
>  Components: metadata, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables





[GitHub] [hudi] jonvex opened a new pull request, #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread via GitHub


jonvex opened a new pull request, #9422:
URL: https://github.com/apache/hudi/pull/9422

   ### Change Logs
   
   Create tests for MOR col stats index to ensure that filegroups are read as 
expected
   
   ### Impact
   
   Verification that the feature works
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly

2023-08-10 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6681:
-

 Summary: Ensure MOR Column Stats Index skips reading filegroups 
correctly
 Key: HUDI-6681
 URL: https://issues.apache.org/jira/browse/HUDI-6681
 Project: Apache Hudi
  Issue Type: Test
  Components: metadata, spark
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


Write tests to ensure Column Stats Index functions as expected for MOR tables





[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9408:
URL: https://github.com/apache/hudi/pull/9408#issuecomment-1673969724

   
   ## CI report:
   
   * 533117e9428e103df8d8d94dad393c1961df4152 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19251)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on pull request #8542: [HUDI-6123] Auto adjust lock configs only for single writer

2023-08-10 Thread via GitHub


yihua commented on PR #8542:
URL: https://github.com/apache/hudi/pull/8542#issuecomment-1673934722

   > For multiple streaming writers with no explicit lock provider set up, 
InProcessLockProvider should not be used.
   
   In this case, the user should explicitly set the lock provider as mentioned in 
the 
[docs](https://hudi.apache.org/docs/metadata#deployment-model-c-multi-writer).  
Auto config adjustment is not intended to solve this problem.
   
   Also, we need to update the docs.  This PR brings breaking changes to how 
configs work for the metadata table.
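
   For reference, an explicit multi-writer lock setup of the kind the docs describe looks roughly like the following; the ZooKeeper host, port, and paths are placeholders, not values from this thread:

   ```properties
   # Illustrative multi-writer concurrency settings (values are placeholders).
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.url=zk-host
   hoodie.write.lock.zookeeper.port=2181
   hoodie.write.lock.zookeeper.lock_key=my_table
   hoodie.write.lock.zookeeper.base_path=/hudi/locks
   ```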





[GitHub] [hudi] lokesh-lingarajan-0310 commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9421:
URL: https://github.com/apache/hudi/pull/9421#discussion_r1290689371


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##
@@ -217,7 +217,7 @@ public static Pair> 
filterAndGenerateChe
   row = collectedRows.select(queryInfo.getOrderColumn(), 
queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy(
   col(queryInfo.getOrderColumn()).desc(), 
col(queryInfo.getKeyColumn()).desc()).first();
 }
-LOG.info("Processed batch size: " + row.getLong(2) + " bytes");
+LOG.info("Processed batch size: " + 
row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes");

Review Comment:
   We hit a class cast exception in some cases where Spark inferred this field as 
a double.






[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9421:
URL: https://github.com/apache/hudi/pull/9421#issuecomment-1673926017

   
   ## CI report:
   
   * 7cd01addabe76c50feb22f32c652a30be4902643 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19253)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9419:
URL: https://github.com/apache/hudi/pull/9419#issuecomment-1673925975

   
   ## CI report:
   
   * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19252)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9419:
URL: https://github.com/apache/hudi/pull/9419#issuecomment-1673916580

   
   ## CI report:
   
   * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9421:
URL: https://github.com/apache/hudi/pull/9421#issuecomment-1673916696

   
   ## CI report:
   
   * 7cd01addabe76c50feb22f32c652a30be4902643 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on a diff in pull request #8111: [HUDI-5887] Should not mark the concurrency mode as OCC by default when MDT is enabled

2023-08-10 Thread via GitHub


yihua commented on code in PR #8111:
URL: https://github.com/apache/hudi/pull/8111#discussion_r1290679214


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java:
##
@@ -140,8 +140,10 @@ public void 
testAutoConcurrencyConfigAdjustmentWithTableServices(HoodieTableType
 put(ASYNC_CLEAN.key(), "false");
 put(HoodieWriteConfig.AUTO_ADJUST_LOCK_CONFIGS.key(), "true");
   }
-}), true, true, true, 
WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL,
-HoodieFailedWritesCleaningPolicy.LAZY, inProcessLockProviderClassName);
+}), true, true, true,

Review Comment:
   We should revert the changes in this PR to the degree that the auto-adjustment 
of the lock configs still works for a single writer with async table services.  
Right now, auto-adjustment of the lock configs does not work for Deltastreamer 
with async table services when the metadata table is enabled.






[GitHub] [hudi] hudi-bot commented on pull request #9417: Database not found exception when resolving Spark synchronization hive

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9417:
URL: https://github.com/apache/hudi/pull/9417#issuecomment-1673905253

   
   ## CI report:
   
   * 331d018c7d8b69232742aaee3a16062f692226ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19249)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6680) Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit

2023-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6680:
-
Labels: pull-request-available  (was: )

> Fixing the info log to fetch column value by name instead of index in 
> function filterAndGenerateCheckpointBasedOnSourceLimit
> 
>
> Key: HUDI-6680
> URL: https://issues.apache.org/jira/browse/HUDI-6680
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lokesh Lingarajan
>Priority: Major
>  Labels: pull-request-available
>
> Sometimes the Spark inference engine identifies the cumulative column as type 
> double, which causes a class cast exception when trying to fetch it as a Long.
> Reference - 
> https://github.com/apache/hudi/blob/dcf466fa48c2d54e490255bcb27f58adba7c1583/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L220





[GitHub] [hudi] yihua commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


yihua commented on code in PR #9421:
URL: https://github.com/apache/hudi/pull/9421#discussion_r1290663391


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##
@@ -217,7 +217,7 @@ public static Pair> 
filterAndGenerateChe
   row = collectedRows.select(queryInfo.getOrderColumn(), 
queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy(
   col(queryInfo.getOrderColumn()).desc(), 
col(queryInfo.getKeyColumn()).desc()).first();
 }
-LOG.info("Processed batch size: " + row.getLong(2) + " bytes");
+LOG.info("Processed batch size: " + 
row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes");

Review Comment:
   I think the logic is correct before.  Just that we should not hard code the 
column position.
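
   The distinction under discussion — positional access with a hard-coded type (`row.getLong(2)`) versus name-based, type-agnostic access — can be sketched with a plain-Python analogy; the column names and values below are hypothetical, and the helpers only mimic the Spark `Row` behavior:

   ```python
   # Plain-Python analogy of a Spark Row whose schema was inferred at runtime,
   # so the cumulative column may come back as a float (Spark: DoubleType).
   CUMULATIVE_COLUMN_NAME = "cumulativeSize"

   row_fields = ["orderCol", "keyCol", CUMULATIVE_COLUMN_NAME]  # hypothetical schema
   row_values = [20230810, "key-1", 1234.0]                     # inferred as double

   def get_long(values, i):
       """Mimics Row.getLong(i): fails unless the value really is an integer."""
       v = values[i]
       if not isinstance(v, int):
           raise TypeError(f"column {i} is {type(v).__name__}, not long")
       return v

   def get_by_name(values, fields, name):
       """Mimics row.get(row.fieldIndex(name)): no assumption about position or type."""
       return values[fields.index(name)]

   try:
       get_long(row_values, 2)  # the old hard-coded access path
   except TypeError as e:
       print("positional typed access fails:", e)

   # The name-based access path works regardless of the inferred numeric type.
   print("Processed batch size:", get_by_name(row_values, row_fields, CUMULATIVE_COLUMN_NAME), "bytes")
   ```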






[jira] [Created] (HUDI-6680) Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit

2023-08-10 Thread Lokesh Lingarajan (Jira)
Lokesh Lingarajan created HUDI-6680:
---

 Summary: Fixing the info log to fetch column value by name instead 
of index in function filterAndGenerateCheckpointBasedOnSourceLimit
 Key: HUDI-6680
 URL: https://issues.apache.org/jira/browse/HUDI-6680
 Project: Apache Hudi
  Issue Type: Task
Reporter: Lokesh Lingarajan


Sometimes the Spark inference engine identifies the cumulative column as type double, 
which causes a class cast exception when trying to fetch it as a Long.

Reference - 
https://github.com/apache/hudi/blob/dcf466fa48c2d54e490255bcb27f58adba7c1583/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L220





[GitHub] [hudi] lokesh-lingarajan-0310 opened a new pull request, #9421: [9420] - Fixing the info log to fetch column value by name instead of index

2023-08-10 Thread via GitHub


lokesh-lingarajan-0310 opened a new pull request, #9421:
URL: https://github.com/apache/hudi/pull/9421

   ### Change Logs
   
   Fixing the log statement to fetch column value by name instead of index
   
   ### Impact
   
   low
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] CI passed
   





[GitHub] [hudi] lokesh-lingarajan-0310 opened a new issue, #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLim

2023-08-10 Thread via GitHub


lokesh-lingarajan-0310 opened a new issue, #9420:
URL: https://github.com/apache/hudi/issues/9420

   Sometimes the Spark inference engine identifies the cumulative column as type 
double, which causes a class cast exception when trying to fetch it as a Long.
   
   





[GitHub] [hudi] yihua commented on a diff in pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices

2023-08-10 Thread via GitHub


yihua commented on code in PR #9409:
URL: https://github.com/apache/hudi/pull/9409#discussion_r1290641231


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -148,7 +148,7 @@ case class HoodieFileIndex(spark: SparkSession,
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: 
Seq[Expression]): Seq[PartitionDirectory] = {
 val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, 
partitionFilters).map {
   case (partitionOpt, fileSlices) =>
-if (shouldBroadcast) {
+if (shouldEmbedFileSlices) {

Review Comment:
   A side question: can `shouldEmbedFileSlices` be `true` for legacy file 
format as well?






[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6679:
-
Labels: pull-request-available  (was: )

> Fix initialization of metadata table partitions upon failure
> 
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When both files and record_index partitions are enabled, for the first commit 
> in the data table, the transaction fails when initializing the second 
> partition in the MDT.  In this case, the timelines look like below.  In this 
> case, when restarting the pipeline, the rollback triggers irrelevant 
> bootstrap rollback logic causing MDT to be corrupted, not properly 
> re-initializing the record_index partition.
> DT
> {code:java}
> .commit.requested
> .commit.inflight {code}
> MDT
> {code:java}
> 00010.deltacommit.requested
> 00010.deltacommit.inflight
> 00010.deltacommit
> 00011.deltacommit.requested
> 00011.deltacommit.inflight{code}
> Afterwards
> {code:java}
> No. | Instant              | Action ('MT' = metadata table)                 | State     | Requested      | Inflight       | Completed
> 0   | 20230807063905364    | rollback (rolls back 20230807063647472)        | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38
> 1   | 20230807063905364010 | MT deltacommit                                 | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
> 2   | 20230807064006967    | deltacommit (rolled back by 20230807064227290) | REQUESTED | 08-06 23:40:39 | -              | -
> 3   | 20230807064041714    | MT restore                                     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
> 4   | 20230807064227290    | rollback                                       | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | …

[GitHub] [hudi] yihua opened a new pull request, #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure

2023-08-10 Thread via GitHub


yihua opened a new pull request, #9419:
URL: https://github.com/apache/hudi/pull/9419

   ### Change Logs
   
   This PR fixes initialization of metadata table partitions upon failure:
   - In `BaseHoodieTableServiceClient.rollbackFailedWrites`, the fix avoids 
bootstrap rollback logic for MDT as MDT is never a bootstrap table and such 
logic can be accidentally triggered, since the MDT initial commits, e.g., 
`00010`, `00011`, are smaller than 
`FULL_BOOTSTRAP_INSTANT_TS` (`02`).
   - In `HoodieBackedTableMetadataWriter.initializeIfNeeded`, when async 
metadata indexing is disabled, if a partition is inflight, it means that the 
partition is not fully initialized, so the initialization should be triggered 
again.
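   
   The instant-ordering quirk behind the first fix can be illustrated with plain string comparisons; the instant values are the truncated forms quoted above and are used purely for illustration:
   
   ```python
   # Hudi orders instants lexicographically as strings.  MDT initialization
   # commits carry small timestamps with a suffix; even in the truncated forms
   # shown above they sort below the bootstrap sentinel, so a generic
   # "before the full-bootstrap instant?" check wrongly matches them.
   mdt_init_instants = ["00010", "00011"]  # truncated MDT init commit times
   FULL_BOOTSTRAP_INSTANT_TS = "02"        # truncated bootstrap sentinel

   for ts in mdt_init_instants:
       # This comparison is what routes MDT commits into the bootstrap-rollback
       # path; the fix skips that path for the MDT entirely.
       assert ts < FULL_BOOTSTRAP_INSTANT_TS

   # An ordinary data-table instant (a wall-clock timestamp) does not match.
   dt_instant = "20230807063905364"
   assert not (dt_instant < FULL_BOOTSTRAP_INSTANT_TS)
   print("truncated MDT init instants sort below the bootstrap sentinel")
   ```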
   
   This scenario fails before the fix: When both files and record_index 
partitions are enabled, for the first commit in the data table, the transaction 
fails when initializing the second partition in the MDT.  In this case, the 
timelines look like below.  In this case, when restarting the pipeline, the 
rollback triggers irrelevant bootstrap rollback logic causing MDT to be 
corrupted, not properly re-initializing the record_index partition.
   
   DT
   ```
   .commit.requested
   .commit.inflight
   ```
   MDT
   ```
   00010.deltacommit.requested
   00010.deltacommit.inflight
   00010.deltacommit
   00011.deltacommit.requested
   00011.deltacommit.inflight
   ```
   Afterwards, `00010` is rolled back and the bootstrap rollback logic 
adding a restore kicks in, which is unexpected.
   ```
   
No. | Instant              | Action ('MT' = metadata table)                 | State     | Requested      | Inflight       | Completed
0   | 20230807063905364    | rollback (rolls back 20230807063647472)        | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38
1   | 20230807063905364010 | MT deltacommit                                 | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
2   | 20230807064006967    | deltacommit (rolled back by 20230807064227290) | REQUESTED | 08-06 23:40:39 | -              | -
3   | 20230807064041714    | MT restore                                     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
4   | 20230807064227290    | rollback                                       | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | …

[GitHub] [hudi] hudi-bot commented on pull request #9403: [MINOR] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-10 Thread via GitHub


hudi-bot commented on PR #9403:
URL: https://github.com/apache/hudi/pull/9403#issuecomment-1673822139

   
   ## CI report:
   
   * b5846de9f43070cf38acd5bd90ae990cad1c2999 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19247)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6679:

Description: 
When both files and record_index partitions are enabled, for the first commit 
in the data table, the transaction fails when initializing the second partition 
in the MDT.  In this case, the timelines look like below.  In this case, when 
restarting the pipeline, the rollback triggers irrelevant bootstrap rollback 
logic causing MDT to be corrupted, not properly re-initializing the 
record_index partition.

DT
{code:java}
.commit.requested
.commit.inflight {code}
MDT
{code:java}
00010.deltacommit.requested
00010.deltacommit.inflight
00010.deltacommit
00011.deltacommit.requested
00011.deltacommit.inflight{code}
Afterwards
{code:java}
No. | Instant              | Action          | State     | Requested Time | Inflight Time  | Completed Time | MT Action   | MT State  | MT Requested   | MT Inflight    | MT Completed
----|----------------------|-----------------|-----------|----------------|----------------|----------------|-------------|-----------|----------------|----------------|---------------
0   | 20230807063905364    | rollback [1]    | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38 | -           | -         | -              | -              | -
1   | 20230807063905364010 | -               | -         | -              | -              | -              | deltacommit | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
2   | 20230807064006967    | deltacommit [2] | REQUESTED | 08-06 23:40:39 | -              | -              | -           | -         | -              | -              | -
3   | 20230807064041714    | -               | -         | -              | -              | -              | restore     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
4   | 20230807064227290    | rollback [3]    | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | -              | -           | -         | -              | -              | -

[1] Rolls back 20230807063647472
[2] Rolled back by 20230807064227290
[3] Rolls back 20230807064006967
{code}
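Why the rollback itself fails can be sketched with a minimal model (Python, purely illustrative — `MdtPartition` and `prep_records` below are stand-ins, not Hudi's actual classes): the restart-time rollback tries to write rollback metadata into the record_index partition, whose file groups were never created because deltacommit 00011 never completed, so the checkArgument guard seen in the stack trace throws.

```python
class MdtPartition:
    """Illustrative stand-in for a metadata table (MDT) partition."""
    def __init__(self, name, file_group_count):
        self.name = name
        self.file_group_count = file_group_count

def prep_records(partition):
    # Mirrors the ValidationUtils.checkArgument guard from the stack trace:
    # a partition whose file groups were never initialized fails here.
    if partition.file_group_count <= 0:
        raise ValueError(
            f"FileGroup count for MDT partition {partition.name} should be >0")
    return f"prepared records for {partition.name}"

# files partition: initialized by the completed deltacommit 00010
files_partition = MdtPartition("files", 1)
print(prep_records(files_partition))

# record_index: deltacommit 00011 never completed, so no file groups exist;
# the restart-time rollback hits this guard and the rollback itself fails.
record_index = MdtPartition("record_index", 0)
try:
    prep_records(record_index)
except ValueError as e:
    print("rollback fails:", e)
```

This matches the description above: the half-initialized record_index partition would need to be re-initialized on restart rather than routed through the bootstrap rollback path.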
 
{code:java}
org.apache.hudi.exception.HoodieR

[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6679:

Description: 
When both the files and record_index partitions are enabled and the transaction fails 
while initializing the second MDT partition during the first commit in the data table, 
the timelines look like below. When the pipeline is restarted, the rollback incorrectly 
triggers the bootstrap rollback logic, corrupting the MDT.

DT
{code:java}
.commit.requested
.commit.inflight {code}
MDT
{code:java}
00010.deltacommit.requested
00010.deltacommit.inflight
00010.deltacommit
00011.deltacommit.requested
00011.deltacommit.inflight{code}
{code:java}
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
s3a:///hoodie_table commits 20230807064006967
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
    at 
org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
    at 
org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
    at 
org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
    at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
    at 
org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
    at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT 
partition files should be >0
    at 
org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
    at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
    at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
    at 
org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
    at 
org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
    at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
    ... 16 more {code}

  was:
 

 
{code:java}
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
s3a:///hoodie_table commits 20230807064006967
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.r

[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6679:

Priority: Blocker  (was: Major)

> Fix initialization of metadata table partitions upon failure
> 
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
>
>  
>  
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> s3a:///hoodie_table commits 20230807064006967
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
>     at 
> org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
>     at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT 
> partition files should be >0
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
>     at 
> org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
>     ... 16 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6679:

Fix Version/s: 0.14.0

> Fix initialization of metadata table partitions upon failure
> 
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
>  
>  
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> s3a:///hoodie_table commits 20230807064006967
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
>     at 
> org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
>     at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT 
> partition files should be >0
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
>     at 
> org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
>     ... 16 more {code}





[jira] [Assigned] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6679:
---

Assignee: Ethan Guo

> Fix initialization of metadata table partitions upon failure
> 
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
>  
>  
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> s3a:///hoodie_table commits 20230807064006967
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
>     at 
> org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
>     at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
>     at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
>     at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT 
> partition files should be >0
>     at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
>     at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
>     at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
>     at 
> org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
>     ... 16 more {code}





[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6679:

Description: 
 

 
{code:java}
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
s3a:///hoodie_table commits 20230807064006967
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
    at 
org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
    at 
org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
    at 
org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
    at 
org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
    at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
    at 
org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
    at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT 
partition files should be >0
    at 
org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
    at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
    at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
    at 
org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
    at 
org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
    at 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
    at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
    at 
org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
    ... 16 more {code}

> Fix initialization of metadata table partitions upon failure
> 
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>
>  
>  
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> s3a:///hoodie_table commits 20230807064006967
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
>     at 
> org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
>     at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
>     at 
> org.apache.hudi.common.util.CleanerUt

[jira] [Created] (HUDI-6679) Fix initialization of metadata table partitions upon failure

2023-08-10 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6679:
---

 Summary: Fix initialization of metadata table partitions upon 
failure
 Key: HUDI-6679
 URL: https://issues.apache.org/jira/browse/HUDI-6679
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo







