Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1570022625


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   Kind of think people would be confused by the two `listFiles` and `listDirectory`: the former lists the given path and returns all the file paths contained underneath it, while the latter lists the **directory**, which means the given path must be a directory (do we have validation for that?). The naming convention is different when comparing these two.
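   For illustration, a minimal sketch of the two listing semantics using `java.nio` stand-ins (not Hudi's actual classes):
   ```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in illustration: "listDirectory" returns the direct children (files and
// subdirectories) of a single directory, while "listFiles" returns regular files
// at any depth. The validation question above asks whether the former should
// reject a non-directory input.
public class ListingSemanticsDemo {
  static List<Path> listDirectory(Path dir) throws IOException {
    try (Stream<Path> s = Files.list(dir)) { // non-recursive; throws if dir is not a directory
      return s.collect(Collectors.toList());
    }
  }

  static List<Path> listFiles(Path dir) throws IOException {
    try (Stream<Path> s = Files.walk(dir)) { // recursive
      return s.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("listing-demo");
    Files.createFile(dir.resolve("a.txt"));
    Path sub = Files.createDirectory(dir.resolve("sub"));
    Files.createFile(sub.resolve("b.txt"));
    Files.createFile(sub.resolve("c.txt"));
    System.out.println(listDirectory(dir).size()); // 2 entries: a.txt and sub
    System.out.println(listFiles(dir).size());     // 3 files: a.txt, sub/b.txt, sub/c.txt
  }
}
   ```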



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11050:
URL: https://github.com/apache/hudi/pull/11050#discussion_r1570007033


##
hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java:
##
@@ -109,6 +109,11 @@ public long getModificationTime() {
     return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+    return this.getPath().compareTo(o.getPath());

Review Comment:
   There is no need for a null check; the path should always exist.
   
   We shouldn't have a case of multiple path info instances with the same path in the same file listing call, so this should be OK. And this is only used in Hudi CLI for sorting purposes.
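   A minimal sketch of what the natural ordering buys (simplified stand-in type, not the actual `StoragePathInfo`):
   ```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for StoragePathInfo: natural ordering delegates to the
// path string, which is all a CLI needs for stable, sorted listings.
class PathInfo implements Comparable<PathInfo> {
  final String path;
  PathInfo(String path) { this.path = path; }
  @Override public int compareTo(PathInfo o) {
    return this.path.compareTo(o.path); // no null check: path is a required field
  }
  @Override public String toString() { return path; }
}

public class SortListingDemo {
  public static void main(String[] args) {
    List<PathInfo> listing = new ArrayList<>();
    listing.add(new PathInfo("s3://bucket/table/p2/f1.parquet"));
    listing.add(new PathInfo("s3://bucket/table/p1/f2.parquet"));
    Collections.sort(listing); // uses compareTo
    System.out.println(listing); // [.../p1/f2.parquet, .../p2/f1.parquet]
  }
}
   ```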



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1570003488


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   The renaming is based on the suggestion: https://github.com/apache/hudi/pull/10591#discussion_r1484908969.  I think we don't have to conform to the naming in Hadoop, and I feel `listDirectory` is a better name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1569988592


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -651,6 +652,36 @@ public static Stream<HoodieRecord> createColumnStatsRecords(String partitionName
     });
   }
 
+  public static Stream<HoodieRecord> createPartitionStatsRecords(String partitionPath,
+                                                                 Collection<HoodieColumnRangeMetadata<Comparable>> columnRangeMetadataList,
+                                                                 boolean isDeleted) {
+    return columnRangeMetadataList.stream().map(columnRangeMetadata -> {
+      HoodieKey key = new HoodieKey(getPartitionStatsIndexKey(partitionPath, columnRangeMetadata),
+          MetadataPartitionType.PARTITION_STATS.getPartitionPath());
+
+      HoodieMetadataPayload payload = new HoodieMetadataPayload(key.getRecordKey(),
+          HoodieMetadataColumnStats.newBuilder()
+              .setFileName(null)
+              .setColumnName(columnRangeMetadata.getColumnName())
+              .setMinValue(wrapValueIntoAvro(columnRangeMetadata.getMinValue()))
+              .setMaxValue(wrapValueIntoAvro(columnRangeMetadata.getMaxValue()))
+              .setNullCount(columnRangeMetadata.getNullCount())
+              .setValueCount(columnRangeMetadata.getValueCount())
+              .setTotalSize(columnRangeMetadata.getTotalSize())
+              .setTotalUncompressedSize(columnRangeMetadata.getTotalUncompressedSize())
+              .setIsDeleted(isDeleted)
+              .build());
+
+      return new HoodieAvroRecord<>(key, payload);
+    });
+  }
+
+  public static String getPartitionStatsIndexKey(String partitionPath, HoodieColumnRangeMetadata columnRangeMetadata) {

Review Comment:
   nit: have the column name as the method argument directly to avoid confusion?
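   A hypothetical sketch of the suggested signature (the key layout below is illustrative, not Hudi's actual encoding):
   ```java
// Hypothetical: take the column name directly instead of the whole
// HoodieColumnRangeMetadata object, so callers see exactly what the key uses.
public class PartitionStatsKeyDemo {
  static String getPartitionStatsIndexKey(String partitionPath, String columnName) {
    return columnName + ";" + partitionPath; // illustrative key layout only
  }

  public static void main(String[] args) {
    System.out.println(getPartitionStatsIndexKey("2024/04/17", "price"));
  }
}
   ```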



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -351,6 +356,12 @@ case class HoodieFileIndex(spark: SparkSession,
       Option.empty
     } else if (recordKeys.nonEmpty) {
       Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
+    } else if (partitionStatsIndex.isIndexAvailable && queryFilters.nonEmpty) {
+      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+      val shouldReadInMemory = partitionStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
+      partitionStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory) { transposedColStatsDF =>

Review Comment:
   Could we avoid transposing here as it has overhead and directly leverage the 
index entry for filtering?
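   For context, a minimal sketch of min/max-based pruning, the kind of filtering the index entries would drive directly (simplified types, not Hudi's implementation):
   ```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified data skipping: a candidate can be pruned when its [min, max]
// range cannot possibly satisfy the filter (here: value >= threshold).
public class DataSkippingDemo {
  static class Stats {
    final String name;
    final long min;
    final long max;
    Stats(String name, long min, long max) { this.name = name; this.min = min; this.max = max; }
  }

  static List<String> candidates(List<Stats> stats, long threshold) {
    List<String> out = new ArrayList<>();
    for (Stats s : stats) {
      if (s.max >= threshold) { // keep only if at least one row could match
        out.add(s.name);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Stats> stats = Arrays.asList(new Stats("f1", 0, 5), new Stats("f2", 10, 20));
    System.out.println(candidates(stats, 8)); // [f2]
  }
}
   ```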



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-17 Thread via GitHub


liangchen-datanerd commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2063009955

   @ad1happy2go 
   Based on the IoT scenario on which I've been working, the event time would be adopted as the partition column. At the same time, we would query data based on the original event-time timestamp, not the transformed partition path. Implementing this feature would be a great help. Should I close this issue or leave it open?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1569972641


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -224,6 +224,10 @@ private void enablePartitions() {
     if (dataMetaClient.getFunctionalIndexMetadata().isPresent()) {
       this.enabledPartitionTypes.add(FUNCTIONAL_INDEX);
     }
+    if ((metadataConfig.isPartitionStatsIndexEnabled() && !metadataConfig.getColumnsEnabledForColumnStatsIndex().isEmpty())
+        || dataMetaClient.getTableConfig().isMetadataPartitionAvailable(PARTITION_STATS)) {
+      this.enabledPartitionTypes.add(PARTITION_STATS);

Review Comment:
   Can this part be generalized too, i.e., relying on the defined partition 
type enums to automatically check the corresponding write config and enable the 
index, instead of modifying the method manually?
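   A hypothetical sketch of that generalization (names illustrative, not Hudi code): each partition type carries its own config check, so the method can loop instead of being edited per index.
   ```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical: the enum constant owns its "enabled by config" predicate,
// so enablePartitions() would iterate rather than hard-code each case.
public class EnablePartitionsDemo {
  static class Config {
    boolean columnStatsEnabled = true;
    boolean partitionStatsEnabled = false;
  }

  enum PartitionType {
    FILES(cfg -> true), // always on
    COLUMN_STATS(cfg -> cfg.columnStatsEnabled),
    PARTITION_STATS(cfg -> cfg.partitionStatsEnabled);

    final Predicate<Config> enabledBy;
    PartitionType(Predicate<Config> enabledBy) { this.enabledBy = enabledBy; }
  }

  static List<PartitionType> enabledTypes(Config cfg) {
    List<PartitionType> out = new ArrayList<>();
    for (PartitionType t : EnumSet.allOf(PartitionType.class)) {
      if (t.enabledBy.test(cfg)) {
        out.add(t); // the config check lives with the enum constant, not the caller
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(enabledTypes(new Config())); // [FILES, COLUMN_STATS]
  }
}
   ```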



##
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> getColumnRangeInPartition(@Nonnull List<HoodieColumnRangeMetadata<T>> fileRanges) {
+    if (fileRanges.size() == 1) {
+      // Only one parquet file, we can just return that range.
+      return fileRanges.get(0);
+    }
+    // There are multiple files. Compute min(file_mins) and max(file_maxs)
+    return fileRanges.stream()
+        .sequential()
+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static  <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,

Review Comment:
   ```suggestion
  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
   ```
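   A self-contained sketch of the min/max merge being reduced over above (simplified types, not the actual Hudi classes, which also carry null/value counts and sizes that would merge by addition rather than min/max):
   ```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for HoodieColumnRangeMetadata: just a min and a max.
class Range<T extends Comparable<T>> {
  final T min;
  final T max;
  Range(T min, T max) { this.min = min; this.max = max; }
}

public class MergeRangesDemo {
  static <T extends Comparable<T>> Range<T> merge(Range<T> a, Range<T> b) {
    T min = a.min.compareTo(b.min) <= 0 ? a.min : b.min; // min(file_mins)
    T max = a.max.compareTo(b.max) >= 0 ? a.max : b.max; // max(file_maxs)
    return new Range<>(min, max);
  }

  public static void main(String[] args) {
    List<Range<Integer>> fileRanges = Arrays.asList(new Range<>(3, 10), new Range<>(1, 7));
    Range<Integer> partitionRange = fileRanges.stream().reduce(MergeRangesDemo::merge).get();
    System.out.println(partitionRange.min + ".." + partitionRange.max); // 1..10
  }
}
   ```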



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1580,69 +1595,4 @@ public boolean isInitialized() {
   }
 
   protected abstract BaseHoodieWriteClient initializeWriteClient();
-
-  /**
-   * A class which represents a directory and the files and directories inside it.
-   * <p>
-   * A {@code PartitionFileInfo} object saves the name of the partition and various properties requires of each file
-   * required for initializing the metadata table. Saving limited properties reduces the total memory footprint when
-   * a very large number of files are present in the dataset being initialized.
-   */

Review Comment:
   So this is no longer needed?



##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java:
##
@@ -557,7 +558,8 @@ public void testDowngradeSixToFiveShouldDeleteRecordIndexPartition() throws Exce
         PARTITION_NAME_COLUMN_STATS,
         PARTITION_NAME_BLOOM_FILTERS,
         PARTITION_NAME_RECORD_INDEX,
-        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX
+        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX,

Review Comment:
   Should this list be generated from the production code, i.e., the list of supported MDT partitions?  Also, do we need to upgrade the table version?  I think the master branch is still on table version 6, the same as the 0.14.0 release.



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##
@@ -104,13 +104,14 @@ public class HoodieMetadataPayload implements HoodieRecordPayload<HoodieMetadataPayload> {
    * ... has a bug - https://issues.apache.org/jira/browse/AVRO-1810
    */
-  protected static final int METADATA_TYPE_PARTITION_LIST = 1;
-  protected static final int METADATA_TYPE_FILE_LIST = 2;
-  protected static final int METADATA_TYPE_COLUMN_STATS = 3;
-  protected static final int METADATA_TYPE_BLOOM_FILTER = 4;
+  private static final int METADATA_TYPE_PARTITION_LIST = 1;
+  private static final int METADATA_TYPE_FILE_LIST = 2;
+  private static final int METADATA_TYPE_COLUMN_STATS = 3;
+  private static final int METADATA_TYPE_BLOOM_FILTER = 4;
   private static final int METADATA_TYPE_RECORD_INDEX = 5;
+  private static final int METADATA_TYPE_PARTITION_STATS = 6;

Review Comment:
   Maybe we should add an enum, guarantee the ordering, and automatically assign the type ID?
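   A hypothetical sketch of that idea (illustrative only, not Hudi code):
   ```java
// Hypothetical: derive the metadata record type ID from an enum so the
// ordering is centralized rather than hand-numbered constants.
public class MetadataTypeDemo {
  enum MetadataRecordType {
    PARTITION_LIST, FILE_LIST, COLUMN_STATS, BLOOM_FILTER, RECORD_INDEX, PARTITION_STATS;

    // IDs are 1-based to match the existing constants; relies on declaration order.
    int id() { return ordinal() + 1; }
  }

  public static void main(String[] args) {
    System.out.println(MetadataRecordType.PARTITION_STATS.id()); // 6
  }
}
   ```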



##
hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java:
##
@@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) {
     throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet.");
   }
 
+  /**
+   * Aggregate column range statistics across files in a partition.
+   *
+   * @param fileRanges List of column range statistics for each file in a partition
+   */
+  public 

Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11032:
URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062998871

   
   ## CI report:
   
   * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23339)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (999007b39d5 -> b3773657f69)

2024-04-17 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 999007b39d5 [HUDI-7636] Make StoragePath Serializable (#11049)
 add b3773657f69 [HUDI-7635] Add default block size and openSeekable APIs 
to HoodieStorage (#11048)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/common/fs/FSUtils.java| 19 -
 .../hudi/common/table/log/HoodieLogFileReader.java | 75 +-
 .../org/apache/hudi/hadoop/fs/HadoopFSUtils.java   | 90 ++
 .../hudi/storage/hadoop/HoodieHadoopStorage.java   | 13 
 .../org/apache/hudi/storage/HoodieStorage.java | 30 
 .../hudi/io/storage/TestHoodieStorageBase.java | 43 +++
 6 files changed, 179 insertions(+), 91 deletions(-)



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


codope merged PR #11048:
URL: https://github.com/apache/hudi/pull/11048


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]

2024-04-17 Thread via GitHub


codope commented on code in PR #11050:
URL: https://github.com/apache/hudi/pull/11050#discussion_r1569960431


##
hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java:
##
@@ -109,6 +109,11 @@ public long getModificationTime() {
     return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+    return this.getPath().compareTo(o.getPath());

Review Comment:
   Practically, would it ever hit a scenario where this.getPath() == o.getPath()? If so, do we need to break ties? If not, should we add a validation and throw IllegalStateException?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]

2024-04-17 Thread via GitHub


codope commented on code in PR #11050:
URL: https://github.com/apache/hudi/pull/11050#discussion_r1569961073


##
hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java:
##
@@ -109,6 +109,11 @@ public long getModificationTime() {
     return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+    return this.getPath().compareTo(o.getPath());

Review Comment:
   Also, is there a need for a null check?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (df5cdb7f290 -> 999007b39d5)

2024-04-17 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from df5cdb7f290 [MINOR] Remove redundant TestStringUtils in hudi-common 
(#11046)
 add 999007b39d5 [HUDI-7636] Make StoragePath Serializable (#11049)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/storage/StoragePath.java  | 14 +--
 .../apache/hudi/io/storage/TestStoragePath.java| 28 +-
 2 files changed, 39 insertions(+), 3 deletions(-)



Re: [PR] [HUDI-7636] Make StoragePath Serializable [hudi]

2024-04-17 Thread via GitHub


codope merged PR #11049:
URL: https://github.com/apache/hudi/pull/11049


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [MINOR] Remove redundant TestStringUtils in hudi-common (#11046)

2024-04-17 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new df5cdb7f290 [MINOR] Remove redundant TestStringUtils in hudi-common 
(#11046)
df5cdb7f290 is described below

commit df5cdb7f290835c7d3c72debea39a17cb5d66e2c
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:34:06 2024 -0700

[MINOR] Remove redundant TestStringUtils in hudi-common (#11046)
---
 .../apache/hudi/common/util/TestStringUtils.java   | 124 -
 1 file changed, 124 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
deleted file mode 100644
index 54985056bf0..000
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.common.util;
-
-import org.junit.jupiter.api.Test;
-
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-
-import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertNotEquals;
-import static org.junit.jupiter.api.Assertions.assertNull;
-import static org.junit.jupiter.api.Assertions.assertTrue;
-
-/**
- * Tests {@link StringUtils}.
- */
-public class TestStringUtils {
-
-  private static final String[] STRINGS = {"This", "is", "a", "test"};
-
-  @Test
-  public void testStringJoinWithDelim() {
-    String joinedString = StringUtils.joinUsingDelim("-", STRINGS);
-    assertEquals(STRINGS.length, joinedString.split("-").length);
-  }
-
-  @Test
-  public void testStringJoin() {
-    assertNotEquals(null, StringUtils.join(""));
-    assertNotEquals(null, StringUtils.join(STRINGS));
-  }
-
-  @Test
-  public void testStringJoinWithJavaImpl() {
-    assertNull(StringUtils.join(",", null));
-    assertEquals("", String.join(",", Collections.singletonList("")));
-    assertEquals(",", String.join(",", Arrays.asList("", "")));
-    assertEquals("a,", String.join(",", Arrays.asList("a", "")));
-  }
-
-  @Test
-  public void testStringNullToEmpty() {
-    String str = "This is a test";
-    assertEquals(str, StringUtils.nullToEmpty(str));
-    assertEquals("", StringUtils.nullToEmpty(null));
-  }
-
-  @Test
-  public void testStringObjToString() {
-    assertNull(StringUtils.objToString(null));
-    assertEquals("Test String", StringUtils.objToString("Test String"));
-
-    // assert byte buffer
-    ByteBuffer byteBuffer1 = ByteBuffer.wrap(getUTF8Bytes("1234"));
-    ByteBuffer byteBuffer2 = ByteBuffer.wrap(getUTF8Bytes("5678"));
-    // assert equal because ByteBuffer has overwritten the toString to return a summary string
-    assertEquals(byteBuffer1.toString(), byteBuffer2.toString());
-    // assert not equal
-    assertNotEquals(StringUtils.objToString(byteBuffer1), StringUtils.objToString(byteBuffer2));
-  }
-
-  @Test
-  public void testStringEmptyToNull() {
-    assertNull(StringUtils.emptyToNull(""));
-    assertEquals("Test String", StringUtils.emptyToNull("Test String"));
-  }
-
-  @Test
-  public void testStringNullOrEmpty() {
-    assertTrue(StringUtils.isNullOrEmpty(null));
-    assertTrue(StringUtils.isNullOrEmpty(""));
-    assertNotEquals(null, StringUtils.isNullOrEmpty("this is not empty"));
-    assertTrue(StringUtils.isNullOrEmpty(""));
-  }
-
-  @Test
-  public void testSplit() {
-    assertEquals(new ArrayList<>(), StringUtils.split(null, ","));
-    assertEquals(new ArrayList<>(), StringUtils.split("", ","));
-    assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b, c", ","));
-    assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b,, c ", ","));
-  }
-
-  @Test
-  public void testHexString() {
-    String str = "abcd";
-    assertEquals(StringUtils.toHexString(getUTF8Bytes(str)), toHexString(getUTF8Bytes(str)));
-  }
-
-  private static String toHexString(byte[

Re: [PR] [MINOR] Remove redundant TestStringUtils in hudi-common module [hudi]

2024-04-17 Thread via GitHub


codope merged PR #11046:
URL: https://github.com/apache/hudi/pull/11046


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7633] Use try with resources for AutoCloseable (#11045)

2024-04-17 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 78fc62554ce [HUDI-7633] Use try with resources for AutoCloseable 
(#11045)
78fc62554ce is described below

commit 78fc62554ce3798eb803332b42f84f9cfa74e526
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:31:44 2024 -0700

[HUDI-7633] Use try with resources for AutoCloseable (#11045)
---
 .../hudi/cli/commands/ArchivedCommitsCommand.java  | 104 
 .../apache/hudi/cli/commands/ExportCommand.java|  93 +++---
 .../hudi/cli/commands/HoodieLogFileCommand.java| 104 
 .../org/apache/hudi/cli/commands/TableCommand.java |   6 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |   8 +-
 .../hudi/common/model/HoodiePartitionMetadata.java |   8 +-
 .../hudi/common/table/log/LogReaderUtils.java  |  22 ++--
 .../table/log/block/HoodieAvroDataBlock.java   | 135 ++---
 .../hudi/common/util/SerializationUtils.java   |   6 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  24 ++--
 .../java/HoodieJavaWriteClientExample.java |  70 +--
 .../examples/spark/HoodieWriteClientExample.java   |  90 +++---
 .../org/apache/hudi/common/util/FileIOUtils.java   |  14 +--
 .../hudi/utilities/HoodieCompactionAdminTool.java  |   9 +-
 .../utilities/streamer/SchedulerConfGenerator.java |   6 +-
 15 files changed, 344 insertions(+), 355 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
index 075a57d541c..5c57c8f5288 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
@@ -114,47 +114,46 @@ public class ArchivedCommitsCommand {
     List allStats = new ArrayList<>();
     for (FileStatus fs : fsStatuses) {
       // read the archived file
-      Reader reader = HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf),
-          new HoodieLogFile(fs.getPath()), HoodieArchivedMetaEntry.getClassSchema());
-
-      List readRecords = new ArrayList<>();
-      // read the avro blocks
-      while (reader.hasNext()) {
-        HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-        blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> readRecords.add((IndexedRecord) r.getData()));
+      try (Reader reader = HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf),
+          new HoodieLogFile(fs.getPath()), HoodieArchivedMetaEntry.getClassSchema())) {
+        List readRecords = new ArrayList<>();
+        // read the avro blocks
+        while (reader.hasNext()) {
+          HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
+          blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> readRecords.add((IndexedRecord) r.getData()));
+        }
+        List readCommits = readRecords.stream().map(r -> (GenericRecord) r)
+            .filter(r -> r.get("actionType").toString().equals(HoodieTimeline.COMMIT_ACTION)
+                || r.get("actionType").toString().equals(HoodieTimeline.DELTA_COMMIT_ACTION))
+            .flatMap(r -> {
+              HoodieCommitMetadata metadata = (HoodieCommitMetadata) SpecificData.get()
+                  .deepCopy(HoodieCommitMetadata.SCHEMA$, r.get("hoodieCommitMetadata"));
+              final String instantTime = r.get("commitTime").toString();
+              final String action = r.get("actionType").toString();
+              return metadata.getPartitionToWriteStats().values().stream().flatMap(hoodieWriteStats -> hoodieWriteStats.stream().map(hoodieWriteStat -> {
+                List row = new ArrayList<>();
+                row.add(action);
+                row.add(instantTime);
+                row.add(hoodieWriteStat.getPartitionPath());
+                row.add(hoodieWriteStat.getFileId());
+                row.add(hoodieWriteStat.getPrevCommit());
+                row.add(hoodieWriteStat.getNumWrites());
+                row.add(hoodieWriteStat.getNumInserts());
+                row.add(hoodieWriteStat.getNumDeletes());
+                row.add(hoodieWriteStat.getNumUpdateWrites());
+                row.add(hoodieWriteStat.getTotalLogFiles());
+                row.add(hoodieWriteStat.getTotalLogBlocks());
+                row.add(hoodieWriteStat.getTotalCorruptLogBlock());
+                row.add(hoodieWriteStat.getTotalRollbackBlocks());
+                row.add(hoodieWriteStat.getTotalLogRecords());
+                row.add(hoodieWriteStat.getTotalUpdatedRecordsCompacted());
+                row.add(hoodieWriteStat.getTotalWriteBytes());
+                row.add(hoodieWriteStat.getTo

Re: [PR] [HUDI-7633] Use try with resources for AutoCloseable [hudi]

2024-04-17 Thread via GitHub


codope merged PR #11045:
URL: https://github.com/apache/hudi/pull/11045


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11048:
URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062959240

   
   ## CI report:
   
   * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN
   * b3e4c349009dd1cba0181536563f51bdb8c4eb11 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23338)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569903965


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   Then maybe the name `listPathInfo` sounds more proper? Because the `listXXX` 
API indicates that the `XXX` is actually the item type of the return value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569884258


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws 
IOException {
   }
 
   @Override
-  public List listDirectEntries(StoragePath path) throws 
IOException {
+  public List listDirectory(StoragePath path) throws 
IOException {

Review Comment:
   The `FileSystem#listStatus` API: 
https://github.com/apache/hadoop/blob/branch-2.10.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1738
   ```
   /**
    * List the statuses of the files/directories in the given path if the path is
    * a directory.
    * <p>
    * Does not guarantee to return the List of files/directories status in a
    * sorted order.
    * @param f given path
    * @return the statuses of the files/directories in the given patch
    * @throws FileNotFoundException when the path does not exist
    * @throws IOException see specific implementation
    */
   public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException,
       IOException;
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11048:
URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062919526

   
   ## CI report:
   
   * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN
   * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331)
 
   * b3e4c349009dd1cba0181536563f51bdb8c4eb11 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23338)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11031:
URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062919442

   
   ## CI report:
   
   * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN
   * a7270a970988e18ba0997ee59b74678c2dc720ed Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23336)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11032:
URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062919460

   
   ## CI report:
   
   * 254fbf794c65c5d54251f388ec7ea8fdbae29d03 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23303)
 
   * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23339)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569877853


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException {
     this.itr = RecordIterators.getParquetRecordIterator(
         internalSchemaManager,
         utcTimestamp,
-        true,
+        caseSensetive,

Review Comment:
   > this is not user behavior. This is a potential requirement with change
   
   But it seems the DDL and DML handle the case-sensitivity in different ways, which is not plausible here.
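   For illustration, a minimal sketch of what case-sensitive vs. case-insensitive column resolution means here (names assumed, not the Flink/Hudi code):
   ```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Matching a requested column name against schema fields either exactly or
// case-insensitively; the config flag under review would pick the behavior.
public class CaseSensitivityDemo {
  static Optional<String> resolveColumn(List<String> schemaFields, String requested, boolean caseSensitive) {
    return schemaFields.stream()
        .filter(f -> caseSensitive ? f.equals(requested) : f.equalsIgnoreCase(requested))
        .findFirst();
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("UserId", "eventTime");
    System.out.println(resolveColumn(fields, "userid", true));  // Optional.empty
    System.out.println(resolveColumn(fields, "userid", false)); // Optional[UserId]
  }
}
   ```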



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569876744


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   What is the Hadoop API that we can reference?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11048:
URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062914281

   
   ## CI report:
   
   * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN
   * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331)
 
   * b3e4c349009dd1cba0181536563f51bdb8c4eb11 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11032:
URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062914221

   
   ## CI report:
   
   * 254fbf794c65c5d54251f388ec7ea8fdbae29d03 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23303)
 
   * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569866172


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException {
     }
   }
 
+  /**
+   * Opens an SeekableDataInputStream at the indicated path with seeks supported.
+   *
+   * @param path the file to open.
+   * @return the InputStream to read from.
+   * @throws IOException IO error.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public SeekableDataInputStream openSeekable(StoragePath path) throws IOException {

Review Comment:
   I think this one can still be overridden if the particular storage 
implementation provides optimization on this one.
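   A sketch of the pattern being described, with simplified stand-in types (not the actual `HoodieStorage`/`SeekableDataInputStream`):
   ```java
// Simplified stand-ins: the base class ships a generic fallback that a
// concrete storage may override when it has a native seekable stream.
class InMemorySeekableStream {
  private final byte[] data;
  private int pos;
  InMemorySeekableStream(byte[] data) { this.data = data; }
  void seek(int p) { pos = p; }
  int read() { return pos < data.length ? data[pos++] & 0xFF : -1; }
}

abstract class BaseStorage {
  abstract byte[] readAll(String path); // primitive each implementation provides

  // Default fallback: buffer the file, then seek in memory. Deliberately not
  // final, so an optimized implementation can override it.
  InMemorySeekableStream openSeekable(String path) {
    return new InMemorySeekableStream(readAll(path));
  }
}

public class OverridableDefaultDemo extends BaseStorage {
  @Override byte[] readAll(String path) { return new byte[] {1, 2, 3}; }

  public static void main(String[] args) {
    InMemorySeekableStream s = new OverridableDefaultDemo().openSeekable("any");
    s.seek(2);
    System.out.println(s.read()); // 3
  }
}
   ```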



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


jonvex commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569864095


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException {
     }
   }
 
+  /**
+   * Opens an SeekableDataInputStream at the indicated path with seeks supported.
+   *
+   * @param path the file to open.
+   * @return the InputStream to read from.
+   * @throws IOException IO error.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public SeekableDataInputStream openSeekable(StoragePath path) throws IOException {

Review Comment:
   should it be final then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569863508


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   There is already a `listFiles` API which lists all files under the directory recursively, which provides different semantics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569864294


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   The counterpart Hadoop API returns an array, so I think we can keep the return type as List for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


jonvex commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569863920


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException {
     }
   }
 
+  /**
+   * Opens an SeekableDataInputStream at the indicated path with seeks supported.
+   *
+   * @param path the file to open.
+   * @return the InputStream to read from.
+   * @throws IOException IO error.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public SeekableDataInputStream openSeekable(StoragePath path) throws IOException {

Review Comment:
   ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569861327


##
hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java:
##
@@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException {
     assertTrue(storage.createDirectory(path4));
   }
 
+  @Test
+  public void testSeekable() throws IOException {
+    HoodieStorage storage = getHoodieStorage();
+    StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file");
+    assertFalse(storage.exists(path));
+    byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79};
+
+    // By default, create overwrites the file
+    try (OutputStream stream = storage.create(path)) {
+      stream.write(data);
+      stream.flush();
+    }
+
+    try (SeekableDataInputStream seekableStream = storage.openSeekable(path)) {
+      validateSeekableDataInputStream(seekableStream, data);
+    }
+
+    try (SeekableDataInputStream seekableStream = storage.openSeekable(path, 2)) {
+      validateSeekableDataInputStream(seekableStream, data);
+    }
+  }
+
+  private void validateSeekableDataInputStream(SeekableDataInputStream seekableStream,
+                                               byte[] expectedData) throws IOException {
+    List<Integer> positionList = new ArrayList<>();
+    positionList.add(1);
+    positionList.add(expectedData.length / 2);
+    positionList.add(expectedData.length - 1);

Review Comment:
   They are meant to test non-contiguous and backward seeks.
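   A minimal sketch of that access pattern, using `java.io.RandomAccessFile` as a stand-in for `SeekableDataInputStream`:
   ```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Reads at scattered offsets, ending with a backward seek, and verifies each
// read lands at the sought position.
public class SeekPatternDemo {
  public static void main(String[] args) throws IOException {
    byte[] data = {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79};
    Path p = Files.createTempFile("seek-demo", ".bin");
    Files.write(p, data);
    try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "r")) {
      int[] positions = {1, data.length / 2, data.length - 1, 0}; // last one seeks backward
      for (int pos : positions) {
        f.seek(pos);
        if (f.readByte() != data[pos]) {
          throw new IllegalStateException("mismatch at " + pos);
        }
      }
    }
    System.out.println("all seeks validated");
  }
}
   ```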



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569860811


##
hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java:
##
@@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException {
     assertTrue(storage.createDirectory(path4));
   }
 
+  @Test
+  public void testSeekable() throws IOException {
+    HoodieStorage storage = getHoodieStorage();
+    StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file");
+    assertFalse(storage.exists(path));
+    byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79};

Review Comment:
   Because for values larger than 127 it has to be cast from int to byte (or we would have to use a negative value here, which is not going to be readable).
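   A quick illustration of the Java byte rule referenced here:
   ```java
// Java's byte is signed (-128..127), so int literals above 127 need an
// explicit cast, and the stored value wraps around.
public class ByteLiteralDemo {
  public static void main(String[] args) {
    byte ok = 127;             // fits without a cast
    byte casted = (byte) 158;  // stores 158 - 256 = -98
    byte negative = -98;       // the less readable alternative
    System.out.println(casted == negative); // true
  }
}
   ```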



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569860096


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException {
     }
   }
 
+  /**
+   * Opens an SeekableDataInputStream at the indicated path with seeks supported.
+   *
+   * @param path the file to open.
+   * @return the InputStream to read from.
+   * @throws IOException IO error.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public SeekableDataInputStream openSeekable(StoragePath path) throws IOException {

Review Comment:
   I intentionally put the APIs that do not need to be implemented at the end 
of the class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch release-0.14.1-hotfix updated (4d429ac918d -> ddd0414ce97)

2024-04-17 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch release-0.14.1-hotfix
in repository https://gitbox.apache.org/repos/asf/hudi.git


 discard 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row 
writer code path (#10836)
 new ddd0414ce97 [HUDI-7489] Avoid collecting WriteStatus to driver in row 
writer code path (#10836)

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (4d429ac918d)
\
 N -- N -- N   refs/heads/release-0.14.1-hotfix (ddd0414ce97)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java| 1 +
 1 file changed, 1 insertion(+)



(hudi) 01/01: [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836)

2024-04-17 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch release-0.14.1-hotfix
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ddd0414ce9745c9cfe4c173195709fbf101c1be4
Author: Jon Vexler 
AuthorDate: Mon Mar 11 17:25:41 2024 -0700

[HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path 
(#10836)

* get rid of collect in row writer clustering

* fix race condition

* add logging

-

Co-authored-by: Jonathan Vexler <=>
---
 .../index/bucket/ConsistentBucketIndexUtils.java   | 12 ++-
 .../hudi/HoodieDatasetBulkInsertHelper.scala   | 89 +++---
 2 files changed, 56 insertions(+), 45 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
index f8befee9bf9..160688f6708 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
@@ -53,6 +53,7 @@ import java.util.stream.Collectors;
 import static 
org.apache.hudi.common.model.HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX;
 import static 
org.apache.hudi.common.model.HoodieConsistentHashingMetadata.HASHING_METADATA_FILE_SUFFIX;
 import static 
org.apache.hudi.common.model.HoodieConsistentHashingMetadata.getTimestampFromFile;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
 
 /**
  * Utilities class for consistent bucket index metadata management.
@@ -208,7 +209,16 @@ public class ConsistentBucketIndexUtils {
     if (fs.exists(fullPath)) {
       return;
     }
-    FileIOUtils.createFileInPath(fs, fullPath, Option.of(StringUtils.EMPTY_STRING.getBytes()));
+    //prevent exception from race condition. We are ok with the file being created in another thread, so we should
+    // check for the marker after catching the exception and we don't need to fail if the file exists
+    try {
+      FileIOUtils.createFileInPath(fs, fullPath, Option.of(getUTF8Bytes(StringUtils.EMPTY_STRING)));
+    } catch (HoodieIOException e) {
+      if (!fs.exists(fullPath)) {
+        throw e;
+      }
+      LOG.warn("Failed to create marker but " + fullPath + " exists", e);
+    }
   }
 
   /***
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
index 9623d4d1c09..95302e9ab37 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
@@ -26,6 +26,7 @@ import org.apache.hudi.common.engine.TaskContextSupplier
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.common.util.ReflectionUtils
 import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.data.HoodieJavaRDD
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.index.HoodieIndex.BucketIndexEngineType
 import org.apache.hudi.index.{HoodieIndex, SparkHoodieIndexFactory}
@@ -149,53 +150,53 @@ object HoodieDatasetBulkInsertHelper
                arePartitionRecordsSorted: Boolean,
                shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = {
     val schema = dataset.schema
-    val writeStatuses = injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => {
-      val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier
-      val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
-      val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
-      val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+    HoodieJavaRDD.of(
+      injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => {
+        val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier
+        val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+        val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+        val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
 
-      val writer = writeConfig.getIndexType match {
-        case HoodieIndex.IndexType.BUCKET if writeConfig.getBucketIndexEngineType
-          == BucketIndexEngineType.CONSISTENT_HASHING =>
-          new ConsistentBucketBulkInsertDataInternalWriterHelper(
-            table,
-            writeConfig,
-            instantTime,
-            taskPartitionId,
-            taskId,
-            taskEpochId,
-            schema,
-            writeConfig.populateMetaFields,
-            arePartitionRecordsSorted,
-            shouldPres

Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569859081


##
hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopFSUtils.java:
##
@@ -154,4 +158,90 @@ public static FileStatus convertToHadoopFileStatus(StoragePathInfo pathInfo) {
         pathInfo.getModificationTime(),
         convertToHadoopPath(pathInfo.getPath()));
   }
+
+  /**
+   * Fetch the right {@link FSDataInputStream} to be used by wrapping with required input streams.
+   *
+   * @param fs         instance of {@link FileSystem} in use.
+   * @param filePath   path of the file.
+   * @param bufferSize buffer size to be used.
+   * @return the right {@link FSDataInputStream} as required.
+   */
+  public static FSDataInputStream getFSDataInputStream(FileSystem fs,
+                                                       StoragePath filePath,
+                                                       int bufferSize) {
+    FSDataInputStream fsDataInputStream = null;
+    try {
+      fsDataInputStream = fs.open(new Path(filePath.toUri()), bufferSize);

Review Comment:
   Good catch.  We have the util method.  Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) 02/02: [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836)

2024-04-17 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch release-0.14.1-hotfix
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4d429ac918dae3116fbc9df420e0baccdf79ea0b
Author: Jon Vexler 
AuthorDate: Mon Mar 11 17:25:41 2024 -0700

[HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path 
(#10836)

* get rid of collect in row writer clustering

* fix race condition

* add logging

-

Co-authored-by: Jonathan Vexler <=>
---
 .../index/bucket/ConsistentBucketIndexUtils.java   | 11 ++-
 .../hudi/HoodieDatasetBulkInsertHelper.scala   | 89 +++---
 2 files changed, 55 insertions(+), 45 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
index f8befee9bf9..1e8abe09ecc 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java
@@ -208,7 +208,16 @@ public class ConsistentBucketIndexUtils {
 if (fs.exists(fullPath)) {
   return;
 }
-FileIOUtils.createFileInPath(fs, fullPath, 
Option.of(StringUtils.EMPTY_STRING.getBytes()));
+//prevent exception from race condition. We are ok with the file being 
created in another thread, so we should
+// check for the marker after catching the exception and we don't need to 
fail if the file exists
+try {
+  FileIOUtils.createFileInPath(fs, fullPath, 
Option.of(getUTF8Bytes(StringUtils.EMPTY_STRING)));
+} catch (HoodieIOException e) {
+  if (!fs.exists(fullPath)) {
+throw e;
+  }
+  LOG.warn("Failed to create marker but " + fullPath + " exists", e);
+}
   }
 
   /***
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
index 9623d4d1c09..95302e9ab37 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
@@ -26,6 +26,7 @@ import org.apache.hudi.common.engine.TaskContextSupplier
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.common.util.ReflectionUtils
 import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.data.HoodieJavaRDD
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.index.HoodieIndex.BucketIndexEngineType
 import org.apache.hudi.index.{HoodieIndex, SparkHoodieIndexFactory}
@@ -149,53 +150,53 @@ object HoodieDatasetBulkInsertHelper
  arePartitionRecordsSorted: Boolean,
  shouldPreserveHoodieMetadata: Boolean): 
HoodieData[WriteStatus] = {
 val schema = dataset.schema
-val writeStatuses = 
injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => {
-  val taskContextSupplier: TaskContextSupplier = 
table.getTaskContextSupplier
-  val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
-  val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
-  val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+HoodieJavaRDD.of(
+  injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => {
+val taskContextSupplier: TaskContextSupplier = 
table.getTaskContextSupplier
+val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
 
-  val writer = writeConfig.getIndexType match {
-case HoodieIndex.IndexType.BUCKET if 
writeConfig.getBucketIndexEngineType
-  == BucketIndexEngineType.CONSISTENT_HASHING =>
-  new ConsistentBucketBulkInsertDataInternalWriterHelper(
-table,
-writeConfig,
-instantTime,
-taskPartitionId,
-taskId,
-taskEpochId,
-schema,
-writeConfig.populateMetaFields,
-arePartitionRecordsSorted,
-shouldPreserveHoodieMetadata)
-case _ =>
-  new BulkInsertDataInternalWriterHelper(
-table,
-writeConfig,
-instantTime,
-taskPartitionId,
-taskId,
-taskEpochId,
-schema,
-writeConfig.populateMetaFields,
-arePartitionRecordsSorted,
-shouldPreserveHoodieMetadata)
-  }
+val writer = writeConfig.getIndexType match {
+  case HoodieIndex.IndexType.BUCKET if 
writeConfig.getBuc

(hudi) branch release-0.14.1-hotfix created (now 4d429ac918d)

2024-04-17 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch release-0.14.1-hotfix
in repository https://gitbox.apache.org/repos/asf/hudi.git


  at 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row 
writer code path (#10836)

This branch includes the following new commits:

 new a1ea1da55a8 [HUDI-7479] SQL confs don't propagate to spark row writer 
properly (#10786)
 new 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row 
writer code path (#10836)

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




(hudi) 01/02: [HUDI-7479] SQL confs don't propagate to spark row writer properly (#10786)

2024-04-17 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch release-0.14.1-hotfix
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a1ea1da55a84e3db51a0a06be99645831d6bc0f3
Author: Jon Vexler 
AuthorDate: Tue Mar 5 04:31:55 2024 -0500

[HUDI-7479] SQL confs don't propagate to spark row writer properly (#10786)
---
 .../hudi/HoodieDatasetBulkInsertHelper.scala   | 15 +-
 .../scala/org/apache/hudi/HoodieSparkUtils.scala   |  2 +-
 .../common/testutils/HoodieTestDataGenerator.java  | 15 +-
 .../deltastreamer/HoodieDeltaStreamerTestBase.java |  9 ++--
 .../deltastreamer/TestHoodieDeltaStreamer.java | 24 +-
 5 files changed, 51 insertions(+), 14 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
index 75ec069946d..9623d4d1c09 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
@@ -17,6 +17,7 @@
 
 package org.apache.hudi
 
+import org.apache.hudi.HoodieSparkUtils.injectSQLConf
 import org.apache.hudi.client.WriteStatus
 import org.apache.hudi.client.model.HoodieInternalRow
 import org.apache.hudi.common.config.TypedProperties
@@ -40,11 +41,14 @@ import 
org.apache.spark.sql.HoodieUnsafeUtils.getNumPartitions
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
 import org.apache.spark.sql.catalyst.plans.logical.Project
+import org.apache.spark.sql.execution.SQLConfInjectingRDD
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{StringType, StructField, StructType}
 import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeUtils, Row}
 import org.apache.spark.unsafe.types.UTF8String
 
 import scala.collection.JavaConverters.{asScalaBufferConverter, 
seqAsJavaListConverter}
+import scala.reflect.ClassTag
 
 object HoodieDatasetBulkInsertHelper
   extends ParallelismHelper[DataFrame](toJavaSerializableFunctionUnchecked(df 
=> getNumPartitions(df))) with Logging {
@@ -80,8 +84,8 @@ object HoodieDatasetBulkInsertHelper
   val keyGeneratorClassName = 
config.getStringOrThrow(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME,
 "Key-generator class name is required")
 
-  val prependedRdd: RDD[InternalRow] =
-df.queryExecution.toRdd.mapPartitions { iter =>
+  val prependedRdd: RDD[InternalRow] = {
+injectSQLConf(df.queryExecution.toRdd.mapPartitions { iter =>
   val typedProps = new TypedProperties(config.getProps)
   if (autoGenerateRecordKeys) {
 
typedProps.setProperty(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG, 
String.valueOf(TaskContext.getPartitionId()))
@@ -107,7 +111,8 @@ object HoodieDatasetBulkInsertHelper
 // TODO use mutable row, avoid re-allocating
 new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, 
partitionPath, filename, row, false)
   }
-}
+}, SQLConf.get)
+  }
 
   val dedupedRdd = if (config.shouldCombineBeforeInsert) {
 dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField, 
SparkHoodieIndexFactory.isGlobalIndex(config))
@@ -144,7 +149,7 @@ object HoodieDatasetBulkInsertHelper
  arePartitionRecordsSorted: Boolean,
  shouldPreserveHoodieMetadata: Boolean): 
HoodieData[WriteStatus] = {
 val schema = dataset.schema
-val writeStatuses = dataset.queryExecution.toRdd.mapPartitions(iter => {
+val writeStatuses = 
injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => {
   val taskContextSupplier: TaskContextSupplier = 
table.getTaskContextSupplier
   val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
   val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
@@ -189,7 +194,7 @@ object HoodieDatasetBulkInsertHelper
   }
 
   writer.getWriteStatuses.asScala.iterator
-}).collect()
+}), SQLConf.get).collect()
 table.getContext.parallelize(writeStatuses.toList.asJava)
   }
 
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
index a0fe879b3db..92adfdd4389 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
@@ -125,7 +125,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with 
SparkVersionsSupport wi
 }, SQLConf.get)
   }
 
-  private def injectSQLConf[T: ClassTag](rdd: RDD[T], conf: SQLConf): RDD[T] =
+  def injectSQLConf[T: ClassTa

Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569824114


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   For example, this could be used in a unified SQL engine, whether open 
source or developed in-house by a business, to improve the user-facing SQL 
behavior, such as unified case sensitivity or standard SQL; it is not 
per-user behavior. This is a potential requirement enabled by the change. 
@danny0405 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


jonvex commented on code in PR #11048:
URL: https://github.com/apache/hudi/pull/11048#discussion_r1569817603


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws 
IOException {
 }
   }
 
+  /**
+   * Opens a SeekableDataInputStream at the indicated path with seeks 
supported.
+   *
+   * @param path the file to open.
+   * @return the InputStream to read from.
+   * @throws IOException IO error.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public SeekableDataInputStream openSeekable(StoragePath path) throws 
IOException {

Review Comment:
   minor: should put this next to the other openSeekable def
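   For reference, a minimal usage sketch of the new API (the path is made up, 
   and it assumes SeekableDataInputStream exposes seek(long) plus the usual 
   DataInput-style reads, as the tests below exercise):
   
   StoragePath path = new StoragePath("/tmp/testSeekable/1.file"); // hypothetical path
   try (SeekableDataInputStream in = storage.openSeekable(path)) {
     in.seek(4);                 // jump to an absolute byte offset
     byte value = in.readByte(); // read starting at that offset
   }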



##
hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java:
##
@@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException {
 assertTrue(storage.createDirectory(path4));
   }
 
+  @Test
+  public void testSeekable() throws IOException {
+HoodieStorage storage = getHoodieStorage();
+StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file");
+assertFalse(storage.exists(path));
+byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 
79};
+
+// By default, create overwrites the file
+try (OutputStream stream = storage.create(path)) {
+  stream.write(data);
+  stream.flush();
+}
+
+try (SeekableDataInputStream seekableStream = storage.openSeekable(path)) {
+  validateSeekableDataInputStream(seekableStream, data);
+}
+
+try (SeekableDataInputStream seekableStream = storage.openSeekable(path, 
2)) {
+  validateSeekableDataInputStream(seekableStream, data);
+}
+  }
+
+  private void validateSeekableDataInputStream(SeekableDataInputStream 
seekableStream,
+   byte[] expectedData) throws 
IOException {
+List<Integer> positionList = new ArrayList<>();
+positionList.add(1);
+positionList.add(expectedData.length / 2);
+positionList.add(expectedData.length - 1);

Review Comment:
   why add these 3 if they will already be in the range [0,expectedData.length) 
?



##
hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopFSUtils.java:
##
@@ -154,4 +158,90 @@ public static FileStatus 
convertToHadoopFileStatus(StoragePathInfo pathInfo) {
 pathInfo.getModificationTime(),
 convertToHadoopPath(pathInfo.getPath()));
   }
+
+  /**
+   * Fetch the right {@link FSDataInputStream} to be used by wrapping with 
required input streams.
+   *
+   * @param fs instance of {@link FileSystem} in use.
+   * @param filePath   path of the file.
+   * @param bufferSize buffer size to be used.
+   * @return the right {@link FSDataInputStream} as required.
+   */
+  public static FSDataInputStream getFSDataInputStream(FileSystem fs,
+   StoragePath filePath,
+   int bufferSize) {
+FSDataInputStream fsDataInputStream = null;
+try {
+  fsDataInputStream = fs.open(new Path(filePath.toUri()), bufferSize);

Review Comment:
   What do you think about a helper method for this?:
   new Path(filePath.toUri())
   
   or will that just add more clutter?



##
hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java:
##
@@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException {
 assertTrue(storage.createDirectory(path4));
   }
 
+  @Test
+  public void testSeekable() throws IOException {
+HoodieStorage storage = getHoodieStorage();
+StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file");
+assertFalse(storage.exists(path));
+byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 
79};

Review Comment:
   why did you need to cast the values over 100?
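   Context: Java's `byte` is signed, with range -128..127, so int literals 
   above 127 need an explicit narrowing cast. A standalone illustration, not 
   code from the PR:
   
   byte ok = 127;              // fits the signed byte range
   // byte bad = 158;          // does not compile: lossy conversion from int to byte
   byte wrapped = (byte) 158;  // explicit narrowing stores -98 (158 - 256)
   byte wrapped2 = (byte) 233; // stores -23 (233 - 256)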



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569824114


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   For example, this could be used in a unified SQL engine, whether open 
source or developed in-house by a business, to improve the user-facing SQL 
behavior, such as unified case sensitivity or standard SQL. This is a 
potential requirement enabled by the change. @danny0405 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569824114


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   For example, this could be used in a unified SQL engine, whether open 
source or developed in-house by a business, to improve the user-facing SQL 
behavior, such as unified case sensitivity or standard SQL. This is a 
potential requirement. @danny0405 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569813734


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   > When parquet real schema is (day,user),but user selected schema is 
(dAy,UseR)
   
   Why is there a discrepancy in field case-sensitivity, i.e. why would the 
schema specified by the user be inconsistent with the table schema definition?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4228) Clean up literal usage in Hudi CLI argument check

2024-04-17 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4228:
-
Fix Version/s: 1.0.0

> Clean up literal usage in Hudi CLI argument check
> -
>
> Key: HUDI-4228
> URL: https://issues.apache.org/jira/browse/HUDI-4228
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Ethan Guo
>Assignee: Vova Kolmakov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In "org.apache.hudi.cli.commands.SparkMain", the logic for checking number of 
> arguments for different Hudi CLI commands is hardcoded with literals like 
> this:
> {code:java}
> case COMPACT_RUN:
>   assert (args.length >= 10);
>   propsFilePath = null;
>   if (!StringUtils.isNullOrEmpty(args[9])) {
> propsFilePath = args[9];
>   }
>   configs = new ArrayList<>();
>   if (args.length > 10) {
> configs.addAll(Arrays.asList(args).subList(9, args.length));
>   }
>   returnCode = compact(jsc, args[3], args[4], args[5], 
> Integer.parseInt(args[6]), args[7],
>   Integer.parseInt(args[8]), HoodieCompactor.EXECUTE, propsFilePath, 
> configs);
>   break; {code}
> We should have a better way of validating this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4228) Clean up literal usage in Hudi CLI argument check

2024-04-17 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-4228.

Resolution: Fixed

Fixed via master branch: 4c04216d5cade10f9589e2c0d425109cc502ebcd

> Clean up literal usage in Hudi CLI argument check
> -
>
> Key: HUDI-4228
> URL: https://issues.apache.org/jira/browse/HUDI-4228
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Ethan Guo
>Assignee: Vova Kolmakov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In "org.apache.hudi.cli.commands.SparkMain", the logic for checking number of 
> arguments for different Hudi CLI commands is hardcoded with literals like 
> this:
> {code:java}
> case COMPACT_RUN:
>   assert (args.length >= 10);
>   propsFilePath = null;
>   if (!StringUtils.isNullOrEmpty(args[9])) {
> propsFilePath = args[9];
>   }
>   configs = new ArrayList<>();
>   if (args.length > 10) {
> configs.addAll(Arrays.asList(args).subList(9, args.length));
>   }
>   returnCode = compact(jsc, args[3], args[4], args[5], 
> Integer.parseInt(args[6]), args[7],
>   Integer.parseInt(args[8]), HoodieCompactor.EXECUTE, propsFilePath, 
> configs);
>   break; {code}
> We should have a better way of validating this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)

2024-04-17 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c04216d5ca [HUDI-4228] Clean up literal usage in Hudi CLI argument 
check (#11042)
4c04216d5ca is described below

commit 4c04216d5cade10f9589e2c0d425109cc502ebcd
Author: Vova Kolmakov 
AuthorDate: Thu Apr 18 09:14:32 2024 +0700

[HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
---
 .../org/apache/hudi/cli/commands/SparkMain.java| 202 +++--
 .../org/apache/hudi/cli/ArchiveExecutorUtils.java  |   2 +-
 2 files changed, 69 insertions(+), 135 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
index 5e1aec52333..ba3b404474b 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
@@ -19,14 +19,12 @@
 package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.cli.ArchiveExecutorUtils;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.client.SparkRDDWriteClient;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
-import org.apache.hudi.client.timeline.HoodieTimelineArchiver;
-import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.HoodieEngineContext;
-import org.apache.hudi.common.model.HoodieAvroPayload;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.WriteOperationType;
@@ -37,7 +35,6 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.PartitionPathEncodeUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.config.HoodieArchivalConfig;
 import org.apache.hudi.config.HoodieBootstrapConfig;
 import org.apache.hudi.config.HoodieCleanConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
@@ -99,16 +96,45 @@ public class SparkMain {
* Commands.
*/
   enum SparkCommand {
-BOOTSTRAP, ROLLBACK, DEDUPLICATE, ROLLBACK_TO_SAVEPOINT, SAVEPOINT, 
IMPORT, UPSERT, COMPACT_SCHEDULE, COMPACT_RUN, COMPACT_SCHEDULE_AND_EXECUTE,
-COMPACT_UNSCHEDULE_PLAN, COMPACT_UNSCHEDULE_FILE, COMPACT_VALIDATE, 
COMPACT_REPAIR, CLUSTERING_SCHEDULE,
-CLUSTERING_RUN, CLUSTERING_SCHEDULE_AND_EXECUTE, CLEAN, DELETE_MARKER, 
DELETE_SAVEPOINT, UPGRADE, DOWNGRADE,
-REPAIR_DEPRECATED_PARTITION, RENAME_PARTITION, ARCHIVE
+BOOTSTRAP(18), ROLLBACK(6), DEDUPLICATE(8), ROLLBACK_TO_SAVEPOINT(6), 
SAVEPOINT(7),
+IMPORT(13), UPSERT(13), COMPACT_SCHEDULE(7), COMPACT_RUN(10), 
COMPACT_SCHEDULE_AND_EXECUTE(9),
+COMPACT_UNSCHEDULE_PLAN(9), COMPACT_UNSCHEDULE_FILE(10), 
COMPACT_VALIDATE(7), COMPACT_REPAIR(8),
+CLUSTERING_SCHEDULE(7), CLUSTERING_RUN(9), 
CLUSTERING_SCHEDULE_AND_EXECUTE(8), CLEAN(5),
+DELETE_MARKER(5), DELETE_SAVEPOINT(5), UPGRADE(5), DOWNGRADE(5),
+REPAIR_DEPRECATED_PARTITION(4), RENAME_PARTITION(6), ARCHIVE(8);
+
+private final int minArgsCount;
+
+SparkCommand(int minArgsCount) {
+  this.minArgsCount = minArgsCount;
+}
+
+void assertEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount == minArgsCount);
+}
+
+void assertGtEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount >= minArgsCount);
+}
+
+List makeConfigs(String[] args) {
+  List configs = new ArrayList<>();
+  if (args.length > minArgsCount) {
+configs.addAll(Arrays.asList(args).subList(minArgsCount, args.length));
+  }
+  return configs;
+}
+
+String getPropsFilePath(String[] args) {
+  return (args.length >= minArgsCount && 
!StringUtils.isNullOrEmpty(args[minArgsCount - 1]))
+  ? args[minArgsCount - 1] : null;
+}
   }
 
-  public static void main(String[] args) throws Exception {
+  public static void main(String[] args) {
 ValidationUtils.checkArgument(args.length >= 4);
 final String commandString = args[0];
-LOG.info("Invoking SparkMain: " + commandString);
+LOG.info("Invoking SparkMain: {}", commandString);
 final SparkCommand cmd = SparkCommand.valueOf(commandString);
 
 JavaSparkContext jsc = SparkUtil.initJavaSparkContext("hoodie-cli-" + 
commandString,
@@ -116,193 +142,112 @@ public class SparkMain {
 
 int returnCode = 0;
 try {
+  cmd.assertGtEq(args.length);
+  List configs = cmd.makeConfigs(args);
+  String propsFilePath = cmd.getPropsFilePath(args);
   switch (cmd) {
 case ROLLBACK:
-  assert (args.length == 6);
+  cmd.assertEq(args.l

Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]

2024-04-17 Thread via GitHub


danny0405 merged PR #11042:
URL: https://github.com/apache/hudi/pull/11042


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11031:
URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062867237

   
   ## CI report:
   
   * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN
   * 5ffea9f5122b0302cc254c88edb4fbe5feb05d39 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23321)
 
   * a7270a970988e18ba0997ee59b74678c2dc720ed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23336)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11042:
URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062861299

   
   ## CI report:
   
   * 3892e87e4b9a45c05090890cbda223a59e87eb2a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23332)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-17 Thread via GitHub


yihua commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062860564

   I need to check a few more things before landing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11031:
URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062861117

   
   ## CI report:
   
   * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN
   * 5ffea9f5122b0302cc254c88edb4fbe5feb05d39 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23321)
 
   * a7270a970988e18ba0997ee59b74678c2dc720ed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11050:
URL: https://github.com/apache/hudi/pull/11050#issuecomment-2062861431

   
   ## CI report:
   
   * f7a837d2512ae7588af5e18a596c4c58167c27cb Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23334)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062860147

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN
   * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN
   * 5b08db2d947b74dc26873f4ec7c6022a52e07193 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23335)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569794748


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws 
IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   OK, e.g.:
   1) When the real Parquet schema is (Day,day,user) and caseSensitive is 
true, a selected schema of (Day,day,user) can be pushed down and the Parquet 
read succeeds when building the ParquetFileReader; this scenario is fine.
   2) When the real Parquet schema is (day,user) but the user-selected schema 
is (dAy,UseR), with caseSensitive set to true the projection is not pushed 
down and the Parquet read fails; in this scenario caseSensitive should be set 
to false to make it work. @danny0405 
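   To make scenario 2 concrete, a standalone sketch of the matching rule (the 
   helper name is hypothetical, not from the PR):
   
   // Hypothetical helper: match a requested field name against the real Parquet field name.
   static boolean fieldMatches(String requested, String actual, boolean caseSensitive) {
     return caseSensitive ? requested.equals(actual) : requested.equalsIgnoreCase(actual);
   }
   // fieldMatches("dAy", "day", true)  -> false: the clipped schema misses the column
   // fieldMatches("dAy", "day", false) -> true : the column is still matched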



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7636] Make StoragePath Serializable [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11049:
URL: https://github.com/apache/hudi/pull/11049#issuecomment-2062855268

   
   ## CI report:
   
   * 922aaf83efb3988273d62aa2de2977ee40ff860e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1569793746


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws 
IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws 
IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws 
IOException {

Review Comment:
   Maybe `listFiles` ? Should we return an iterator instead?
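   For comparison, an iterator-returning variant might look like this 
   (hypothetical signature, not part of the current HoodieStorage API):
   
   // Hypothetical lazy alternative: entries are streamed instead of materialized in a List.
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public abstract ClosableIterator<StoragePathInfo> listDirectoryLazily(StoragePath path) throws IOException;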



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]

2024-04-17 Thread via GitHub


zhuanshenbsj1 commented on PR #11031:
URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062848895

   @danny0405 cc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7591) Implement InlineFS in HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7591:

Status: In Progress  (was: Open)

> Implement InlineFS in HoodieStorage
> ---
>
> Key: HUDI-7591
> URL: https://issues.apache.org/jira/browse/HUDI-7591
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7616) Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7616.
---
Resolution: Fixed

> Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple
> --
>
> Key: HUDI-7616
> URL: https://issues.apache.org/jira/browse/HUDI-7616
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-04-17 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7007:
-

Assignee: Sagar Sumit  (was: Vinaykumar Bhat)

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on functional index in the reader path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1569772713


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -242,7 +252,44 @@ protected Pair<ClosableIterator<T>, Schema> 
getRecordsIterator(HoodieDataBlock dataBlock
 } else {
   blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext);
 }
-return Pair.of(blockRecordsIterator, dataBlock.getSchema());
+Option<Pair<Function<T, T>, Schema>> schemaEvolutionTransformerOpt =

Review Comment:
   To clarify, do we put the common schema evolution logic in the file group 
reader or the record buffer classes?  If that's the case, Spark parquet reader 
does not have to handle schema evolution, and we have common logic in the file 
group reader or the record buffer classes for schema on read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7636) Make StoragePath Serializable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7636:

Epic Link: HUDI-6243

> Make StoragePath Serializable
> -
>
> Key: HUDI-7636
> URL: https://issues.apache.org/jira/browse/HUDI-7636
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7637:

Epic Link: HUDI-6243

> Make StoragePathInfo Comparable
> ---
>
> Key: HUDI-7637
> URL: https://issues.apache.org/jira/browse/HUDI-7637
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7348) Replace Configuration with StorageConfiguration for storage configuration

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7348:

Epic Link: HUDI-6243

> Replace Configuration with StorageConfiguration for storage configuration
> -
>
> Key: HUDI-7348
> URL: https://issues.apache.org/jira/browse/HUDI-7348
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7634:

Epic Link: HUDI-6243

> Rename HoodieStorage APIs
> -
>
> Key: HUDI-7634
> URL: https://issues.apache.org/jira/browse/HUDI-7634
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> getHoodieStorage -> getStorage
> listDirectEntries -> listDirectory



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7633:

Epic Link: HUDI-7537

> Use try with resources for AutoCloseable
> 
>
> Key: HUDI-7633
> URL: https://issues.apache.org/jira/browse/HUDI-7633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Epic Link: HUDI-6243

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7637:

Story Points: 0.5

> Make StoragePathInfo Comparable
> ---
>
> Key: HUDI-7637
> URL: https://issues.apache.org/jira/browse/HUDI-7637
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7634:

Story Points: 0

> Rename HoodieStorage APIs
> -
>
> Key: HUDI-7634
> URL: https://issues.apache.org/jira/browse/HUDI-7634
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> getHoodieStorage -> getStorage
> listDirectEntries -> listDirectory



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Story Points: 0.5

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7636) Make StoragePath Serializable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7636:

Story Points: 1

> Make StoragePath Serializable
> -
>
> Key: HUDI-7636
> URL: https://issues.apache.org/jira/browse/HUDI-7636
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7633:

Story Points: 0.5

> Use try with resources for AutoCloseable
> 
>
> Key: HUDI-7633
> URL: https://issues.apache.org/jira/browse/HUDI-7633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-17 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838411#comment-17838411
 ] 

Danny Chen commented on HUDI-7596:
--

The link 
[https://www.baeldung.com/maven-jacoco-multi-module-project] actually turns 
out to be a 404.

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Danny Chen
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7616) Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7616:

Story Points: 1

> Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple
> --
>
> Key: HUDI-7616
> URL: https://issues.apache.org/jira/browse/HUDI-7616
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]

2024-04-17 Thread via GitHub


wombatu-kun commented on PR #11042:
URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062812536

   But all the refactoring operations are quite trivial, and I checked the 
minimum-argument-count settings for all commands three times, so I think this 
patch should not break anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11048:
URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062811641

   
   ## CI report:
   
   * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN
   * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1569752704


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -231,7 +231,13 @@ private static Option findNestedField(Schema 
schema, String[] fiel
 if (!nestedPart.isPresent()) {
   return Option.empty();
 }
-return nestedPart;
+boolean isUnion = false;

Review Comment:
   Could you explain this change?  This affects the logic of all callers, not 
just schema evolution.



##
hudi-io/src/main/java/org/apache/hudi/common/util/ValidationUtils.java:
##
@@ -76,4 +76,10 @@ public static void checkState(final boolean expression, 
String errorMessage) {
   throw new IllegalStateException(errorMessage);
 }
   }
+
+  public static void checkNotNull(Object o) {
+if (o == null) {
+  throw new IllegalStateException();

Review Comment:
   Add an error message to the exception.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##
@@ -64,7 +65,7 @@ abstract class HoodieBaseHadoopFsRelationFactory(val 
sqlContext: SQLContext,
  val schemaSpec: 
Option[StructType]
 ) extends SparkAdapterSupport 
with HoodieHadoopFsRelationFactory {
   protected lazy val sparkSession: SparkSession = sqlContext.sparkSession
-  protected lazy val optParams: Map[String, String] = options
+  protected var optParams: Map[String, String] = options

Review Comment:
   Can we keep this immutable and modify the input `options`?



##
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##
@@ -211,6 +221,10 @@ public Map 
updateSchemaAndResetOrderingValInMetadata(Map projectRecord(Schema from, Schema to);
 
+  public UnaryOperator<T> projectRecordUnsafe(Schema from, Schema to, 
Map<String, String> renamedColumns) {

Review Comment:
   Can this be unified with `projectRecord` instead of adding a new one?
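   e.g. a single method where the rename map is always explicit (hypothetical 
   unified signature; callers without renames would pass an empty map):
   
   // Hypothetical unified form replacing both projectRecord variants.
   public abstract UnaryOperator<T> projectRecord(Schema from, Schema to, Map<String, String> renamedColumns);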



##
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##
@@ -48,6 +49,12 @@
  *and {@code RowData} in Flink.
  */
 public abstract class HoodieReaderContext<T> {
+  protected HoodieFileGroupReaderState<T> readerState = new HoodieFileGroupReaderState<>();
+
+  public HoodieFileGroupReaderState<T> getReaderState() {
+    return readerState;
+  }
+

Review Comment:
   Can this be passed in and immutable?  It's a bad idea to modify this just 
for tests.
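   Roughly, the constructor-injected, immutable shape being suggested (a 
   sketch, not the actual class):
   
   public abstract class HoodieReaderContext<T> {
     // Sketch: state is supplied once at construction and never reassigned.
     private final HoodieFileGroupReaderState<T> readerState;
   
     protected HoodieReaderContext(HoodieFileGroupReaderState<T> readerState) {
       this.readerState = readerState;
     }
   
     public HoodieFileGroupReaderState<T> getReaderState() {
       return readerState;
     }
   }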



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062810967

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN
   * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN
   * 888fc20e7e8ef343091fa88645b96a959fa1c662 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23325)
 
   * 5b08db2d947b74dc26873f4ec7c6022a52e07193 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23335)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]

2024-04-17 Thread via GitHub


wombatu-kun commented on PR #11042:
URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062809664

   > @wombatu-kun Did you have any chance to test the HUDI CLI locally with 
your patch?
   
   No, unfortunately I don't have an environment to test it locally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062805303

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN
   * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN
   * 888fc20e7e8ef343091fa88645b96a959fa1c662 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23325)
 
   * 5b08db2d947b74dc26873f4ec7c6022a52e07193 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-17 Thread via GitHub


hudi-bot commented on PR #11047:
URL: https://github.com/apache/hudi/pull/11047#issuecomment-2062799489

   
   ## CI report:
   
   * ce538c2a343e865608bb8946b9fdc2aa1c383e71 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23330)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1569730199


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -71,24 +64,36 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   isIncremental: Boolean,
   shouldUseRecordPosition: Boolean,
   requiredFilters: Seq[Filter]
-   ) extends ParquetFileFormat with SparkAdapterSupport with HoodieFormatTrait {
+ ) extends ParquetFileFormat with SparkAdapterSupport with HoodieFormatTrait {
 
   def getRequiredFilters: Seq[Filter] = requiredFilters
 
+  private val sanitizedTableName = AvroSchemaUtils.getAvroRecordQualifiedName(tableName)
+
   /**
* Support batch needs to remain consistent, even if one side of a bootstrap merge can support
* while the other side can't
*/
-  private var supportBatchCalled = false
-  private var supportBatchResult = false
+  /*
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
+    if (!supportBatchCalled || supportBatchResult) {
+      supportBatchCalled = true
+      supportBatchResult = tableSchema.internalSchema.isEmpty && !isMOR && !isIncremental && !isBootstrap && super.supportBatch(sparkSession, schema)
+    }
+    supportBatchResult
+  }
+  */
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = false

Review Comment:
   Are these still needed? If not, remove the lines?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala:
##
@@ -80,16 +79,17 @@ class TestHoodieFileGroupReaderOnSpark extends 
TestHoodieFileGroupReaderBase[Int
 tempDir.toAbsolutePath.toUri.toString
   }
 
-  override def getHoodieReaderContext(tablePath: String, avroSchema: Schema): HoodieReaderContext[InternalRow] = {
-    val parquetFileFormat = new ParquetFileFormat
-    val structTypeSchema = AvroConversionUtils.convertAvroSchemaToStructType(avroSchema)
-
-    val recordReaderIterator = parquetFileFormat.buildReaderWithPartitionValues(
-      spark, structTypeSchema, StructType(Seq.empty), structTypeSchema, Seq.empty, Map.empty, getHadoopConf)
-
-    val m = scala.collection.mutable.Map[Long, PartitionedFile => Iterator[InternalRow]]()
-    m.put(2*avroSchema.hashCode(), recordReaderIterator)
-    new SparkFileFormatInternalRowReaderContext(m)
+  override def getHoodieReaderContext(tablePath: String, avroSchema: Schema, hadoopConf: Configuration): HoodieReaderContext[InternalRow] = {
+    val reader = sparkAdapter.createParquetFileReader(vectorized = false, spark.sessionState.conf, Map.empty, hadoopConf)
+    val metaClient = HoodieTableMetaClient.builder().setConf(getHadoopConf).setBasePath(tablePath).build
+    val recordKeyField = if (metaClient.getTableConfig.populateMetaFields()) {
+      HoodieRecord.RECORD_KEY_METADATA_FIELD
+    } else {
+      val keyFields = metaClient.getTableConfig.getRecordKeyFields.get()
+      checkState(keyFields.length == 1)
+      keyFields.head
+    }

Review Comment:
   Could this be wrapped into a util method? I think virtual key support may have already added such a method.
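
   For reference, a minimal sketch of what such a util could look like. The class name `RecordKeyUtils` and its placement are hypothetical; the helper that virtual key support added may differ:

```java
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.table.HoodieTableConfig;
import org.apache.hudi.common.util.ValidationUtils;

public final class RecordKeyUtils {
  // Resolve the single record key field of a table: use the meta column when
  // meta fields are populated, otherwise fall back to the configured key field.
  public static String getRecordKeyField(HoodieTableConfig tableConfig) {
    if (tableConfig.populateMetaFields()) {
      return HoodieRecord.RECORD_KEY_METADATA_FIELD;
    }
    String[] keyFields = tableConfig.getRecordKeyFields().get();
    ValidationUtils.checkState(keyFields.length == 1, "Expected exactly one record key field");
    return keyFields[0];
  }
}
```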



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -107,19 +112,23 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
     val dataSchema = StructType(tableSchema.structTypeSchema.fields.filterNot(f => partitionColumns.contains(f.name)))
     val outputSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
     spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult)
-    val requiredSchemaWithMandatory = generateRequiredSchemaWithMandatory(requiredSchema, dataSchema, partitionSchema)
-    val isCount = requiredSchemaWithMandatory.isEmpty
-    val requiredSchemaSplits = requiredSchemaWithMandatory.fields.partition(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name))
-    val requiredMeta = StructType(requiredSchemaSplits._1)
-    val requiredWithoutMeta = StructType(requiredSchemaSplits._2)
+    val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental
     val augmentedHadoopConf = FSUtils.buildInlineConf(hadoopConf)
-    val (baseFileReader, preMergeBaseFileReader, readerMaps, cdcFileReader) = buildFileReaders(
-      spark, dataSchema, partitionSchema, requiredSchema, filters, options, augmentedHadoopConf,
-      requ

Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on PR #11042:
URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062796467

   @wombatu-kun Did you have any chance to test the HUDI CLI locally with your 
patch?





Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1569738186


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException {
 this.itr = RecordIterators.getParquetRecordIterator(
 internalSchemaManager,
 utcTimestamp,
-true,
+caseSensetive,

Review Comment:
   As mentioned above, please give an illustrative example here.
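
   To make the ask concrete, here is a self-contained sketch (not the actual clipping code; the field names are made up) of how the flag changes which Parquet fields a query resolves:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaClipExample {
  // Select the file-schema fields requested by the query, optionally ignoring case.
  static List<String> clip(List<String> fileFields, List<String> requested, boolean caseSensitive) {
    List<String> clipped = new ArrayList<>();
    for (String name : requested) {
      for (String field : fileFields) {
        if (caseSensitive ? field.equals(name) : field.equalsIgnoreCase(name)) {
          clipped.add(field);
        }
      }
    }
    return clipped;
  }

  public static void main(String[] args) {
    List<String> fileFields = List.of("UserId", "UserName");
    // A lower-case query only resolves the field when matching is case-insensitive.
    System.out.println(clip(fileFields, List.of("userid"), true));  // []
    System.out.println(clip(fileFields, List.of("userid"), false)); // [UserId]
  }
}
```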






Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on code in PR #11008:
URL: https://github.com/apache/hudi/pull/11008#discussion_r1569737390


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java:
##
@@ -89,10 +89,13 @@ protected List<SmallFile> getSmallFiles(String partitionPath) {
   private List<FileSlice> getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) {
     // If we can index log files, we can add more inserts to log files for fileIds NOT including those under
     // pending compaction
+    Comparator<FileSlice> comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice))
+        .thenComparing(FileSlice::getFileId);
     if (table.getIndex().canIndexLogFiles()) {
       return table.getSliceView()
           .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
           .filter(this::isSmallFile)
+          .sorted(comparator)
           .collect(Collectors.toList());

Review Comment:
   Fine, just fix all the test failures and let's see what use cases are 
affected.
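
   As a self-contained illustration of the tie-break (simplified types, not the partitioner code): sorting by size alone leaves equal-sized slices in input order, while the file-ID tie-break makes the result deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DeterministicOrderExample {
  record Slice(String fileId, long totalSize) {}

  public static void main(String[] args) {
    List<Slice> slices = new ArrayList<>(List.of(
        new Slice("b-file", 100), new Slice("a-file", 100), new Slice("c-file", 50)));
    // Equal sizes fall back to fileId, so the order no longer depends on listing order.
    slices.sort(Comparator.comparingLong(Slice::totalSize).thenComparing(Slice::fileId));
    slices.forEach(s -> System.out.println(s.fileId() + " " + s.totalSize()));
    // Prints: c-file 50, a-file 100, b-file 100 -- stable regardless of input order.
  }
}
```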






Re: [I] [SUPPORT] Order rows with same key before precombine [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on issue #11041:
URL: https://github.com/apache/hudi/issues/11041#issuecomment-2062788935

   Currently only the internal `HFile` format keeps payloads sorted within a file; for Parquet files in the dataset table, the merge would break the ordering anyway. Take https://github.com/apache/hudi/blob/6c6bddcef3ec383b08eb10f10ab0400f4edc41f4/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java#L54 as a reference.
   
   I guess you might want to make the sorting configurable.
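
   A hypothetical shape for such a knob, using Hudi's `ConfigProperty` builder (the key name below is made up, not an existing option):

```java
import org.apache.hudi.common.config.ConfigProperty;

public class MergeSortConfigSketch {
  // Hypothetical property gating payload sorting in the merge handle.
  public static final ConfigProperty<Boolean> SORT_MERGED_RECORDS = ConfigProperty
      .key("hoodie.write.merge.sort.enabled")
      .defaultValue(false)
      .withDocumentation("When enabled, sort records by key while merging base and log files.");
}
```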





Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]

2024-04-17 Thread via GitHub


danny0405 commented on issue #10914:
URL: https://github.com/apache/hudi/issues/10914#issuecomment-2062778638

   > Would the checkpoint of job 3 contain all index information retrieved from the index bootstrap process in job 2? Asking this as I noticed a significant size difference between the checkpoints of job 2 and job 3 (500GB in job 2 vs < 50GB in job 3).
   
   Yes, one successful checkpoint indicates the bootstrap has finished.
   
   > If job 3 fails and I need to start a job 4 using job 3's latest checkpoint, do I need to have index bootstrap enabled?
   
   No need to do that.
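
   For context, index bootstrap is toggled per job through `FlinkOptions.INDEX_BOOTSTRAP_ENABLED` (`index.bootstrap.enabled`); a sketch of the two phases (the table path is a placeholder):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

public class BootstrapTogglesSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set(FlinkOptions.PATH, "file:///tmp/hudi/t1"); // placeholder path

    // Job 2: first streaming run against an existing table; load the
    // existing record index into Flink state once.
    conf.set(FlinkOptions.INDEX_BOOTSTRAP_ENABLED, true);

    // Job 3/4: restoring from a checkpoint that already holds the index,
    // so the bootstrap can stay disabled.
    conf.set(FlinkOptions.INDEX_BOOTSTRAP_ENABLED, false);
  }
}
```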
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-17 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1569721067


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -138,6 +138,7 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
       spark.sessionState.catalog.dropTable(TableIdentifier(tableName), true, true)
       spark.sessionState.catalog.refreshTable(TableIdentifier(tableName))
       spark.sessionState.conf.unsetConf(DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION.key)
+      spark.sessionState.conf.unsetConf("spark.sql.storeAssignmentPolicy")

Review Comment:
   What is this used for?



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionBase.java:
##
@@ -156,7 +156,7 @@ protected HoodieDeltaStreamer.Config 
getDeltaStreamerConfig(String[] transformer
 
   protected HoodieDeltaStreamer.Config getDeltaStreamerConfig(String[] transformerClasses, boolean nullForDeletedCols,
                                                               TypedProperties extraProps) throws IOException {
-    extraProps.setProperty(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false");
+    extraProps.setProperty(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "true");

Review Comment:
   Completely remove this line, as the file group reader is turned on by default?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -706,6 +709,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
   }
 
   test("Test schema auto evolution") {
+    // This test will be flaky for MOR until [HUDI-6798] is landed and we can set the merge mode

Review Comment:
   How did the test pass for MOR before?






[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7633:

Status: In Progress  (was: Open)

> Use try with resources for AutoCloseable
> 
>
> Key: HUDI-7633
> URL: https://issues.apache.org/jira/browse/HUDI-7633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
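
A generic example of the pattern (the file-reading code is illustrative, not a specific Hudi call site):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TryWithResourcesExample {
  public static String readFirstLine(String path) throws IOException {
    // BufferedReader is AutoCloseable: try-with-resources closes it on every
    // exit path, including exceptions, replacing manual try/finally blocks.
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
      return reader.readLine();
    }
  }
}
```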






[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7633:

Status: Patch Available  (was: In Progress)

> Use try with resources for AutoCloseable
> 
>
> Key: HUDI-7633
> URL: https://issues.apache.org/jira/browse/HUDI-7633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7636) Make StoragePath Serializable

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7636:

Labels: hoodie-storage pull-request-available  (was: pull-request-available)

> Make StoragePath Serializable
> -
>
> Key: HUDI-7636
> URL: https://issues.apache.org/jira/browse/HUDI-7636
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
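
A minimal sketch of the idea (simplified; the real `StoragePath` holds more state):

```java
import java.io.Serializable;

// A serializable path wrapper: the URI string is the serialized state, and
// derived values are transient so they are recomputed after deserialization.
public class StoragePathSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String uri;
  private transient volatile String cachedName;

  public StoragePathSketch(String uri) {
    this.uri = uri;
  }

  public String getName() {
    if (cachedName == null) {
      cachedName = uri.substring(uri.lastIndexOf('/') + 1);
    }
    return cachedName;
  }
}
```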






[jira] [Assigned] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7635:
---

Assignee: Ethan Guo

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
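
A sketch of the shape these additions could take (names and signatures are guesses based on the ticket title, not the merged API):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical additions to the storage abstraction.
interface SeekableStorageSketch {
  // Default block size hint of the file system backing the given path.
  long getDefaultBlockSize(String path);

  // Open a stream that supports seek(), e.g. for reading log-file blocks
  // at recorded offsets.
  SeekableStream openSeekable(String path) throws IOException;

  abstract class SeekableStream extends InputStream {
    public abstract void seek(long pos) throws IOException;
  }
}
```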






[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Fix Version/s: 0.15.0

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Status: In Progress  (was: Open)

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7634:

Status: In Progress  (was: Open)

> Rename HoodieStorage APIs
> -
>
> Key: HUDI-7634
> URL: https://issues.apache.org/jira/browse/HUDI-7634
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> getHoodieStorage -> getStorage
> listDirectEntries -> listDirectory
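
At a call site, the rename reads as follows (illustrative snippet; `metaClient` and `tablePath` are placeholders for any `HoodieTableMetaClient` handle and listing target):

```java
// Before:
//   HoodieStorage storage = metaClient.getHoodieStorage();
//   List<StoragePathInfo> entries = storage.listDirectEntries(tablePath);

// After:
HoodieStorage storage = metaClient.getStorage();
List<StoragePathInfo> entries = storage.listDirectory(tablePath);
```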





[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Labels: hoodie-storage pull-request-available  (was: pull-request-available)

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>





