Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
danny0405 commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1570022625 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException { + public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException { Review Comment: I think people may be confused by the two methods `listFiles` and `listDirectory`: the former lists the given path and returns all the file paths contained underneath it, while the latter lists the **directory**, which means the given path must be a directory (do we have validation for that?). The naming conventions of these two are different. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]
yihua commented on code in PR #11050: URL: https://github.com/apache/hudi/pull/11050#discussion_r1570007033 ## hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java: ## @@ -109,6 +109,11 @@ public long getModificationTime() { return modificationTime; } + @Override + public int compareTo(StoragePathInfo o) { +return this.getPath().compareTo(o.getPath()); Review Comment: There is no need for a null check; the path should always exist. We shouldn't have a case of multiple path info instances with the same path in the same file listing call, so this should be OK. And this is only used in the Hudi CLI for sorting purposes.
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
yihua commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1570003488 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException { + public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException { Review Comment: The renaming is based on the suggestion: https://github.com/apache/hudi/pull/10591#discussion_r1484908969. I think we don't have to conform to the naming in Hadoop, and I feel `listDirectory` is a better name.
Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]
yihua commented on code in PR #10352: URL: https://github.com/apache/hudi/pull/10352#discussion_r1569988592 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -651,6 +652,36 @@ public static Stream createColumnStatsRecords(String partitionName }); } + public static Stream createPartitionStatsRecords(String partitionPath, + Collection> columnRangeMetadataList, + boolean isDeleted) { +return columnRangeMetadataList.stream().map(columnRangeMetadata -> { + HoodieKey key = new HoodieKey(getPartitionStatsIndexKey(partitionPath, columnRangeMetadata), + MetadataPartitionType.PARTITION_STATS.getPartitionPath()); + + HoodieMetadataPayload payload = new HoodieMetadataPayload(key.getRecordKey(), + HoodieMetadataColumnStats.newBuilder() + .setFileName(null) + .setColumnName(columnRangeMetadata.getColumnName()) + .setMinValue(wrapValueIntoAvro(columnRangeMetadata.getMinValue())) + .setMaxValue(wrapValueIntoAvro(columnRangeMetadata.getMaxValue())) + .setNullCount(columnRangeMetadata.getNullCount()) + .setValueCount(columnRangeMetadata.getValueCount()) + .setTotalSize(columnRangeMetadata.getTotalSize()) + .setTotalUncompressedSize(columnRangeMetadata.getTotalUncompressedSize()) + .setIsDeleted(isDeleted) + .build()); + + return new HoodieAvroRecord<>(key, payload); +}); + } + + public static String getPartitionStatsIndexKey(String partitionPath, HoodieColumnRangeMetadata columnRangeMetadata) { Review Comment: nit: have the column name as the method argument directly to avoid confusion? 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -351,6 +356,12 @@ case class HoodieFileIndex(spark: SparkSession, Option.empty } else if (recordKeys.nonEmpty) { Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys)) +} else if (partitionStatsIndex.isIndexAvailable && queryFilters.nonEmpty) { + val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices) + val shouldReadInMemory = partitionStatsIndex.shouldReadInMemory(this, queryReferencedColumns) + partitionStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory) { transposedColStatsDF => Review Comment: Could we avoid transposing here as it has overhead and directly leverage the index entry for filtering?
Re: [I] [SUPPORT] can't retrieve original partition column value when extracting date with CustomKeyGenerator [hudi]
liangchen-datanerd commented on issue #11002: URL: https://github.com/apache/hudi/issues/11002#issuecomment-2063009955 @ad1happy2go In the IoT scenario I've been working on, the event time is adopted as the partition column. At the same time, we query data based on the original event-time timestamp, not the transformed partition path. Implementing this feature would be a great help. Should I close this issue or leave it open?
Re: [PR] [HUDI-7144] Build storage partition stats index and use it for data skipping [hudi]
yihua commented on code in PR #10352: URL: https://github.com/apache/hudi/pull/10352#discussion_r1569972641 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -224,6 +224,10 @@ private void enablePartitions() { if (dataMetaClient.getFunctionalIndexMetadata().isPresent()) { this.enabledPartitionTypes.add(FUNCTIONAL_INDEX); } +if ((metadataConfig.isPartitionStatsIndexEnabled() && !metadataConfig.getColumnsEnabledForColumnStatsIndex().isEmpty()) +|| dataMetaClient.getTableConfig().isMetadataPartitionAvailable(PARTITION_STATS)) { + this.enabledPartitionTypes.add(PARTITION_STATS); Review Comment: Can this part be generalized too, i.e., relying on the defined partition type enums to automatically check the corresponding write config and enable the index, instead of modifying the method manually? ## hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java: ## @@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) { throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet."); } + /** + * Aggregate column range statistics across files in a partition. + * + * @param fileRanges List of column range statistics for each file in a partition + */ + public static > HoodieColumnRangeMetadata getColumnRangeInPartition(@Nonnull List> fileRanges) { +if (fileRanges.size() == 1) { + // Only one parquet file, we can just return that range. + return fileRanges.get(0); +} +// There are multiple files. 
Compute min(file_mins) and max(file_maxs) +return fileRanges.stream() +.sequential() +.reduce(BaseFileUtils::mergeRanges).get(); + } + + private static > HoodieColumnRangeMetadata mergeRanges(HoodieColumnRangeMetadata one, Review Comment: ```suggestion private static > HoodieColumnRangeMetadata mergeRanges(HoodieColumnRangeMetadata one, ``` ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -1580,69 +1595,4 @@ public boolean isInitialized() { } protected abstract BaseHoodieWriteClient initializeWriteClient(); - - /** - * A class which represents a directory and the files and directories inside it. - * - * A {@code PartitionFileInfo} object saves the name of the partition and various properties requires of each file - * required for initializing the metadata table. Saving limited properties reduces the total memory footprint when - * a very large number of files are present in the dataset being initialized. - */ Review Comment: So this is no longer needed? ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java: ## @@ -557,7 +558,8 @@ public void testDowngradeSixToFiveShouldDeleteRecordIndexPartition() throws Exce PARTITION_NAME_COLUMN_STATS, PARTITION_NAME_BLOOM_FILTERS, PARTITION_NAME_RECORD_INDEX, -PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX +PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX, Review Comment: Should this list be generated from the production code, i.e., list of supported MDT partitions? Also, do we need to upgrade the table version? I think master branch is still on table version 6, the same as 0.14.0 release. ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -104,13 +104,14 @@ public class HoodieMetadataPayload implements HoodieRecordPayload - * has a bug - https://issues.apache.org/jira/browse/AVRO-1810 + * has a bug - https://issues.apache.org/jira/browse/AVRO-1810
*/ - protected static final int METADATA_TYPE_PARTITION_LIST = 1; - protected static final int METADATA_TYPE_FILE_LIST = 2; - protected static final int METADATA_TYPE_COLUMN_STATS = 3; - protected static final int METADATA_TYPE_BLOOM_FILTER = 4; + private static final int METADATA_TYPE_PARTITION_LIST = 1; + private static final int METADATA_TYPE_FILE_LIST = 2; + private static final int METADATA_TYPE_COLUMN_STATS = 3; + private static final int METADATA_TYPE_BLOOM_FILTER = 4; private static final int METADATA_TYPE_RECORD_INDEX = 5; + private static final int METADATA_TYPE_PARTITION_STATS = 6; Review Comment: Maybe we should add enum and guarantee the ordering, and automatically assign the type ID? ## hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java: ## @@ -67,6 +70,50 @@ public static BaseFileUtils getInstance(HoodieFileFormat fileFormat) { throw new UnsupportedOperationException(fileFormat.name() + " format not supported yet."); } + /** + * Aggregate column range statistics across files in a partition. + * + * @param fileRanges List of column range statistics for each file in a partition + */ + public
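The `getColumnRangeInPartition` / `mergeRanges` logic reviewed above folds per-file column ranges into a single partition-level range: min of the file minimums, max of the file maximums, and summed counters. A minimal, hypothetical sketch of the idea (the simplified `ColumnRange` type stands in for Hudi's `HoodieColumnRangeMetadata`; this is not the actual implementation):

```java
import java.util.List;

public class PartitionStats {
  // Simplified per-file column range: min/max plus an additive counter.
  static final class ColumnRange {
    final String column;
    final long min;
    final long max;
    final long nullCount;

    ColumnRange(String column, long min, long max, long nullCount) {
      this.column = column;
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }
  }

  // Merge two file-level ranges: min(mins), max(maxs), sum of counters.
  static ColumnRange mergeRanges(ColumnRange a, ColumnRange b) {
    return new ColumnRange(a.column,
        Math.min(a.min, b.min),
        Math.max(a.max, b.max),
        a.nullCount + b.nullCount);
  }

  // Fold all file ranges in a partition into one partition-level range.
  static ColumnRange getColumnRangeInPartition(List<ColumnRange> fileRanges) {
    if (fileRanges.size() == 1) {
      // Only one file: its range already is the partition range.
      return fileRanges.get(0);
    }
    return fileRanges.stream().reduce(PartitionStats::mergeRanges).get();
  }
}
```

The single-file shortcut mirrors the reviewed code: with one file there is nothing to reduce, so the file's range is returned directly.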
Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]
hudi-bot commented on PR #11032: URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062998871 ## CI report: * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23339) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
(hudi) branch master updated (999007b39d5 -> b3773657f69)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 999007b39d5 [HUDI-7636] Make StoragePath Serializable (#11049) add b3773657f69 [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage (#11048) No new revisions were added by this update. Summary of changes: .../java/org/apache/hudi/common/fs/FSUtils.java| 19 - .../hudi/common/table/log/HoodieLogFileReader.java | 75 +- .../org/apache/hudi/hadoop/fs/HadoopFSUtils.java | 90 ++ .../hudi/storage/hadoop/HoodieHadoopStorage.java | 13 .../org/apache/hudi/storage/HoodieStorage.java | 30 .../hudi/io/storage/TestHoodieStorageBase.java | 43 +++ 6 files changed, 179 insertions(+), 91 deletions(-)
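The commit above adds `openSeekable` APIs to `HoodieStorage` for log-file reading. As a rough illustration of what a seekable input enables, the sketch below does a positioned read with plain `RandomAccessFile` (an assumed stand-in, not Hudi's actual `HoodieStorage` API):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class SeekableDemo {
  // Read `len` bytes starting at `offset` without scanning the prefix first --
  // the capability a seekable input stream gives a log-block reader.
  static byte[] readAt(Path file, long offset, int len) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      raf.seek(offset);          // jump directly to the region of interest
      byte[] buf = new byte[len];
      raf.readFully(buf);        // read exactly `len` bytes or fail
      return buf;
    }
  }
}
```

A reader that knows block offsets (as Hudi's log format does) can use such an API to skip straight to a block instead of streaming from the start of the file.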
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
codope merged PR #11048: URL: https://github.com/apache/hudi/pull/11048
Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]
codope commented on code in PR #11050: URL: https://github.com/apache/hudi/pull/11050#discussion_r1569960431 ## hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java: ## @@ -109,6 +109,11 @@ public long getModificationTime() { return modificationTime; } + @Override + public int compareTo(StoragePathInfo o) { +return this.getPath().compareTo(o.getPath()); Review Comment: Practically, would it ever hit a scenario where `this.getPath() == o.getPath()`? If so, do we need to break ties? If not, should we add validation and throw an IllegalStateException?
Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]
codope commented on code in PR #11050: URL: https://github.com/apache/hudi/pull/11050#discussion_r1569961073 ## hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java: ## @@ -109,6 +109,11 @@ public long getModificationTime() { return modificationTime; } + @Override + public int compareTo(StoragePathInfo o) { +return this.getPath().compareTo(o.getPath()); Review Comment: Also, is there a need for a null check?
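Putting the thread together: `compareTo` delegates to the path, with no null check and no tie-breaking, because a path info always carries a path and paths are unique within one listing. A simplified, hypothetical sketch of that design (these stand-ins are not the actual Hudi classes):

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class PathInfoDemo {
  // Stand-in for Hudi's StoragePath: ordering is the URI string ordering.
  static final class StoragePath implements Comparable<StoragePath> {
    final String uri;
    StoragePath(String uri) { this.uri = uri; }
    @Override public int compareTo(StoragePath o) { return uri.compareTo(o.uri); }
  }

  // Stand-in for StoragePathInfo: comparison delegates to the path.
  static final class StoragePathInfo implements Comparable<StoragePathInfo> {
    final StoragePath path;
    StoragePathInfo(StoragePath path) { this.path = path; }
    StoragePath getPath() { return path; }
    @Override public int compareTo(StoragePathInfo o) {
      // No null check: an info object is always constructed with a non-null path,
      // and paths are assumed unique within a single listing, so no tie-breaker.
      return getPath().compareTo(o.getPath());
    }
  }

  // Sort listing entries by path, the way a CLI would for display.
  static List<String> sortedUris(List<StoragePathInfo> infos) {
    Collections.sort(infos);
    return infos.stream().map(i -> i.path.uri).collect(Collectors.toList());
  }
}
```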
(hudi) branch master updated (df5cdb7f290 -> 999007b39d5)
This is an automated email from the ASF dual-hosted git repository. codope pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from df5cdb7f290 [MINOR] Remove redundant TestStringUtils in hudi-common (#11046) add 999007b39d5 [HUDI-7636] Make StoragePath Serializable (#11049) No new revisions were added by this update. Summary of changes: .../java/org/apache/hudi/storage/StoragePath.java | 14 +-- .../apache/hudi/io/storage/TestStoragePath.java| 28 +- 2 files changed, 39 insertions(+), 3 deletions(-)
Re: [PR] [HUDI-7636] Make StoragePath Serializable [hudi]
codope merged PR #11049: URL: https://github.com/apache/hudi/pull/11049
(hudi) branch master updated: [MINOR] Remove redundant TestStringUtils in hudi-common (#11046)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new df5cdb7f290 [MINOR] Remove redundant TestStringUtils in hudi-common (#11046) df5cdb7f290 is described below commit df5cdb7f290835c7d3c72debea39a17cb5d66e2c Author: Y Ethan Guo AuthorDate: Wed Apr 17 21:34:06 2024 -0700 [MINOR] Remove redundant TestStringUtils in hudi-common (#11046) --- .../apache/hudi/common/util/TestStringUtils.java | 124 - 1 file changed, 124 deletions(-) diff --git a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java deleted file mode 100644 index 54985056bf0..000 --- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java +++ /dev/null @@ -1,124 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.hudi.common.util; - -import org.junit.jupiter.api.Test; - -import java.nio.ByteBuffer; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; - -import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes; -import static org.junit.jupiter.api.Assertions.assertEquals; -import static org.junit.jupiter.api.Assertions.assertNotEquals; -import static org.junit.jupiter.api.Assertions.assertNull; -import static org.junit.jupiter.api.Assertions.assertTrue; - -/** - * Tests {@link StringUtils}. - */ -public class TestStringUtils { - - private static final String[] STRINGS = {"This", "is", "a", "test"}; - - @Test - public void testStringJoinWithDelim() { -String joinedString = StringUtils.joinUsingDelim("-", STRINGS); -assertEquals(STRINGS.length, joinedString.split("-").length); - } - - @Test - public void testStringJoin() { -assertNotEquals(null, StringUtils.join("")); -assertNotEquals(null, StringUtils.join(STRINGS)); - } - - @Test - public void testStringJoinWithJavaImpl() { -assertNull(StringUtils.join(",", null)); -assertEquals("", String.join(",", Collections.singletonList(""))); -assertEquals(",", String.join(",", Arrays.asList("", ""))); -assertEquals("a,", String.join(",", Arrays.asList("a", ""))); - } - - @Test - public void testStringNullToEmpty() { -String str = "This is a test"; -assertEquals(str, StringUtils.nullToEmpty(str)); -assertEquals("", StringUtils.nullToEmpty(null)); - } - - @Test - public void testStringObjToString() { -assertNull(StringUtils.objToString(null)); -assertEquals("Test String", StringUtils.objToString("Test String")); - -// assert byte buffer -ByteBuffer byteBuffer1 = ByteBuffer.wrap(getUTF8Bytes("1234")); -ByteBuffer byteBuffer2 = ByteBuffer.wrap(getUTF8Bytes("5678")); -// assert equal because ByteBuffer has overwritten the toString to return a summary string -assertEquals(byteBuffer1.toString(), byteBuffer2.toString()); -// assert not equal 
-assertNotEquals(StringUtils.objToString(byteBuffer1), StringUtils.objToString(byteBuffer2)); - } - - @Test - public void testStringEmptyToNull() { -assertNull(StringUtils.emptyToNull("")); -assertEquals("Test String", StringUtils.emptyToNull("Test String")); - } - - @Test - public void testStringNullOrEmpty() { -assertTrue(StringUtils.isNullOrEmpty(null)); -assertTrue(StringUtils.isNullOrEmpty("")); -assertNotEquals(null, StringUtils.isNullOrEmpty("this is not empty")); -assertTrue(StringUtils.isNullOrEmpty("")); - } - - @Test - public void testSplit() { -assertEquals(new ArrayList<>(), StringUtils.split(null, ",")); -assertEquals(new ArrayList<>(), StringUtils.split("", ",")); -assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b, c", ",")); -assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b,, c ", ",")); - } - - @Test - public void testHexString() { -String str = "abcd"; -assertEquals(StringUtils.toHexString(getUTF8Bytes(str)), toHexString(getUTF8Bytes(str))); - } - - private static String toHexString(byte[
Re: [PR] [MINOR] Remove redundant TestStringUtils in hudi-common module [hudi]
codope merged PR #11046: URL: https://github.com/apache/hudi/pull/11046
(hudi) branch master updated: [HUDI-7633] Use try with resources for AutoCloseable (#11045)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 78fc62554ce [HUDI-7633] Use try with resources for AutoCloseable (#11045) 78fc62554ce is described below commit 78fc62554ce3798eb803332b42f84f9cfa74e526 Author: Y Ethan Guo AuthorDate: Wed Apr 17 21:31:44 2024 -0700 [HUDI-7633] Use try with resources for AutoCloseable (#11045) --- .../hudi/cli/commands/ArchivedCommitsCommand.java | 104 .../apache/hudi/cli/commands/ExportCommand.java| 93 +++--- .../hudi/cli/commands/HoodieLogFileCommand.java| 104 .../org/apache/hudi/cli/commands/TableCommand.java | 6 +- .../metadata/HoodieBackedTableMetadataWriter.java | 8 +- .../hudi/common/model/HoodiePartitionMetadata.java | 8 +- .../hudi/common/table/log/LogReaderUtils.java | 22 ++-- .../table/log/block/HoodieAvroDataBlock.java | 135 ++--- .../hudi/common/util/SerializationUtils.java | 6 +- .../hudi/metadata/HoodieBackedTableMetadata.java | 24 ++-- .../java/HoodieJavaWriteClientExample.java | 70 +-- .../examples/spark/HoodieWriteClientExample.java | 90 +++--- .../org/apache/hudi/common/util/FileIOUtils.java | 14 +-- .../hudi/utilities/HoodieCompactionAdminTool.java | 9 +- .../utilities/streamer/SchedulerConfGenerator.java | 6 +- 15 files changed, 344 insertions(+), 355 deletions(-) diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java index 075a57d541c..5c57c8f5288 100644 --- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java +++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java @@ -114,47 +114,46 @@ public class ArchivedCommitsCommand { List allStats = new ArrayList<>(); for (FileStatus fs : fsStatuses) { // read the archived file - Reader reader = 
HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf), - new HoodieLogFile(fs.getPath()), HoodieArchivedMetaEntry.getClassSchema()); - - List readRecords = new ArrayList<>(); - // read the avro blocks - while (reader.hasNext()) { -HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next(); -blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> readRecords.add((IndexedRecord) r.getData())); + try (Reader reader = HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf), + new HoodieLogFile(fs.getPath()), HoodieArchivedMetaEntry.getClassSchema())) { +List readRecords = new ArrayList<>(); +// read the avro blocks +while (reader.hasNext()) { + HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next(); + blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> readRecords.add((IndexedRecord) r.getData())); +} +List readCommits = readRecords.stream().map(r -> (GenericRecord) r) +.filter(r -> r.get("actionType").toString().equals(HoodieTimeline.COMMIT_ACTION) +|| r.get("actionType").toString().equals(HoodieTimeline.DELTA_COMMIT_ACTION)) +.flatMap(r -> { + HoodieCommitMetadata metadata = (HoodieCommitMetadata) SpecificData.get() + .deepCopy(HoodieCommitMetadata.SCHEMA$, r.get("hoodieCommitMetadata")); + final String instantTime = r.get("commitTime").toString(); + final String action = r.get("actionType").toString(); + return metadata.getPartitionToWriteStats().values().stream().flatMap(hoodieWriteStats -> hoodieWriteStats.stream().map(hoodieWriteStat -> { +List row = new ArrayList<>(); +row.add(action); +row.add(instantTime); +row.add(hoodieWriteStat.getPartitionPath()); +row.add(hoodieWriteStat.getFileId()); +row.add(hoodieWriteStat.getPrevCommit()); +row.add(hoodieWriteStat.getNumWrites()); +row.add(hoodieWriteStat.getNumInserts()); +row.add(hoodieWriteStat.getNumDeletes()); +row.add(hoodieWriteStat.getNumUpdateWrites()); +row.add(hoodieWriteStat.getTotalLogFiles()); 
+row.add(hoodieWriteStat.getTotalLogBlocks()); +row.add(hoodieWriteStat.getTotalCorruptLogBlock()); +row.add(hoodieWriteStat.getTotalRollbackBlocks()); +row.add(hoodieWriteStat.getTotalLogRecords()); +row.add(hoodieWriteStat.getTotalUpdatedRecordsCompacted()); +row.add(hoodieWriteStat.getTotalWriteBytes()); +row.add(hoodieWriteStat.getTo
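The diff above converts manually managed readers into try-with-resources blocks. The pattern in miniature (a hypothetical example, not Hudi code): any `AutoCloseable` declared in the resource clause is closed on every exit path, including exceptions and early returns, which is exactly what the hand-written cleanup was missing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TryWithResourcesDemo {
  // Returns the first line of `content`. The BufferedReader is closed
  // automatically when the try block exits, whether normally or by exception.
  static String firstLine(String content) {
    try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
      return reader.readLine(); // early return still triggers reader.close()
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
```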
Re: [PR] [HUDI-7633] Use try with resources for AutoCloseable [hudi]
codope merged PR #11045: URL: https://github.com/apache/hudi/pull/11045
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
hudi-bot commented on PR #11048: URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062959240 ## CI report: * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN * b3e4c349009dd1cba0181536563f51bdb8c4eb11 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23338) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
danny0405 commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569903965 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException { + public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException { Review Comment: Then maybe the name `listPathInfo` sounds more proper? Because the `listXXX` API indicates that `XXX` is actually the item type of the return value.
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
yihua commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569884258 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException { + public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException { Review Comment: The `FileSystem#listStatus` API: https://github.com/apache/hadoop/blob/branch-2.10.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1738 ``` /** * List the statuses of the files/directories in the given path if the path is * a directory. * * Does not guarantee to return the List of files/directories status in a * sorted order. * @param f given path * @return the statuses of the files/directories in the given patch * @throws FileNotFoundException when the path does not exist * @throws IOException see specific implementation */ public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException; ```
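The naming debate above hinges on two distinct listing semantics: a recursive listing of all files underneath a path, versus a listing of a directory's direct children (the Hadoop `listStatus` contract quoted here). A hypothetical sketch of the distinction using `java.nio.file` (the method names mirror the discussion; this is not Hudi's `HoodieStorage` implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingDemo {
  // Recursive semantics (cf. `listFiles`): every regular file anywhere under
  // the path, relative to the root, directories themselves excluded.
  static List<String> listFiles(Path root) throws IOException {
    try (Stream<Path> s = Files.walk(root)) {
      return s.filter(Files::isRegularFile)
          .map(p -> root.relativize(p).toString().replace('\\', '/'))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  // Direct-children semantics (cf. Hadoop's listStatus / the renamed
  // listDirectEntries): files and subdirectories one level down; the given
  // path must itself be a directory, or Files.list throws.
  static List<String> listDirectEntries(Path dir) throws IOException {
    try (Stream<Path> s = Files.list(dir)) {
      return s.map(p -> dir.relativize(p).toString().replace('\\', '/'))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

Note that the direct listing reports `part` (a directory) as an entry, while the recursive listing reports only the file `part/f1` inside it, which is the contrast the reviewers are naming.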
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
hudi-bot commented on PR #11048: URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062919526 ## CI report: * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331) * b3e4c349009dd1cba0181536563f51bdb8c4eb11 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23338) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]
hudi-bot commented on PR #11031: URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062919442 ## CI report: * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN * a7270a970988e18ba0997ee59b74678c2dc720ed Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23336) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]
hudi-bot commented on PR #11032: URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062919460 ## CI report: * 254fbf794c65c5d54251f388ec7ea8fdbae29d03 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23303) * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23339) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
danny0405 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1569877853 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: > this is not user behavior.This is a potential requirement with change But it seems the DDL and DML handle the case-sensitivity in different ways, which is not plausible here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
danny0405 commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569876744 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List listDirectEntries(StoragePath path) throws IOException { + public List listDirectory(StoragePath path) throws IOException { Review Comment: What is the Hadoop API that we can reference? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
hudi-bot commented on PR #11048: URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062914281 ## CI report: * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331) * b3e4c349009dd1cba0181536563f51bdb8c4eb11 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7623] Refactoring of RemoteHoodieTableFileSystemView and RequestHandler [hudi]
hudi-bot commented on PR #11032: URL: https://github.com/apache/hudi/pull/11032#issuecomment-2062914221 ## CI report: * 254fbf794c65c5d54251f388ec7ea8fdbae29d03 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23303) * 0e9a6ddd0123ac7892774a53dab39fe4947d6019 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
yihua commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569866172 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException { } } + /** + * Opens an SeekableDataInputStream at the indicated path with seeks supported. + * + * @param path the file to open. + * @return the InputStream to read from. + * @throws IOException IO error. + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + public SeekableDataInputStream openSeekable(StoragePath path) throws IOException { Review Comment: I think this one can still be overridden if the particular storage implementation provides optimization on this one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
jonvex commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569864095 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException { } } + /** + * Opens an SeekableDataInputStream at the indicated path with seeks supported. + * + * @param path the file to open. + * @return the InputStream to read from. + * @throws IOException IO error. + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + public SeekableDataInputStream openSeekable(StoragePath path) throws IOException { Review Comment: should it be final then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
yihua commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569863508 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List listDirectEntries(StoragePath path) throws IOException { + public List listDirectory(StoragePath path) throws IOException { Review Comment: There is already a `listFiles` API which lists all files under the directory recursively, which provides different semantics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
yihua commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569864294 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List listDirectEntries(StoragePath path) throws IOException { + public List listDirectory(StoragePath path) throws IOException { Review Comment: The counterpart Hadoop API returns an array so I think we can keep the return type as List for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
jonvex commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569863920 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException { } } + /** + * Opens an SeekableDataInputStream at the indicated path with seeks supported. + * + * @param path the file to open. + * @return the InputStream to read from. + * @throws IOException IO error. + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + public SeekableDataInputStream openSeekable(StoragePath path) throws IOException { Review Comment: ok -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
yihua commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569861327 ## hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java: ## @@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException { assertTrue(storage.createDirectory(path4)); } + @Test + public void testSeekable() throws IOException { +HoodieStorage storage = getHoodieStorage(); +StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file"); +assertFalse(storage.exists(path)); +byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79}; + +// By default, create overwrites the file +try (OutputStream stream = storage.create(path)) { + stream.write(data); + stream.flush(); +} + +try (SeekableDataInputStream seekableStream = storage.openSeekable(path)) { + validateSeekableDataInputStream(seekableStream, data); +} + +try (SeekableDataInputStream seekableStream = storage.openSeekable(path, 2)) { + validateSeekableDataInputStream(seekableStream, data); +} + } + + private void validateSeekableDataInputStream(SeekableDataInputStream seekableStream, + byte[] expectedData) throws IOException { +List positionList = new ArrayList<>(); +positionList.add(1); +positionList.add(expectedData.length / 2); +positionList.add(expectedData.length - 1); Review Comment: They are meant to test non-contiguous and backward seeks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
yihua commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569860811 ## hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java: ## @@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException { assertTrue(storage.createDirectory(path4)); } + @Test + public void testSeekable() throws IOException { +HoodieStorage storage = getHoodieStorage(); +StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file"); +assertFalse(storage.exists(path)); +byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79}; Review Comment: Because values larger than 127 have to be cast from int to byte (otherwise one would have to use negative values here, which would not be readable). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
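The point about the casts is a general Java fact worth spelling out: `byte` is a signed 8-bit type with range [-128, 127], so literals above 127 require an explicit narrowing cast, and the stored value wraps around. A quick standalone illustration (not PR code):

```java
public class ByteCastDemo {
    public static void main(String[] args) {
        // Narrowing cast wraps: value - 256 for values in (127, 255].
        byte b1 = (byte) 158;   // stored as -98
        byte b2 = (byte) 233;   // stored as -23
        System.out.println(b1 + " " + b2);           // -98 -23

        // Recover the unsigned view when needed:
        System.out.println(Byte.toUnsignedInt(b1));  // 158
    }
}
```

This is why the test data uses `(byte) 158` rather than the equivalent but unreadable literal `-98`.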
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
yihua commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569860096 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException { } } + /** + * Opens an SeekableDataInputStream at the indicated path with seeks supported. + * + * @param path the file to open. + * @return the InputStream to read from. + * @throws IOException IO error. + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + public SeekableDataInputStream openSeekable(StoragePath path) throws IOException { Review Comment: I intentionally put the APIs that do not need to be implemented at the end of the class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
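The design being discussed — a base-class default built on a primitive operation, which concrete storage implementations may override with something more efficient — is essentially the template-method pattern. A rough, self-contained sketch of the idea follows; `Storage`, `SeekableBytes`, and `openSeekable` here are illustrative stand-ins, not Hudi's actual classes or signatures.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DefaultMethodDemo {
    // Minimal stand-in for a seekable input stream.
    static class SeekableBytes {
        private final byte[] data;
        private int pos = 0;
        SeekableBytes(byte[] data) { this.data = data; }
        void seek(int p) { pos = p; }  // backward seeks allowed
        int read() { return pos < data.length ? (data[pos++] & 0xFF) : -1; }
    }

    abstract static class Storage {
        // Primitive operation every implementation must provide.
        abstract byte[] readAll(String path) throws IOException;

        // Default openSeekable built on the primitive. Implementations
        // with native random access can override it; leaving it non-final
        // keeps that door open, which is the trade-off raised in review.
        SeekableBytes openSeekable(String path) throws IOException {
            return new SeekableBytes(readAll(path));
        }
    }

    static class InMemoryStorage extends Storage {
        private final Map<String, byte[]> files = new HashMap<>();
        void put(String path, byte[] data) { files.put(path, data); }
        @Override byte[] readAll(String path) { return files.get(path); }
    }

    public static void main(String[] args) throws IOException {
        InMemoryStorage storage = new InMemoryStorage();
        storage.put("f", new byte[] {10, 20, 30});
        SeekableBytes in = storage.openSeekable("f");
        in.seek(2);
        System.out.println(in.read()); // 30
        in.seek(0);                    // backward seek
        System.out.println(in.read()); // 10
    }
}
```

Making the method `final` (as asked earlier in the thread) would forbid the per-implementation optimization; keeping it overridable, as here, is the alternative the reviewers settled on.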
(hudi) branch release-0.14.1-hotfix updated (4d429ac918d -> ddd0414ce97)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch release-0.14.1-hotfix in repository https://gitbox.apache.org/repos/asf/hudi.git discard 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) new ddd0414ce97 [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this: * -- * -- B -- O -- O -- O (4d429ac918d) \ N -- N -- N refs/heads/release-0.14.1-hotfix (ddd0414ce97) You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B. Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever. The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java| 1 + 1 file changed, 1 insertion(+)
(hudi) 01/01: [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch release-0.14.1-hotfix in repository https://gitbox.apache.org/repos/asf/hudi.git commit ddd0414ce9745c9cfe4c173195709fbf101c1be4 Author: Jon Vexler AuthorDate: Mon Mar 11 17:25:41 2024 -0700 [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) * get rid of collect in row writer clustering * fix race condition * add logging - Co-authored-by: Jonathan Vexler <=> --- .../index/bucket/ConsistentBucketIndexUtils.java | 12 ++- .../hudi/HoodieDatasetBulkInsertHelper.scala | 89 +++--- 2 files changed, 56 insertions(+), 45 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java index f8befee9bf9..160688f6708 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java @@ -53,6 +53,7 @@ import java.util.stream.Collectors; import static org.apache.hudi.common.model.HoodieConsistentHashingMetadata.HASHING_METADATA_COMMIT_FILE_SUFFIX; import static org.apache.hudi.common.model.HoodieConsistentHashingMetadata.HASHING_METADATA_FILE_SUFFIX; import static org.apache.hudi.common.model.HoodieConsistentHashingMetadata.getTimestampFromFile; +import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes; /** * Utilities class for consistent bucket index metadata management. @@ -208,7 +209,16 @@ public class ConsistentBucketIndexUtils { if (fs.exists(fullPath)) { return; } -FileIOUtils.createFileInPath(fs, fullPath, Option.of(StringUtils.EMPTY_STRING.getBytes())); +//prevent exception from race condition. 
We are ok with the file being created in another thread, so we should +// check for the marker after catching the exception and we don't need to fail if the file exists +try { + FileIOUtils.createFileInPath(fs, fullPath, Option.of(getUTF8Bytes(StringUtils.EMPTY_STRING))); +} catch (HoodieIOException e) { + if (!fs.exists(fullPath)) { +throw e; + } + LOG.warn("Failed to create marker but " + fullPath + " exists", e); +} } /*** diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala index 9623d4d1c09..95302e9ab37 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala @@ -26,6 +26,7 @@ import org.apache.hudi.common.engine.TaskContextSupplier import org.apache.hudi.common.model.HoodieRecord import org.apache.hudi.common.util.ReflectionUtils import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.data.HoodieJavaRDD import org.apache.hudi.exception.HoodieException import org.apache.hudi.index.HoodieIndex.BucketIndexEngineType import org.apache.hudi.index.{HoodieIndex, SparkHoodieIndexFactory} @@ -149,53 +150,53 @@ object HoodieDatasetBulkInsertHelper arePartitionRecordsSorted: Boolean, shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = { val schema = dataset.schema -val writeStatuses = injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => { - val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier - val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get - val taskId = taskContextSupplier.getStageIdSupplier.get.toLong - val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get +HoodieJavaRDD.of( + injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => { +val 
taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier +val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get +val taskId = taskContextSupplier.getStageIdSupplier.get.toLong +val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get - val writer = writeConfig.getIndexType match { -case HoodieIndex.IndexType.BUCKET if writeConfig.getBucketIndexEngineType - == BucketIndexEngineType.CONSISTENT_HASHING => - new ConsistentBucketBulkInsertDataInternalWriterHelper( -table, -writeConfig, -instantTime, -taskPartitionId, -taskId, -taskEpochId, -schema, -writeConfig.populateMetaFields, -arePartitionRecordsSorted, -shouldPres
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
yihua commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569859081 ## hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopFSUtils.java: ## @@ -154,4 +158,90 @@ public static FileStatus convertToHadoopFileStatus(StoragePathInfo pathInfo) { pathInfo.getModificationTime(), convertToHadoopPath(pathInfo.getPath())); } + + /** + * Fetch the right {@link FSDataInputStream} to be used by wrapping with required input streams. + * + * @param fs instance of {@link FileSystem} in use. + * @param filePath path of the file. + * @param bufferSize buffer size to be used. + * @return the right {@link FSDataInputStream} as required. + */ + public static FSDataInputStream getFSDataInputStream(FileSystem fs, + StoragePath filePath, + int bufferSize) { +FSDataInputStream fsDataInputStream = null; +try { + fsDataInputStream = fs.open(new Path(filePath.toUri()), bufferSize); Review Comment: Good catch. We have the util method. Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) 02/02: [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch release-0.14.1-hotfix in repository https://gitbox.apache.org/repos/asf/hudi.git commit 4d429ac918dae3116fbc9df420e0baccdf79ea0b Author: Jon Vexler AuthorDate: Mon Mar 11 17:25:41 2024 -0700 [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) * get rid of collect in row writer clustering * fix race condition * add logging - Co-authored-by: Jonathan Vexler <=> --- .../index/bucket/ConsistentBucketIndexUtils.java | 11 ++- .../hudi/HoodieDatasetBulkInsertHelper.scala | 89 +++--- 2 files changed, 55 insertions(+), 45 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java index f8befee9bf9..1e8abe09ecc 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIndexUtils.java @@ -208,7 +208,16 @@ public class ConsistentBucketIndexUtils { if (fs.exists(fullPath)) { return; } -FileIOUtils.createFileInPath(fs, fullPath, Option.of(StringUtils.EMPTY_STRING.getBytes())); +//prevent exception from race condition. 
We are ok with the file being created in another thread, so we should +// check for the marker after catching the exception and we don't need to fail if the file exists +try { + FileIOUtils.createFileInPath(fs, fullPath, Option.of(getUTF8Bytes(StringUtils.EMPTY_STRING))); +} catch (HoodieIOException e) { + if (!fs.exists(fullPath)) { +throw e; + } + LOG.warn("Failed to create marker but " + fullPath + " exists", e); +} } /*** diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala index 9623d4d1c09..95302e9ab37 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala @@ -26,6 +26,7 @@ import org.apache.hudi.common.engine.TaskContextSupplier import org.apache.hudi.common.model.HoodieRecord import org.apache.hudi.common.util.ReflectionUtils import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.data.HoodieJavaRDD import org.apache.hudi.exception.HoodieException import org.apache.hudi.index.HoodieIndex.BucketIndexEngineType import org.apache.hudi.index.{HoodieIndex, SparkHoodieIndexFactory} @@ -149,53 +150,53 @@ object HoodieDatasetBulkInsertHelper arePartitionRecordsSorted: Boolean, shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = { val schema = dataset.schema -val writeStatuses = injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => { - val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier - val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get - val taskId = taskContextSupplier.getStageIdSupplier.get.toLong - val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get +HoodieJavaRDD.of( + injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => { +val 
taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier +val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get +val taskId = taskContextSupplier.getStageIdSupplier.get.toLong +val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get - val writer = writeConfig.getIndexType match { -case HoodieIndex.IndexType.BUCKET if writeConfig.getBucketIndexEngineType - == BucketIndexEngineType.CONSISTENT_HASHING => - new ConsistentBucketBulkInsertDataInternalWriterHelper( -table, -writeConfig, -instantTime, -taskPartitionId, -taskId, -taskEpochId, -schema, -writeConfig.populateMetaFields, -arePartitionRecordsSorted, -shouldPreserveHoodieMetadata) -case _ => - new BulkInsertDataInternalWriterHelper( -table, -writeConfig, -instantTime, -taskPartitionId, -taskId, -taskEpochId, -schema, -writeConfig.populateMetaFields, -arePartitionRecordsSorted, -shouldPreserveHoodieMetadata) - } +val writer = writeConfig.getIndexType match { + case HoodieIndex.IndexType.BUCKET if writeConfig.getBuc
(hudi) branch release-0.14.1-hotfix created (now 4d429ac918d)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch release-0.14.1-hotfix in repository https://gitbox.apache.org/repos/asf/hudi.git at 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) This branch includes the following new commits: new a1ea1da55a8 [HUDI-7479] SQL confs don't propagate to spark row writer properly (#10786) new 4d429ac918d [HUDI-7489] Avoid collecting WriteStatus to driver in row writer code path (#10836) The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
(hudi) 01/02: [HUDI-7479] SQL confs don't propagate to spark row writer properly (#10786)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch release-0.14.1-hotfix in repository https://gitbox.apache.org/repos/asf/hudi.git commit a1ea1da55a84e3db51a0a06be99645831d6bc0f3 Author: Jon Vexler AuthorDate: Tue Mar 5 04:31:55 2024 -0500 [HUDI-7479] SQL confs don't propagate to spark row writer properly (#10786) --- .../hudi/HoodieDatasetBulkInsertHelper.scala | 15 +- .../scala/org/apache/hudi/HoodieSparkUtils.scala | 2 +- .../common/testutils/HoodieTestDataGenerator.java | 15 +- .../deltastreamer/HoodieDeltaStreamerTestBase.java | 9 ++-- .../deltastreamer/TestHoodieDeltaStreamer.java | 24 +- 5 files changed, 51 insertions(+), 14 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala index 75ec069946d..9623d4d1c09 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala @@ -17,6 +17,7 @@ package org.apache.hudi +import org.apache.hudi.HoodieSparkUtils.injectSQLConf import org.apache.hudi.client.WriteStatus import org.apache.hudi.client.model.HoodieInternalRow import org.apache.hudi.common.config.TypedProperties @@ -40,11 +41,14 @@ import org.apache.spark.sql.HoodieUnsafeUtils.getNumPartitions import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.{Alias, Literal} import org.apache.spark.sql.catalyst.plans.logical.Project +import org.apache.spark.sql.execution.SQLConfInjectingRDD +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.{StringType, StructField, StructType} import org.apache.spark.sql.{DataFrame, Dataset, HoodieUnsafeUtils, Row} import org.apache.spark.unsafe.types.UTF8String import 
scala.collection.JavaConverters.{asScalaBufferConverter, seqAsJavaListConverter} +import scala.reflect.ClassTag object HoodieDatasetBulkInsertHelper extends ParallelismHelper[DataFrame](toJavaSerializableFunctionUnchecked(df => getNumPartitions(df))) with Logging { @@ -80,8 +84,8 @@ object HoodieDatasetBulkInsertHelper val keyGeneratorClassName = config.getStringOrThrow(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME, "Key-generator class name is required") - val prependedRdd: RDD[InternalRow] = -df.queryExecution.toRdd.mapPartitions { iter => + val prependedRdd: RDD[InternalRow] = { +injectSQLConf(df.queryExecution.toRdd.mapPartitions { iter => val typedProps = new TypedProperties(config.getProps) if (autoGenerateRecordKeys) { typedProps.setProperty(KeyGenUtils.RECORD_KEY_GEN_PARTITION_ID_CONFIG, String.valueOf(TaskContext.getPartitionId())) @@ -107,7 +111,8 @@ object HoodieDatasetBulkInsertHelper // TODO use mutable row, avoid re-allocating new HoodieInternalRow(commitTimestamp, commitSeqNo, recordKey, partitionPath, filename, row, false) } -} +}, SQLConf.get) + } val dedupedRdd = if (config.shouldCombineBeforeInsert) { dedupeRows(prependedRdd, updatedSchema, config.getPreCombineField, SparkHoodieIndexFactory.isGlobalIndex(config)) @@ -144,7 +149,7 @@ object HoodieDatasetBulkInsertHelper arePartitionRecordsSorted: Boolean, shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = { val schema = dataset.schema -val writeStatuses = dataset.queryExecution.toRdd.mapPartitions(iter => { +val writeStatuses = injectSQLConf(dataset.queryExecution.toRdd.mapPartitions(iter => { val taskContextSupplier: TaskContextSupplier = table.getTaskContextSupplier val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get val taskId = taskContextSupplier.getStageIdSupplier.get.toLong @@ -189,7 +194,7 @@ object HoodieDatasetBulkInsertHelper } writer.getWriteStatuses.asScala.iterator -}).collect() +}), SQLConf.get).collect() 
table.getContext.parallelize(writeStatuses.toList.asJava) } diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala index a0fe879b3db..92adfdd4389 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala @@ -125,7 +125,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi }, SQLConf.get) } - private def injectSQLConf[T: ClassTag](rdd: RDD[T], conf: SQLConf): RDD[T] = + def injectSQLConf[T: ClassTa
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
xuzifu666 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1569824114 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: For example, this could be used by a unified SQL engine, whether open source or developed in-house by a business, to improve the user's SQL experience, e.g., by unifying case sensitivity or supporting standard SQL. This is not a user behavior; it is a potential requirement that comes with change. @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
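The configurable case sensitivity requested for the Parquet schema clip boils down to how a requested column name is matched against the file schema's field names. A minimal illustration of that resolution step (purely hypothetical helper, not the PR's implementation and not Hudi's API):

```java
import java.util.Arrays;
import java.util.List;

public class FieldResolver {
    // Resolve a requested column name against the schema's field names,
    // either case-sensitively or not. Returns null when nothing matches.
    static String resolve(List<String> schemaFields, String requested, boolean caseSensitive) {
        for (String field : schemaFields) {
            boolean match = caseSensitive
                ? field.equals(requested)
                : field.equalsIgnoreCase(requested);
            if (match) {
                return field;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("UserId", "ts");
        System.out.println(resolve(fields, "userid", true));   // null
        System.out.println(resolve(fields, "userid", false));  // UserId
    }
}
```

Hard-coding the flag to `true`, as the original code did, silently drops columns whose case differs between the engine and the file — which is the situation the PR makes configurable.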
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
jonvex commented on code in PR #11048: URL: https://github.com/apache/hudi/pull/11048#discussion_r1569817603 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -332,6 +350,18 @@ public boolean createNewFile(StoragePath path) throws IOException { } } + /** + * Opens an SeekableDataInputStream at the indicated path with seeks supported. + * + * @param path the file to open. + * @return the InputStream to read from. + * @throws IOException IO error. + */ + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + public SeekableDataInputStream openSeekable(StoragePath path) throws IOException { Review Comment: minor: should put this next to the other openSeekable def ## hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java: ## @@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException { assertTrue(storage.createDirectory(path4)); } + @Test + public void testSeekable() throws IOException { +HoodieStorage storage = getHoodieStorage(); +StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file"); +assertFalse(storage.exists(path)); +byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79}; + +// By default, create overwrites the file +try (OutputStream stream = storage.create(path)) { + stream.write(data); + stream.flush(); +} + +try (SeekableDataInputStream seekableStream = storage.openSeekable(path)) { + validateSeekableDataInputStream(seekableStream, data); +} + +try (SeekableDataInputStream seekableStream = storage.openSeekable(path, 2)) { + validateSeekableDataInputStream(seekableStream, data); +} + } + + private void validateSeekableDataInputStream(SeekableDataInputStream seekableStream, + byte[] expectedData) throws IOException { +List positionList = new ArrayList<>(); +positionList.add(1); +positionList.add(expectedData.length / 2); +positionList.add(expectedData.length - 1); Review Comment: why add these 3 if they will already be in the range [0,expectedData.length) ? 
## hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopFSUtils.java: ## @@ -154,4 +158,90 @@ public static FileStatus convertToHadoopFileStatus(StoragePathInfo pathInfo) { pathInfo.getModificationTime(), convertToHadoopPath(pathInfo.getPath())); } + + /** + * Fetch the right {@link FSDataInputStream} to be used by wrapping with required input streams. + * + * @param fs instance of {@link FileSystem} in use. + * @param filePath path of the file. + * @param bufferSize buffer size to be used. + * @return the right {@link FSDataInputStream} as required. + */ + public static FSDataInputStream getFSDataInputStream(FileSystem fs, + StoragePath filePath, + int bufferSize) { +FSDataInputStream fsDataInputStream = null; +try { + fsDataInputStream = fs.open(new Path(filePath.toUri()), bufferSize); Review Comment: What do you think about a helper method for this?: new Path(filePath.toUri()) or will that just add more clutter? ## hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieStorageBase.java: ## @@ -148,6 +150,46 @@ public void testCreateWriteAndRead() throws IOException { assertTrue(storage.createDirectory(path4)); } + @Test + public void testSeekable() throws IOException { +HoodieStorage storage = getHoodieStorage(); +StoragePath path = new StoragePath(getTempDir(), "testSeekable/1.file"); +assertFalse(storage.exists(path)); +byte[] data = new byte[] {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79}; Review Comment: why did you need to cast the values over 100? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
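[Editor's note] The seek positions debated in the review (1, length/2, length-1, plus random offsets) can be exercised against any seekable stream. A minimal stdlib-only Java sketch of that seek-then-read validation — the helper names are hypothetical, not Hudi's actual TestHoodieStorageBase code:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class Main {
    // Seek directly to a few offsets and check the byte read there matches
    // the data originally written, mirroring the boundary positions in the test.
    static void validateSeekable(File f, byte[] expected) throws IOException {
        int[] positions = {1, expected.length / 2, expected.length - 1};
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            for (int pos : positions) {
                raf.seek(pos); // jump to the offset without reading intervening bytes
                if (raf.readByte() != expected[pos]) {
                    throw new AssertionError("mismatch at position " + pos);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {2, 42, 49, (byte) 158, (byte) 233, 66, 9, 34, 79};
        File f = File.createTempFile("seekable", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.write(data);
        }
        validateSeekable(f, data);
        System.out.println("ok");
    }
}
```

On the `(byte)` casts asked about above: Java's `byte` is signed with range [-128, 127], so integer literals above 127, such as 158 and 233, do not fit and require an explicit cast to compile.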
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
danny0405 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1569813734 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: > When parquet real schema is (day,user),but user selected schema is (dAy,UseR) Why there is discrepency for the field case-sensitivity or why the schema specified by the user is inconsistent with the table schema definition? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4228) Clean up literal usage in Hudi CLI argument check
[ https://issues.apache.org/jira/browse/HUDI-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-4228: - Fix Version/s: 1.0.0 > Clean up literal usage in Hudi CLI argument check > - > > Key: HUDI-4228 > URL: https://issues.apache.org/jira/browse/HUDI-4228 > Project: Apache Hudi > Issue Type: Improvement > Components: cli >Reporter: Ethan Guo >Assignee: Vova Kolmakov >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > In "org.apache.hudi.cli.commands.SparkMain", the logic for checking number of > arguments for different Hudi CLI commands is hardcoded with literals like > this: > {code:java} > case COMPACT_RUN: > assert (args.length >= 10); > propsFilePath = null; > if (!StringUtils.isNullOrEmpty(args[9])) { > propsFilePath = args[9]; > } > configs = new ArrayList<>(); > if (args.length > 10) { > configs.addAll(Arrays.asList(args).subList(9, args.length)); > } > returnCode = compact(jsc, args[3], args[4], args[5], > Integer.parseInt(args[6]), args[7], > Integer.parseInt(args[8]), HoodieCompactor.EXECUTE, propsFilePath, > configs); > break; {code} > We should have a better way of validating this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-4228) Clean up literal usage in Hudi CLI argument check
[ https://issues.apache.org/jira/browse/HUDI-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-4228. Resolution: Fixed Fixed via master branch: 4c04216d5cade10f9589e2c0d425109cc502ebcd > Clean up literal usage in Hudi CLI argument check > - > > Key: HUDI-4228 > URL: https://issues.apache.org/jira/browse/HUDI-4228 > Project: Apache Hudi > Issue Type: Improvement > Components: cli >Reporter: Ethan Guo >Assignee: Vova Kolmakov >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > In "org.apache.hudi.cli.commands.SparkMain", the logic for checking number of > arguments for different Hudi CLI commands is hardcoded with literals like > this: > {code:java} > case COMPACT_RUN: > assert (args.length >= 10); > propsFilePath = null; > if (!StringUtils.isNullOrEmpty(args[9])) { > propsFilePath = args[9]; > } > configs = new ArrayList<>(); > if (args.length > 10) { > configs.addAll(Arrays.asList(args).subList(9, args.length)); > } > returnCode = compact(jsc, args[3], args[4], args[5], > Integer.parseInt(args[6]), args[7], > Integer.parseInt(args[8]), HoodieCompactor.EXECUTE, propsFilePath, > configs); > break; {code} > We should have a better way of validating this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 4c04216d5ca [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042) 4c04216d5ca is described below commit 4c04216d5cade10f9589e2c0d425109cc502ebcd Author: Vova Kolmakov AuthorDate: Thu Apr 18 09:14:32 2024 +0700 [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042) --- .../org/apache/hudi/cli/commands/SparkMain.java| 202 +++-- .../org/apache/hudi/cli/ArchiveExecutorUtils.java | 2 +- 2 files changed, 69 insertions(+), 135 deletions(-) diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java index 5e1aec52333..ba3b404474b 100644 --- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java +++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java @@ -19,14 +19,12 @@ package org.apache.hudi.cli.commands; import org.apache.hudi.DataSourceWriteOptions; +import org.apache.hudi.cli.ArchiveExecutorUtils; import org.apache.hudi.cli.utils.SparkUtil; import org.apache.hudi.client.SparkRDDWriteClient; import org.apache.hudi.client.common.HoodieSparkEngineContext; -import org.apache.hudi.client.timeline.HoodieTimelineArchiver; -import org.apache.hudi.common.config.HoodieMetadataConfig; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.engine.HoodieEngineContext; -import org.apache.hudi.common.model.HoodieAvroPayload; import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.model.WriteOperationType; @@ -37,7 +35,6 @@ import org.apache.hudi.common.util.Option; import org.apache.hudi.common.util.PartitionPathEncodeUtils; import org.apache.hudi.common.util.StringUtils; 
import org.apache.hudi.common.util.ValidationUtils; -import org.apache.hudi.config.HoodieArchivalConfig; import org.apache.hudi.config.HoodieBootstrapConfig; import org.apache.hudi.config.HoodieCleanConfig; import org.apache.hudi.config.HoodieIndexConfig; @@ -99,16 +96,45 @@ public class SparkMain { * Commands. */ enum SparkCommand { -BOOTSTRAP, ROLLBACK, DEDUPLICATE, ROLLBACK_TO_SAVEPOINT, SAVEPOINT, IMPORT, UPSERT, COMPACT_SCHEDULE, COMPACT_RUN, COMPACT_SCHEDULE_AND_EXECUTE, -COMPACT_UNSCHEDULE_PLAN, COMPACT_UNSCHEDULE_FILE, COMPACT_VALIDATE, COMPACT_REPAIR, CLUSTERING_SCHEDULE, -CLUSTERING_RUN, CLUSTERING_SCHEDULE_AND_EXECUTE, CLEAN, DELETE_MARKER, DELETE_SAVEPOINT, UPGRADE, DOWNGRADE, -REPAIR_DEPRECATED_PARTITION, RENAME_PARTITION, ARCHIVE +BOOTSTRAP(18), ROLLBACK(6), DEDUPLICATE(8), ROLLBACK_TO_SAVEPOINT(6), SAVEPOINT(7), +IMPORT(13), UPSERT(13), COMPACT_SCHEDULE(7), COMPACT_RUN(10), COMPACT_SCHEDULE_AND_EXECUTE(9), +COMPACT_UNSCHEDULE_PLAN(9), COMPACT_UNSCHEDULE_FILE(10), COMPACT_VALIDATE(7), COMPACT_REPAIR(8), +CLUSTERING_SCHEDULE(7), CLUSTERING_RUN(9), CLUSTERING_SCHEDULE_AND_EXECUTE(8), CLEAN(5), +DELETE_MARKER(5), DELETE_SAVEPOINT(5), UPGRADE(5), DOWNGRADE(5), +REPAIR_DEPRECATED_PARTITION(4), RENAME_PARTITION(6), ARCHIVE(8); + +private final int minArgsCount; + +SparkCommand(int minArgsCount) { + this.minArgsCount = minArgsCount; +} + +void assertEq(int factArgsCount) { + ValidationUtils.checkArgument(factArgsCount == minArgsCount); +} + +void assertGtEq(int factArgsCount) { + ValidationUtils.checkArgument(factArgsCount >= minArgsCount); +} + +List makeConfigs(String[] args) { + List configs = new ArrayList<>(); + if (args.length > minArgsCount) { +configs.addAll(Arrays.asList(args).subList(minArgsCount, args.length)); + } + return configs; +} + +String getPropsFilePath(String[] args) { + return (args.length >= minArgsCount && !StringUtils.isNullOrEmpty(args[minArgsCount - 1])) + ? 
args[minArgsCount - 1] : null; +} } - public static void main(String[] args) throws Exception { + public static void main(String[] args) { ValidationUtils.checkArgument(args.length >= 4); final String commandString = args[0]; -LOG.info("Invoking SparkMain: " + commandString); +LOG.info("Invoking SparkMain: {}", commandString); final SparkCommand cmd = SparkCommand.valueOf(commandString); JavaSparkContext jsc = SparkUtil.initJavaSparkContext("hoodie-cli-" + commandString, @@ -116,193 +142,112 @@ public class SparkMain { int returnCode = 0; try { + cmd.assertGtEq(args.length); + List configs = cmd.makeConfigs(args); + String propsFilePath = cmd.getPropsFilePath(args); switch (cmd) { case ROLLBACK: - assert (args.length == 6); + cmd.assertEq(args.l
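[Editor's note] The commit above replaces hardcoded argument-count literals with a per-command minimum encoded in the enum. A stripped-down, self-contained sketch of that pattern — a hypothetical three-command subset, not the full Hudi SparkMain class:

```java
import java.util.Arrays;
import java.util.List;

public class Main {
    // Each command carries its minimum argument count, so validation and
    // extra-config extraction become data-driven instead of literal-driven.
    enum SparkCommand {
        CLEAN(5), COMPACT_RUN(10), ARCHIVE(8);

        private final int minArgsCount;

        SparkCommand(int minArgsCount) {
            this.minArgsCount = minArgsCount;
        }

        void assertGtEq(int actual) {
            if (actual < minArgsCount) {
                throw new IllegalArgumentException(
                    name() + " needs >= " + minArgsCount + " args, got " + actual);
            }
        }

        // Everything past the minimum is treated as extra configs.
        List<String> makeConfigs(String[] args) {
            return args.length > minArgsCount
                ? Arrays.asList(args).subList(minArgsCount, args.length)
                : List.of();
        }
    }

    public static void main(String[] args) {
        String[] cli = {"CLEAN", "jar", "master", "memory", "path", "extra.conf=1"};
        SparkCommand cmd = SparkCommand.valueOf(cli[0]);
        cmd.assertGtEq(cli.length);          // 6 >= 5, passes
        System.out.println(cmd.makeConfigs(cli)); // [extra.conf=1]
    }
}
```

The design choice is the one the commit makes: a wrong count fails with a message naming the command, and the config/props-file slicing logic exists once instead of being copied into every `case` arm.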
Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]
danny0405 merged PR #11042: URL: https://github.com/apache/hudi/pull/11042 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]
hudi-bot commented on PR #11031: URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062867237 ## CI report: * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN * 5ffea9f5122b0302cc254c88edb4fbe5feb05d39 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23321) * a7270a970988e18ba0997ee59b74678c2dc720ed Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23336) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]
hudi-bot commented on PR #11042: URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062861299 ## CI report: * 3892e87e4b9a45c05090890cbda223a59e87eb2a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23332) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]
yihua commented on PR #10591: URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062860564 I need to check a few more things before landing this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]
hudi-bot commented on PR #11031: URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062861117 ## CI report: * e159472757b2475611e99dc4afd8fe2def6967f4 UNKNOWN * 5ffea9f5122b0302cc254c88edb4fbe5feb05d39 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23321) * a7270a970988e18ba0997ee59b74678c2dc720ed UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7637] Make StoragePathInfo Comparable [hudi]
hudi-bot commented on PR #11050: URL: https://github.com/apache/hudi/pull/11050#issuecomment-2062861431 ## CI report: * f7a837d2512ae7588af5e18a596c4c58167c27cb Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23334) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]
hudi-bot commented on PR #10591: URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062860147 ## CI report: * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN * 5b08db2d947b74dc26873f4ec7c6022a52e07193 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23335) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
xuzifu666 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1569794748 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: OK, e.g.: 1) When the real Parquet schema is (Day,day,user) and caseSensitive is true, a selected schema of (Day,day,user) is pushed down and read from Parquet successfully when building the ParquetFileReader; this scenario is fine. 2) When the real Parquet schema is (day,user) but the user-selected schema is (dAy,UseR), push-down fails if caseSensitive is true; in this scenario caseSensitive should be set to false to handle it. @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
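[Editor's note] The essence of the proposed flag is whether field-name matching during schema clipping ignores case. A hedged Java sketch of that matching logic — a hypothetical `clip` helper, not Hudi's actual Parquet schema-clipping code:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Resolve each requested field against the file schema's physical names.
    // With caseSensitive=false, a request for "dAy" still resolves to "day".
    static List<String> clip(List<String> fileFields, List<String> requested,
                             boolean caseSensitive) {
        List<String> resolved = new ArrayList<>();
        for (String want : requested) {
            for (String have : fileFields) {
                if (caseSensitive ? have.equals(want) : have.equalsIgnoreCase(want)) {
                    resolved.add(have); // keep the physical name for push-down
                    break;
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> file = List.of("day", "user");
        System.out.println(clip(file, List.of("dAy", "UseR"), true));  // [] -> push-down fails
        System.out.println(clip(file, List.of("dAy", "UseR"), false)); // [day, user]
    }
}
```

This mirrors scenario 2 in the comment: with case-sensitive matching, (dAy,UseR) resolves to nothing against a (day,user) file schema, so nothing can be pushed down.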
Re: [PR] [HUDI-7636] Make StoragePath Serializable [hudi]
hudi-bot commented on PR #11049: URL: https://github.com/apache/hudi/pull/11049#issuecomment-2062855268 ## CI report: * 922aaf83efb3988273d62aa2de2977ee40ff860e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
danny0405 commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1569793746 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List listDirectEntries(StoragePath path) throws IOException { + public List listDirectory(StoragePath path) throws IOException { Review Comment: Maybe `listFiles` ? Should we return an iterator instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
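[Editor's note] The naming question above hinges on two distinct semantics: listing a directory's direct children (with validation that the path is a directory) versus recursively listing all files underneath a path. A sketch of the contrast using `java.nio.file` — this is illustrative, not the HoodieStorage API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Main {
    // Direct children only; rejects non-directory inputs up front.
    static List<Path> listDirectory(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            throw new NotDirectoryException(dir.toString());
        }
        try (Stream<Path> s = Files.list(dir)) {
            return s.sorted().collect(Collectors.toList());
        }
    }

    // Every regular file underneath the path, recursively.
    static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("demo");
        Files.createFile(root.resolve("a.txt"));
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(sub.resolve("b.txt"));
        System.out.println(listDirectory(root).size()); // 2: a.txt and sub/
        System.out.println(listFiles(root).size());     // 2: a.txt and sub/b.txt
    }
}
```

Whether to return a materialized list or an iterator, as the comment asks, matters most for the recursive variant, where large trees make eager listing expensive.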
Re: [PR] [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [hudi]
zhuanshenbsj1 commented on PR #11031: URL: https://github.com/apache/hudi/pull/11031#issuecomment-2062848895 @danny0405 cc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7591) Implement InlineFS in HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7591: Status: In Progress (was: Open) > Implement InlineFS in HoodieStorage > --- > > Key: HUDI-7591 > URL: https://issues.apache.org/jira/browse/HUDI-7591 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7616) Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple
[ https://issues.apache.org/jira/browse/HUDI-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7616. --- Resolution: Fixed > Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple > -- > > Key: HUDI-7616 > URL: https://issues.apache.org/jira/browse/HUDI-7616 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-7007: - Assignee: Sagar Sumit (was: Vinaykumar Bhat) > Integrate functional index using bloom filter on reader side > > > Key: HUDI-7007 > URL: https://issues.apache.org/jira/browse/HUDI-7007 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Currently, one can create a functional index on a column using bloom filters. > However, only the one created using column stats is supported on the reader > side (check `FunctionalIndexSupport`). This ticket tracks the support for > using bloom filters on functional index in the reader path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1569772713 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -242,7 +252,44 @@ protected Pair, Schema> getRecordsIterator(HoodieDataBlock d } else { blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext); } -return Pair.of(blockRecordsIterator, dataBlock.getSchema()); +Option, Schema>> schemaEvolutionTransformerOpt = Review Comment: To clarify, do we put the common schema evolution logic in the file group reader or the record buffer classes? If that's the case, Spark parquet reader does not have to handle schema evolution, and we have common logic in the file group reader or the record buffer classes for schema on read. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7636) Make StoragePath Serializable
[ https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7636: Epic Link: HUDI-6243 > Make StoragePath Serializable > - > > Key: HUDI-7636 > URL: https://issues.apache.org/jira/browse/HUDI-7636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable
[ https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7637: Epic Link: HUDI-6243 > Make StoragePathInfo Comparable > --- > > Key: HUDI-7637 > URL: https://issues.apache.org/jira/browse/HUDI-7637 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7348) Replace Configuration with StorageConfiguration for storage configuration
[ https://issues.apache.org/jira/browse/HUDI-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7348: Epic Link: HUDI-6243 > Replace Configuration with StorageConfiguration for storage configuration > - > > Key: HUDI-7348 > URL: https://issues.apache.org/jira/browse/HUDI-7348 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs
[ https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7634: Epic Link: HUDI-6243 > Rename HoodieStorage APIs > - > > Key: HUDI-7634 > URL: https://issues.apache.org/jira/browse/HUDI-7634 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > getHoodieStorage -> getStorage > listDirectEntries -> listDirectory -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable
[ https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7633: Epic Link: HUDI-7537 > Use try with resources for AutoCloseable > > > Key: HUDI-7633 > URL: https://issues.apache.org/jira/browse/HUDI-7633 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Epic Link: HUDI-6243 > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable
[ https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7637: Story Points: 0.5 > Make StoragePathInfo Comparable > --- > > Key: HUDI-7637 > URL: https://issues.apache.org/jira/browse/HUDI-7637 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs
[ https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7634: Story Points: 0 > Rename HoodieStorage APIs > - > > Key: HUDI-7634 > URL: https://issues.apache.org/jira/browse/HUDI-7634 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > getHoodieStorage -> getStorage > listDirectEntries -> listDirectory -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Story Points: 0.5 > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7636) Make StoragePath Serializable
[ https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7636: Story Points: 1 > Make StoragePath Serializable > - > > Key: HUDI-7636 > URL: https://issues.apache.org/jira/browse/HUDI-7636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable
[ https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7633: Story Points: 0.5 > Use try with resources for AutoCloseable > > > Key: HUDI-7633 > URL: https://issues.apache.org/jira/browse/HUDI-7633 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7596) Enable Jacoco code coverage report across multiple modules
[ https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838411#comment-17838411 ] Danny Chen commented on HUDI-7596: -- The link [https://www.baeldung.com/maven-jacoco-multi-module-project] actually turns out to be a 404. > Enable Jacoco code coverage report across multiple modules > -- > > Key: HUDI-7596 > URL: https://issues.apache.org/jira/browse/HUDI-7596 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Danny Chen >Priority: Major > Labels: starter > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7616) Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple
[ https://issues.apache.org/jira/browse/HUDI-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7616: Story Points: 1 > Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple > -- > > Key: HUDI-7616 > URL: https://issues.apache.org/jira/browse/HUDI-7616 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]
wombatu-kun commented on PR #11042: URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062812536 But all refactoring operations are quite trivial, and I checked the minimum-argument-count settings for all commands three times, so I think this patch should not break anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage [hudi]
hudi-bot commented on PR #11048: URL: https://github.com/apache/hudi/pull/11048#issuecomment-2062811641 ## CI report: * 613cc856942f602fe6cef7b9c4d81991af15a520 UNKNOWN * 907ec1e2dc29edf0a6cca292673e48d040258d41 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23331) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1569752704 ## hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java: ## @@ -231,7 +231,13 @@ private static Option findNestedField(Schema schema, String[] fiel if (!nestedPart.isPresent()) { return Option.empty(); } -return nestedPart; +boolean isUnion = false; Review Comment: Could you explain this change? This affects the logic of all callers, not just schema evolution. ## hudi-io/src/main/java/org/apache/hudi/common/util/ValidationUtils.java: ## @@ -76,4 +76,10 @@ public static void checkState(final boolean expression, String errorMessage) { throw new IllegalStateException(errorMessage); } } + + public static void checkNotNull(Object o) { +if (o == null) { + throw new IllegalStateException(); Review Comment: Add an error message to the exception. ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala: ## @@ -64,7 +65,7 @@ abstract class HoodieBaseHadoopFsRelationFactory(val sqlContext: SQLContext, val schemaSpec: Option[StructType] ) extends SparkAdapterSupport with HoodieHadoopFsRelationFactory { protected lazy val sparkSession: SparkSession = sqlContext.sparkSession - protected lazy val optParams: Map[String, String] = options + protected var optParams: Map[String, String] = options Review Comment: Can we keep this immutable and modify the input `options`? ## hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java: ## @@ -211,6 +221,10 @@ public Map updateSchemaAndResetOrderingValInMetadata(Map projectRecord(Schema from, Schema to); + public UnaryOperator projectRecordUnsafe(Schema from, Schema to, Map renamedColumns) { Review Comment: Can this be unified with `projectRecord` instead of adding a new one? ## hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java: ## @@ -48,6 +49,12 @@ *and {@code RowData} in Flink. 
*/ public abstract class HoodieReaderContext { + protected HoodieFileGroupReaderState readerState = new HoodieFileGroupReaderState<>(); + + public HoodieFileGroupReaderState getReaderState() { +return readerState; + } + Review Comment: Can this be passed in and immutable? It's bad idea to modify this just for tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
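[Editor's note] The `ValidationUtils` review comment above asks for an error message on the thrown exception. A sketch of what the suggested overload could look like (the method body follows the diff; the message parameter and the example message are illustrative):

```java
// Sketch of the reviewer's suggestion: checkNotNull carrying an error message
// so that failures are diagnosable from the stack trace alone.
public class ValidationUtilsSketch {
    public static void checkNotNull(Object o, String errorMessage) {
        if (o == null) {
            throw new IllegalStateException(errorMessage);
        }
    }

    public static void main(String[] args) {
        try {
            // Hypothetical caller passing a null value with a descriptive message.
            checkNotNull(null, "record key must not be null");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```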
Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]
hudi-bot commented on PR #10591: URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062810967 ## CI report: * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN * 888fc20e7e8ef343091fa88645b96a959fa1c662 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23325) * 5b08db2d947b74dc26873f4ec7c6022a52e07193 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23335) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]
wombatu-kun commented on PR #11042: URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062809664 > @wombatu-kun Did you have any chance to test the HUDI CLI locally with your patch? No, unfortunately I don't have an environment to test it locally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]
hudi-bot commented on PR #10591: URL: https://github.com/apache/hudi/pull/10591#issuecomment-2062805303 ## CI report: * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN * 7c517227bb1079621647852c99dd7836f9900025 UNKNOWN * e89e4e0bcb756832c22779a5ccf259c5e69c0e0d UNKNOWN * 888fc20e7e8ef343091fa88645b96a959fa1c662 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23325) * 5b08db2d947b74dc26873f4ec7c6022a52e07193 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
hudi-bot commented on PR #11047: URL: https://github.com/apache/hudi/pull/11047#issuecomment-2062799489 ## CI report: * ce538c2a343e865608bb8946b9fdc2aa1c383e71 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23330) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1569730199 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -71,24 +64,36 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, isIncremental: Boolean, shouldUseRecordPosition: Boolean, requiredFilters: Seq[Filter] - ) extends ParquetFileFormat with SparkAdapterSupport with HoodieFormatTrait { + ) extends ParquetFileFormat with SparkAdapterSupport with HoodieFormatTrait { def getRequiredFilters: Seq[Filter] = requiredFilters + private val sanitizedTableName = AvroSchemaUtils.getAvroRecordQualifiedName(tableName) + /** * Support batch needs to remain consistent, even if one side of a bootstrap merge can support * while the other side can't */ - private var supportBatchCalled = false - private var supportBatchResult = false + /* +private var supportBatchCalled = false +private var supportBatchResult = false + +override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = { + if (!supportBatchCalled || supportBatchResult) { +supportBatchCalled = true +supportBatchResult = tableSchema.internalSchema.isEmpty && !isMOR && !isIncremental && !isBootstrap && super.supportBatch(sparkSession, schema) + } + supportBatchResult +} + */ + override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = false Review Comment: Are these still needed? If not, remove the lines? 
## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala: ## @@ -80,16 +79,17 @@ class TestHoodieFileGroupReaderOnSpark extends TestHoodieFileGroupReaderBase[Int tempDir.toAbsolutePath.toUri.toString } - override def getHoodieReaderContext(tablePath: String, avroSchema: Schema): HoodieReaderContext[InternalRow] = { -val parquetFileFormat = new ParquetFileFormat -val structTypeSchema = AvroConversionUtils.convertAvroSchemaToStructType(avroSchema) - -val recordReaderIterator = parquetFileFormat.buildReaderWithPartitionValues( - spark, structTypeSchema, StructType(Seq.empty), structTypeSchema, Seq.empty, Map.empty, getHadoopConf) - -val m = scala.collection.mutable.Map[Long, PartitionedFile => Iterator[InternalRow]]() -m.put(2*avroSchema.hashCode(), recordReaderIterator) -new SparkFileFormatInternalRowReaderContext(m) + override def getHoodieReaderContext(tablePath: String, avroSchema: Schema, hadoopConf: Configuration): HoodieReaderContext[InternalRow] = { +val reader = sparkAdapter.createParquetFileReader(vectorized = false, spark.sessionState.conf, Map.empty, hadoopConf) +val metaClient = HoodieTableMetaClient.builder().setConf(getHadoopConf).setBasePath(tablePath).build +val recordKeyField = if (metaClient.getTableConfig.populateMetaFields()) { + HoodieRecord.RECORD_KEY_METADATA_FIELD +} else { + val keyFields = metaClient.getTableConfig.getRecordKeyFields.get() + checkState(keyFields.length == 1) + keyFields.head +} Review Comment: Could this be wrapped into a util method? I think virtual key support may have already added such method. 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -107,19 +112,23 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, val dataSchema = StructType(tableSchema.structTypeSchema.fields.filterNot(f => partitionColumns.contains(f.name))) val outputSchema = StructType(requiredSchema.fields ++ partitionSchema.fields) spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult) -val requiredSchemaWithMandatory = generateRequiredSchemaWithMandatory(requiredSchema, dataSchema, partitionSchema) -val isCount = requiredSchemaWithMandatory.isEmpty -val requiredSchemaSplits = requiredSchemaWithMandatory.fields.partition(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name)) -val requiredMeta = StructType(requiredSchemaSplits._1) -val requiredWithoutMeta = StructType(requiredSchemaSplits._2) +val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental val augmentedHadoopConf = FSUtils.buildInlineConf(hadoopConf) -val (baseFileReader, preMergeBaseFileReader, readerMaps, cdcFileReader) = buildFileReaders( - spark, dataSchema, partitionSchema, requiredSchema, filters, options, augmentedHadoopConf, - requ
Re: [PR] [HUDI-4228] Clean up literal usage in Hudi CLI argument check [hudi]
danny0405 commented on PR #11042: URL: https://github.com/apache/hudi/pull/11042#issuecomment-2062796467 @wombatu-kun Did you have any chance to test the HUDI CLI locally with your patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
danny0405 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1569738186 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: As mentioned above, please give an illustrative example here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
danny0405 commented on code in PR #11008: URL: https://github.com/apache/hudi/pull/11008#discussion_r1569737390 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java: ## @@ -89,10 +89,13 @@ protected List getSmallFiles(String partitionPath) { private List getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) { // If we can index log files, we can add more inserts to log files for fileIds NOT including those under // pending compaction +Comparator comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice)) +.thenComparing(FileSlice::getFileId); if (table.getIndex().canIndexLogFiles()) { return table.getSliceView() .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false) .filter(this::isSmallFile) + .sorted(comparator) .collect(Collectors.toList()); Review Comment: Fine, just fix all the test failures and let's see what use cases are affected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
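[Editor's note] The diff above makes small-file ordering deterministic by breaking size ties on the file ID. The effect can be shown in isolation; `Slice` below is a hypothetical stand-in for `FileSlice` with only the two fields the comparator uses:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for FileSlice: just a file ID and a total size.
class Slice {
    final String fileId;
    final long totalSize;

    Slice(String fileId, long totalSize) {
        this.fileId = fileId;
        this.totalSize = totalSize;
    }
}

public class DeterministicOrder {
    public static void main(String[] args) {
        // Two slices share the same size: sorting by size alone leaves their
        // relative order dependent on input order; the fileId tie-breaker
        // makes the result stable across runs, as in the PR.
        List<Slice> slices = new ArrayList<>(Arrays.asList(
            new Slice("b-file", 100), new Slice("a-file", 100), new Slice("c-file", 50)));
        Comparator<Slice> comparator = Comparator
            .comparingLong((Slice s) -> s.totalSize)
            .thenComparing(s -> s.fileId);
        List<String> ids = slices.stream().sorted(comparator)
            .map(s -> s.fileId)
            .collect(Collectors.toList());
        System.out.println(ids); // smallest first, size ties broken lexicographically
    }
}
```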
Re: [I] [SUPPORT] Order rows with same key before precombine [hudi]
danny0405 commented on issue #11041: URL: https://github.com/apache/hudi/issues/11041#issuecomment-2062788935 Currently only the internal `HFile` format keeps payloads sorted within a file; for Parquet files in the dataset table, the merge would break the sequence anyway. Take https://github.com/apache/hudi/blob/6c6bddcef3ec383b08eb10f10ab0400f4edc41f4/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java#L54 for a reference. I guess you might want to make the sorting configurable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]
danny0405 commented on issue #10914: URL: https://github.com/apache/hudi/issues/10914#issuecomment-2062778638 > Would the checkpoint of job 3 contains all index information retrieved from the index bootstrap process in job 2? Asking this as I noticed a significant size differences between the checkpoint of job 2 and job 3. (500GB in job 2 vs < 50GB in job 3) yes, one successful checkpoint indicates the bootstrap has finished. > If job 3 fails and I need to start a job 4 using job 3's latest checkpoint, do I need to have index bootstrap enabled? No need to do that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
yihua commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1569721067 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -138,6 +138,7 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { spark.sessionState.catalog.dropTable(TableIdentifier(tableName), true, true) spark.sessionState.catalog.refreshTable(TableIdentifier(tableName)) spark.sessionState.conf.unsetConf(DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION.key) + spark.sessionState.conf.unsetConf("spark.sql.storeAssignmentPolicy") Review Comment: What is this used for? ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionBase.java: ## @@ -156,7 +156,7 @@ protected HoodieDeltaStreamer.Config getDeltaStreamerConfig(String[] transformer protected HoodieDeltaStreamer.Config getDeltaStreamerConfig(String[] transformerClasses, boolean nullForDeletedCols, TypedProperties extraProps) throws IOException { -extraProps.setProperty(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false"); +extraProps.setProperty(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "true"); Review Comment: Completely remove this line as file group reader is turned on by default? ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -706,6 +709,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { } test("Test schema auto evolution") { +//This test will be flakey for mor until [HUDI-6798] is landed and we can set the merge mode Review Comment: How does the test pass for MOR before? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable
[ https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7633: Status: In Progress (was: Open) > Use try with resources for AutoCloseable > > > Key: HUDI-7633 > URL: https://issues.apache.org/jira/browse/HUDI-7633 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable
[ https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7633: Status: Patch Available (was: In Progress) > Use try with resources for AutoCloseable > > > Key: HUDI-7633 > URL: https://issues.apache.org/jira/browse/HUDI-7633 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7636) Make StoragePath Serializable
[ https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7636: Labels: hoodie-storage pull-request-available (was: pull-request-available) > Make StoragePath Serializable > - > > Key: HUDI-7636 > URL: https://issues.apache.org/jira/browse/HUDI-7636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7635: --- Assignee: Ethan Guo > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Fix Version/s: 0.15.0 > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Status: In Progress (was: Open) > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs
[ https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7634: Status: In Progress (was: Open) > Rename HoodieStorage APIs > - > > Key: HUDI-7634 > URL: https://issues.apache.org/jira/browse/HUDI-7634 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > getHoodieStorage -> getStorage > listDirectEntries -> listDirectory -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Labels: hoodie-storage pull-request-available (was: pull-request-available) > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)