[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674291201 ## CI report: * 642c6dd967978781d41b74138f89fae26192056b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leosanqing commented on a diff in pull request #9297: Generate test jars for hudi-utilities and hudi-hive-sync modules
leosanqing commented on code in PR #9297: URL: https://github.com/apache/hudi/pull/9297#discussion_r1290962462 ## hudi-sync/hudi-hive-sync/pom.xml: ## @@ -200,6 +200,9 @@ + + false + Review Comment: > Weird, I can not reproduce it, maybe it is because of your local mvn repository env. Hello, I also ran into this problem when I compiled the project with this command: `mvn clean install -DskipTests -Dscala-2.12 -Dspark3.2 -Dmaven.test.skip=true -Dcheckstyle.skip=true -Dflink1.16 -Drat.skip=true` `[ERROR] Failed to execute goal on project hudi-utilities_2.12: Could not resolve dependencies for project org.apache.hudi:hudi-utilities_2.12:jar:0.15.0-SNAPSHOT: org.apache.hudi:hudi-hive-sync:jar:tests:0.15.0-SNAPSHOT was not found in https://packages.confluent.io/maven/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of confluent has elapsed or updates are forced -> [Help 1]` I don't know how to generate this test jar.
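The quoted error says the failed lookup of the `tests` jar was cached in the local Maven repository and will not be retried until updates are forced. One possible workaround (a sketch; the module path `hudi-sync/hudi-hive-sync` is taken from the pom header quoted above, and this assumes the test-jar goal is enabled in that module) is to install the sync module locally first, then rebuild with `-U`:

```shell
# Install hudi-hive-sync (and its prerequisites via -am) so the tests jar
# lands in ~/.m2 before the dependent modules resolve it
mvn clean install -pl hudi-sync/hudi-hive-sync -am -DskipTests

# -U forces Maven to retry dependency resolutions whose failures were
# cached in the local repository (the "resolution is not reattempted" case)
mvn clean install -U -DskipTests -Dscala-2.12 -Dspark3.2 -Dflink1.16 -Dcheckstyle.skip=true -Drat.skip=true
```

If the cached failure persists, deleting the stale metadata under `~/.m2/repository/org/apache/hudi` before rerunning also forces a fresh resolution.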
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
SteNicholas commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1290959520 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java: ## @@ -80,12 +104,42 @@ public static ClosableIterator getParquetRecordIterator( batchSize, path, splitStart, - splitLength)); + splitLength, + filterPredicate, + recordFilter)); if (castProjection.isPresent()) { return new SchemaEvolvedRecordIterator(itr, castProjection.get()); } else { return itr; } } } + + private static FilterPredicate getFilterPredicate(Configuration configuration) { +try { + return SerializationUtil.readObjectFromConfAsBase64(FILTER_PREDICATE, configuration); +} catch (IOException e) { Review Comment: @danny0405, the filters can be passed through Hadoop configuration entries under the `FILTER_PREDICATE` key, i.e. `parquet.private.read.filter.predicate`, set in `HoodieTableSource#getParquetConf` and consumed by either of the available readers, `VectorizedParquetRecordReader` or `ParquetRecordReader`. Meanwhile, `UNBOUND_RECORD_FILTER`, i.e. `parquet.read.filter`, is used for the native parquet read filter configuration.
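The review comment above hinges on serializing a `FilterPredicate` into the Hadoop `Configuration` as a base64 string. A dependency-free sketch of that round-trip pattern follows, using plain `java.io` serialization plus `Base64` with a `Map` standing in for the Hadoop `Configuration`; the class and method names here are illustrative only, while the real helper is parquet's `SerializationUtil.readObjectFromConfAsBase64` shown in the diff:

```java
import java.io.*;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class ConfSerde {

    // Java-serialize an object, base64-encode the bytes, and store the string
    // under the given key -- the same round-trip pattern the parquet
    // SerializationUtil helpers apply to a Hadoop Configuration.
    static void writeObjectAsBase64(String key, Serializable obj, Map<String, String> conf) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        conf.put(key, Base64.getEncoder().encodeToString(bos.toByteArray()));
    }

    // Decode and deserialize; returns null when the key is absent,
    // mirroring the "no filter was pushed down" case.
    @SuppressWarnings("unchecked")
    static <T> T readObjectFromBase64(String key, Map<String, String> conf) {
        String encoded = conf.get(key);
        if (encoded == null) {
            return null;
        }
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        writeObjectAsBase64("parquet.private.read.filter.predicate", "age > 18", conf);
        String roundTripped = readObjectFromBase64("parquet.private.read.filter.predicate", conf);
        System.out.println(roundTripped); // prints the stored predicate string: age > 18
    }
}
```

Encoding to a string is what lets an arbitrary predicate object survive being carried inside string-valued configuration entries between the planner and the reader tasks.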
[GitHub] [hudi] Zouxxyy commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
Zouxxyy commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674276179 @suryaprasanna @yihua @prashantwason can you help with a review~
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674256334 ## CI report: * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264) * 482f63ffe2df3fbaf0176a175b530082e0f31154 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19265) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674250531 ## CI report: * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262) * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264) * 482f63ffe2df3fbaf0176a175b530082e0f31154 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674220535 ## CI report: * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262) * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19264) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674216082 ## CI report: * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662) * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262) * f0ae5ade08e4d983ebc3fd23edfb5def3b0d1aef UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-6684) Follow up/ fix missing records from bloom filter partition in MDT
[ https://issues.apache.org/jira/browse/HUDI-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753046#comment-17753046 ] Sagar Sumit commented on HUDI-6684: --- Let's think about when this could happen, but if it is missing then why not simply add it? > Follow up/ fix missing records from bloom filter partition in MDT > - > > Key: HUDI-6684 > URL: https://issues.apache.org/jira/browse/HUDI-6684 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Priority: Major > > As of now, if a bloom filter for a file is missing from bloom filter > partition in MDT, we ignore it. > HoodieMetadataTableUtil > {code:java} > // If reading the bloom filter failed then do not add a record for this file > if (bloomFilterBuffer == null) { > LOG.error("Failed to read bloom filter from " + addedFilePath); > return Stream.empty().iterator(); > } > } {code} > we should think about on what scenario, this is possible and how exactly we > can handle such situations. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
codope commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1290894426 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -848,64 +851,49 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn Map> partitionToAppendedFiles, MetadataRecordsGenerationParams recordsGenerationParams, String instantTime) { -HoodieData allRecordsRDD = engineContext.emptyHoodieData(); - -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); - -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} - -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); -allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); +// Total number of files which are added or deleted +final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum() ++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum(); + +// Create the tuple (partition, filename, isDeleted) to handle both deletes and appends +final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles); +partitionToDeletedFiles.entrySet().stream() 
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); +partitionToAppendedFiles.entrySet().stream() +.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); + +// Create records MDT +int parallelism = Math.max(Math.min(partitionFileFlagTupleList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); +return engineContext.parallelize(partitionFileFlagTupleList, parallelism).flatMap(partitionFileFlagTuple -> { + final String partitionName = partitionFileFlagTuple._1(); + final String filename = partitionFileFlagTuple._2(); + final boolean isDeleted = partitionFileFlagTuple._3(); + if (!FSUtils.isBaseFile(new Path(filename))) { +LOG.warn(String.format("Ignoring file %s as it is not a base file", filename)); +return Stream.empty().iterator(); + } -List>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet() -.stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList()); -parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism); + // Read the bloom filter from the base file if the file is being added + ByteBuffer bloomFilterBuffer = ByteBuffer.allocate(0); + if (!isDeleted) { +final String pathWithPartition = partitionName + "/" + filename; +final Path addedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition); +bloomFilterBuffer = readBloomFilter(recordsGenerationParams.getDataMetaClient().getHadoopConf(), addedFilePath); + +// If reading the bloom filter failed then do not add a record for this file +if (bloomFilterBuffer == null) { 
+ LOG.error("Failed to read bloom filter from " + addedFilePath); + return Stream.empty().iterator(); Review Comment: why not simply add to bloom?
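For context, the refactor under review merges the deleted-files and appended-files maps into a single list of (partition, filename, isDeleted) tuples before one `parallelize` call, replacing the two separate RDD passes of the old code. A minimal, Hudi-free sketch of that flattening step (class and record names are illustrative, and the appended side is simplified to a `List` where the PR iterates a map's `keySet`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PartitionFileFlag {

    // Minimal stand-in for the (partition, filename, isDeleted) Tuple3 used in the PR
    record FileFlag(String partition, String filename, boolean isDeleted) {}

    // Merge both maps into one flat list so a single distributed pass can
    // handle deletes and appends uniformly, branching on isDeleted per file
    static List<FileFlag> flatten(Map<String, List<String>> partitionToDeletedFiles,
                                  Map<String, List<String>> partitionToAppendedFiles) {
        List<FileFlag> tuples = new ArrayList<>();
        partitionToDeletedFiles.forEach((partition, files) ->
            files.forEach(f -> tuples.add(new FileFlag(partition, f, true))));
        partitionToAppendedFiles.forEach((partition, files) ->
            files.forEach(f -> tuples.add(new FileFlag(partition, f, false))));
        return tuples;
    }

    public static void main(String[] args) {
        List<FileFlag> tuples = flatten(
            Map.of("2023/08/10", List.of("old.parquet")),
            Map.of("2023/08/11", List.of("new1.parquet", "new2.parquet")));
        System.out.println(tuples.size()); // 1 delete + 2 appends = 3 tuples
    }
}
```

Sizing the parallelism from the flattened list, as the diff does with `Math.max(Math.min(size, parallelism), 1)`, then distributes work per file rather than per partition, which evens out skew when a few partitions hold most of the files.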
[jira] [Closed] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions
[ https://issues.apache.org/jira/browse/HUDI-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain closed HUDI-6677. - Resolution: Not A Problem > Make HoodieRecordIndexInfo schema compatible with older versions > > > Key: HUDI-6677 > URL: https://issues.apache.org/jira/browse/HUDI-6677 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently the metadata payload schema for record index can cause schema > evolution issues for existing hudi tables. The Jira aims to fix these issues. > There are two schema evolution issues -: > 1. The field name has changed from partition to partitionName. > 2. Also we have added a new field fileId in between a nested schema.
[GitHub] [hudi] lokeshj1703 closed pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo schema compatible with older versions
lokeshj1703 closed pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo schema compatible with older versions URL: https://github.com/apache/hudi/pull/9415
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674190791 ## CI report: * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19256) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674190580 ## CI report: * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662) * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19262) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674190763 ## CI report: * 3792e6de4fbf7642011c3d723f8e514f89c991ae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19248) * 642c6dd967978781d41b74138f89fae26192056b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19263) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
hudi-bot commented on PR #9416: URL: https://github.com/apache/hudi/pull/9416#issuecomment-1674185958 ## CI report: * 3792e6de4fbf7642011c3d723f8e514f89c991ae Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19248) * 642c6dd967978781d41b74138f89fae26192056b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674185739 ## CI report: * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662) * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN * 4afb334b5018c2bd9888716ed5e9abbeb4d10589 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674181982 ## CI report: * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662) * d773045840e52ecc767bfa8716a3a3287ee6aa93 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
nsivabalan commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1674179663 @codope : addressed all feedback.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
nsivabalan commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863005 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -848,64 +851,49 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn Map> partitionToAppendedFiles, MetadataRecordsGenerationParams recordsGenerationParams, String instantTime) { -HoodieData allRecordsRDD = engineContext.emptyHoodieData(); - -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); - -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} - -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); -allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); +// Total number of files which are added or deleted +final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum() ++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum(); + +// Create the tuple (partition, filename, isDeleted) to handle both deletes and appends +final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles); +partitionToDeletedFiles.entrySet().stream() 
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); +partitionToAppendedFiles.entrySet().stream() +.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); Review Comment: there are some minor differences between column stats and the bloom filter w.r.t. log file handling, so maybe we can leave it as is.
[jira] [Created] (HUDI-6684) Follow up/ fix missing records from bloom filter partition in MDT
sivabalan narayanan created HUDI-6684: - Summary: Follow up/ fix missing records from bloom filter partition in MDT Key: HUDI-6684 URL: https://issues.apache.org/jira/browse/HUDI-6684 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan As of now, if a bloom filter for a file is missing from the bloom filter partition in MDT, we ignore it. HoodieMetadataTableUtil {code:java} // If reading the bloom filter failed then do not add a record for this file if (bloomFilterBuffer == null) { LOG.error("Failed to read bloom filter from " + addedFilePath); return Stream.empty().iterator(); } } {code} We should think about in what scenarios this is possible and how exactly we can handle such situations.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
nsivabalan commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863860 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -848,64 +851,49 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn Map> partitionToAppendedFiles, MetadataRecordsGenerationParams recordsGenerationParams, String instantTime) { -HoodieData allRecordsRDD = engineContext.emptyHoodieData(); - -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); - -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} - -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); -allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); +// Total number of files which are added or deleted +final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum() ++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum(); + +// Create the tuple (partition, filename, isDeleted) to handle both deletes and appends +final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles); +partitionToDeletedFiles.entrySet().stream() 
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); +partitionToAppendedFiles.entrySet().stream() +.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); + +// Create records MDT +int parallelism = Math.max(Math.min(partitionFileFlagTupleList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); +return engineContext.parallelize(partitionFileFlagTupleList, parallelism).flatMap(partitionFileFlagTuple -> { + final String partitionName = partitionFileFlagTuple._1(); + final String filename = partitionFileFlagTuple._2(); + final boolean isDeleted = partitionFileFlagTuple._3(); + if (!FSUtils.isBaseFile(new Path(filename))) { +LOG.warn(String.format("Ignoring file %s as it is not a base file", filename)); +return Stream.empty().iterator(); + } -List>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet() -.stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList()); -parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism); + // Read the bloom filter from the base file if the file is being added + ByteBuffer bloomFilterBuffer = ByteBuffer.allocate(0); + if (!isDeleted) { +final String pathWithPartition = partitionName + "/" + filename; +final Path addedFilePath = new Path(recordsGenerationParams.getDataMetaClient().getBasePath(), pathWithPartition); +bloomFilterBuffer = readBloomFilter(recordsGenerationParams.getDataMetaClient().getHadoopConf(), addedFilePath); + +// If reading the bloom filter failed then do not add a record for this file +if (bloomFilterBuffer == null) { 
+ LOG.error("Failed to read bloom filter from " + addedFilePath); + return Stream.empty().iterator(); Review Comment: https://issues.apache.org/jira/browse/HUDI-6684
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
nsivabalan commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1290863005 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -848,64 +851,49 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn Map> partitionToAppendedFiles, MetadataRecordsGenerationParams recordsGenerationParams, String instantTime) { -HoodieData allRecordsRDD = engineContext.emptyHoodieData(); - -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); - -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} - -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); -allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); +// Total number of files which are added or deleted +final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum() ++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum(); + +// Create the tuple (partition, filename, isDeleted) to handle both deletes and appends +final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles); +partitionToDeletedFiles.entrySet().stream() 
+.flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); +partitionToAppendedFiles.entrySet().stream() +.flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false))) +.collect(Collectors.toCollection(() -> partitionFileFlagTupleList)); Review Comment: there are some minor difference b/w col stats and bloom filter wrt log file handling. So, may be we can leave it as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
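The refactor under review replaces per-map RDD unions with a single pre-sized list of (partition, filename, isDeleted) tuples covering both deleted and appended files. A minimal, self-contained Java sketch of that idea, outside Hudi's `HoodieData`/engine abstractions (`FileFlag` is a hypothetical stand-in for the `Tuple3` used in the PR):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileFlagTupleSketch {
    // Hypothetical stand-in for the (partition, filename, isDeleted) tuple in the PR.
    static final class FileFlag {
        final String partition;
        final String fileName;
        final boolean isDeleted;
        FileFlag(String partition, String fileName, boolean isDeleted) {
            this.partition = partition;
            this.fileName = fileName;
            this.isDeleted = isDeleted;
        }
    }

    static List<FileFlag> buildTuples(Map<String, List<String>> partitionToDeletedFiles,
                                      Map<String, Map<String, Long>> partitionToAppendedFiles) {
        // Pre-size the list with the total number of deleted + appended files.
        int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
            + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
        List<FileFlag> tuples = new ArrayList<>(totalFiles);
        // Deleted files carry isDeleted = true.
        partitionToDeletedFiles.forEach((partition, files) ->
            files.forEach(f -> tuples.add(new FileFlag(partition, f, true))));
        // Appended files (map of filename -> size) carry isDeleted = false.
        partitionToAppendedFiles.forEach((partition, files) ->
            files.keySet().forEach(f -> tuples.add(new FileFlag(partition, f, false))));
        return tuples;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deleted = new HashMap<>();
        deleted.put("2023/08/10", Arrays.asList("a.parquet", "b.parquet"));
        Map<String, Map<String, Long>> appended = new HashMap<>();
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("c.parquet", 1024L);
        appended.put("2023/08/11", sizes);

        List<FileFlag> tuples = buildTuples(deleted, appended);
        if (tuples.size() != 3) throw new AssertionError("expected 3 tuples");
        System.out.println("tuples=" + tuples.size());
    }
}
```

Pre-sizing the list with `totalFiles` avoids repeated `ArrayList` growth when a commit touches many files, and a single list lets the caller parallelize once instead of unioning separate RDDs for deletes and appends.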
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
nsivabalan commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1290861309 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -915,65 +903,60 @@ public static HoodieData convertFilesToColumnStatsRecords(HoodieEn Map> partitionToDeletedFiles, Map> partitionToAppendedFiles, MetadataRecordsGenerationParams recordsGenerationParams) { -HoodieData allRecordsRDD = engineContext.emptyHoodieData(); +// Find the columns to index HoodieTableMetaClient dataTableMetaClient = recordsGenerationParams.getDataMetaClient(); - final List columnsToIndex = getColumnsToIndex(recordsGenerationParams, Lazy.lazily(() -> tryResolveSchemaForTable(dataTableMetaClient))); - if (columnsToIndex.isEmpty()) { // In case there are no columns to index, bail return engineContext.emptyHoodieData(); } -final List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream() -.map(e -> Pair.of(e.getKey(), e.getValue())) -.collect(Collectors.toList()); - -int deletedFilesTargetParallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1); -final HoodieData>> partitionToDeletedFilesRDD = -engineContext.parallelize(partitionToDeletedFilesList, deletedFilesTargetParallelism); - -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionPath = partitionToDeletedFilesPair.getLeft(); - final String partitionId = getPartitionIdentifier(partitionPath); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - - return deletedFileList.stream().flatMap(deletedFile -> { -final String filePathWithPartition = partitionPath + "/" + deletedFile; -return getColumnStatsRecords(partitionId, filePathWithPartition, dataTableMetaClient, columnsToIndex, true); - }).iterator(); -}); - -allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); - -final List>> partitionToAppendedFilesList 
= partitionToAppendedFiles.entrySet().stream() -.map(entry -> Pair.of(entry.getKey(), entry.getValue())) -.collect(Collectors.toList()); - -int appendedFilesTargetParallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1); -final HoodieData>> partitionToAppendedFilesRDD = -engineContext.parallelize(partitionToAppendedFilesList, appendedFilesTargetParallelism); - -HoodieData appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesPair -> { - final String partitionPath = partitionToAppendedFilesPair.getLeft(); - final String partitionId = getPartitionIdentifier(partitionPath); - final Map appendedFileMap = partitionToAppendedFilesPair.getRight(); +LOG.info(String.format("Indexing %d columns for column stats index", columnsToIndex.size())); + +// Total number of files which are added or deleted +final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum() ++ partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum(); + +// Create the tuple (partition, filename, isDeleted) to handle both deletes and appends +final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles); Review Comment: we do N * M. where N = columns to index. and M = tuple (partition, filename, isDeleted). So, we don't need it here. you can check this method getColumnStatsRecords(partitionId, filePathWithPartition, dataTableMetaClient, columnsToIndex, isDeleted).iterator();
[hudi] branch master updated: [HUDI-6670] Fix timeline check in metadata table validator (#9405)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 638e52d90ed [HUDI-6670] Fix timeline check in metadata table validator (#9405) 638e52d90ed is described below commit 638e52d90eda2d7c1e78a87f08427e5e3bf0a46c Author: Y Ethan Guo AuthorDate: Thu Aug 10 20:29:36 2023 -0700 [HUDI-6670] Fix timeline check in metadata table validator (#9405) --- .../org/apache/hudi/utilities/HoodieMetadataTableValidator.java | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java index d79957c735f..29e59df6935 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java @@ -491,10 +491,10 @@ public class HoodieMetadataTableValidator implements Serializable { .setConf(jsc.hadoopConfiguration()).setBasePath(new Path(cfg.basePath, HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH).toString()) .setLoadActiveTimelineOnLoad(true) .build(); - int finishedInstants = mdtMetaClient.getActiveTimeline().filterCompletedInstants().countInstants(); + int finishedInstants = mdtMetaClient.getCommitsTimeline().filterCompletedInstants().countInstants(); if (finishedInstants == 0) { -if (metaClient.getActiveTimeline().filterCompletedInstants().countInstants() == 0) { - LOG.info("There is no completed instant both in metadata table and corresponding data table."); +if (metaClient.getCommitsTimeline().filterCompletedInstants().countInstants() == 0) { + LOG.info("There is no completed commit in both metadata table and corresponding data table."); return false; } else { throw new 
HoodieValidationException("There is no completed instant for metadata table.");
[GitHub] [hudi] yihua merged pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator
yihua merged PR #9405: URL: https://github.com/apache/hudi/pull/9405
[GitHub] [hudi] yihua commented on pull request #9405: [HUDI-6670] Fix timeline check in metadata table validator
yihua commented on PR #9405: URL: https://github.com/apache/hudi/pull/9405#issuecomment-1674170095 Azure CI timeout is irrelevant.
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
Zouxxyy commented on code in PR #9416: URL: https://github.com/apache/hudi/pull/9416#discussion_r1290859610 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -452,107 +431,137 @@ private Stream getCommitInstantsToArchive() throws IOException { ? CompactionUtils.getOldestInstantToRetainForCompaction( table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax()) : Option.empty(); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForCompaction); - // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, + // 3. The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers. // Meanwhile, when inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline // to check whether the file slice generated in pending clustering after archive isn't committed. Option oldestInstantToRetainForClustering = ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(), table.getMetaClient()); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForClustering); + + // 4. If metadata table is enabled, do not archive instants which are more recent than the last compaction on the + // metadata table. 
+ if (table.getMetaClient().getTableConfig().isMetadataTableAvailable()) { +try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(), config.getBasePath())) { + Option latestCompactionTime = tableMetadata.getLatestCompactionTime(); + if (!latestCompactionTime.isPresent()) { +LOG.info("Not archiving as there is no compaction yet on the metadata table"); +return Collections.emptyList(); + } else { +LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get()); +oldestInstantToRetainCandidates.add(Option.of(new HoodieInstant( +HoodieInstant.State.COMPLETED, COMPACTION_ACTION, latestCompactionTime.get(; + } +} catch (Exception e) { + throw new HoodieException("Error limiting instant archival based on metadata table", e); +} + } + + // 5. If this is a metadata table, do not archive the commits that live in data set + // active timeline. This is required by metadata table, + // see HoodieTableMetadataUtil#processRollbackMetadata for details. + if (table.isMetadataTable()) { +HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder() + .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath())) +.setConf(metaClient.getHadoopConf()) +.build(); +Option qualifiedEarliestInstant = +TimelineUtils.getEarliestInstantForMetadataArchival( +dataMetaClient.getActiveTimeline(), config.shouldArchiveBeyondSavepoint()); + +// Do not archive the instants after the earliest commit (COMMIT, DELTA_COMMIT, and +// REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive +// beyond savepoint) and the earliest inflight instant (all actions). +// This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata +// for details. +// Todo: Remove #7580 Review Comment: After this PR, #7580 is no longer useful; consider removing or simplifying it.
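The pattern in the diff above collects several optional "oldest instant to retain" constraints (pending compaction, clustering, latest metadata-table compaction, ...) and archives only instants older than every present candidate. A hedged sketch of that selection logic, assuming Hudi's lexicographically sortable instant timestamps (`earliestToRetain` is a hypothetical helper, not a Hudi API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class OldestInstantSketch {
    // Collect every "oldest instant to retain" candidate; an empty Optional means
    // that constraint does not apply. The earliest present candidate bounds archival.
    static Optional<String> earliestToRetain(List<Optional<String>> candidates) {
        return candidates.stream()
            .filter(Optional::isPresent)
            .map(Optional::get)
            .min(String::compareTo); // Hudi instant times sort lexicographically
    }

    public static void main(String[] args) {
        List<Optional<String>> candidates = new ArrayList<>();
        candidates.add(Optional.of("20230811103000000")); // e.g. oldest pending compaction
        candidates.add(Optional.empty());                 // e.g. no clustering constraint
        candidates.add(Optional.of("20230810080000000")); // e.g. latest MDT compaction
        Optional<String> earliest = earliestToRetain(candidates);
        if (!"20230810080000000".equals(earliest.orElse(null))) throw new AssertionError();
        System.out.println("earliestToRetain=" + earliest.get());
    }
}
```

Folding every constraint into one candidate list, as the PR does, avoids the earlier pattern of each rule independently filtering the archival set.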
[GitHub] [hudi] yihua commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
yihua commented on code in PR #9421: URL: https://github.com/apache/hudi/pull/9421#discussion_r1290853583 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java: ## @@ -217,7 +217,7 @@ public static Pair> filterAndGenerateChe row = collectedRows.select(queryInfo.getOrderColumn(), queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy( col(queryInfo.getOrderColumn()).desc(), col(queryInfo.getKeyColumn()).desc()).first(); } -LOG.info("Processed batch size: " + row.getLong(2) + " bytes"); +LOG.info("Processed batch size: " + row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes"); Review Comment: Got it
[GitHub] [hudi] JingFengWang opened a new issue, #9424: 'read.utc-timezone=false' has no effect on writes
JingFengWang opened a new issue, #9424: URL: https://github.com/apache/hudi/issues/9424 **_Tips before filing an issue_** hudi 0.14.0 hudi-flink-bundle The COW/MOR table type writes timestamp data, and the time zone for writing data when read.utc-timezone=false is set is still the UTC time zone. AvroToRowDataConverters and RowDataToAvroConverters timestamp time zone conversion is hardcoded to UTC time zone. **Describe the problem you faced** 1. hudi-flink1.13-bundle-0.14.0-rc1 write timestamp does not support configuration time zone type 2. The read.utc-timezone attribute only takes effect when the data is read **To Reproduce** Steps to reproduce the behavior: 1. ./bin/sql-client.sh embedded -j hudi-flink1.13-bundle-0.14.0-rc1.jar shell 2. When setting read.utc-timezone=true, it is normal to write query timestamp data 3. When setting read.utc-timezone= false to write data, the query time will be -8 hours ```sql Flink SQL> select LOCALTIMESTAMP as tm, timestamph from hudi_mor_all_datatype_2 where inth=44; ++-+-+ | op | tm | timestamph | ++-+-+ | +I | 2023-08-11 10:36:38.793 | 2023-08-11 03:10:17.267 | ++-+-+ ``` **Expected behavior** hudi-flink1.13-bundle supports writing timestamps in non-UTC time zones in a configurable way **Environment Description** * Hudi version : 0.14.0 * Spark version : 3.2.0 * Flink version: 1.13.2 * Hive version : 1.11.1 * Hadoop version : 3.x * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no **Related code location** ```java public class AvroToRowDataConverters { // ... private static AvroToRowDataConverter createTimestampConverter(int precision) { // ... 
return avroObject -> { final Instant instant; if (avroObject instanceof Long) { instant = Instant.EPOCH.plus((Long) avroObject, chronoUnit); } else if (avroObject instanceof Instant) { instant = (Instant) avroObject; } else { JodaConverter jodaConverter = JodaConverter.getConverter(); if (jodaConverter != null) { // joda time has only millisecond precision instant = Instant.ofEpochMilli(jodaConverter.convertTimestamp(avroObject)); } else { throw new IllegalArgumentException( "Unexpected object type for TIMESTAMP logical type. Received: " + avroObject); } } // TODO:Hardcoded to UTC here return TimestampData.fromInstant(instant); }; } // ... } public class RowDataToAvroConverters { // ... public static RowDataToAvroConverter createConverter(LogicalType type) { // ... case TIMESTAMP_WITHOUT_TIME_ZONE: final int precision = DataTypeUtils.precision(type); if (precision <= 3) { converter = new RowDataToAvroConverter() { private static final long serialVersionUID = 1L; @Override public Object convert(Schema schema, Object object) { // TODO:Hardcoded to UTC here return ((TimestampData) object).toInstant().toEpochMilli(); } }; } else if (precision <= 6) { converter = new RowDataToAvroConverter() { private static final long serialVersionUID = 1L; @Override public Object convert(Schema schema, Object object) { // TODO:Hardcoded to UTC here Instant instant = ((TimestampData) object).toInstant(); return Math.addExact(Math.multiplyExact(instant.getEpochSecond(), 1000_000), instant.getNano() / 1000); } }; } else { throw new UnsupportedOperationException("Unsupported timestamp precision: " + precision); } break; // ... } // ... } ```
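The -8 hour symptom reported above follows directly from the hardcoded-UTC conversion: the same epoch millisecond renders as different wall-clock values depending on the zone used to interpret it. A small illustration using only `java.time` (Asia/Shanghai, UTC+8, stands in here for the reporter's session zone):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class UtcHardcodingSketch {
    public static void main(String[] args) {
        // A timestamp stored as epoch millis by the hardcoded-UTC converter path.
        long epochMilli = Instant.parse("2023-08-11T03:10:17.267Z").toEpochMilli();

        // Reading it back in UTC (what the writer assumed) vs. the session zone
        // (what a user setting read.utc-timezone=false expects, e.g. UTC+8).
        LocalDateTime asUtc =
            LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMilli), ZoneOffset.UTC);
        LocalDateTime asLocal =
            LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMilli), ZoneId.of("Asia/Shanghai"));

        System.out.println("utc=" + asUtc);     // 2023-08-11T03:10:17.267
        System.out.println("local=" + asLocal); // 2023-08-11T11:10:17.267
        // The two wall-clock readings differ by exactly the zone offset (8 hours).
        if (!asLocal.minusHours(8).equals(asUtc)) throw new AssertionError();
    }
}
```

Because both the write path (`RowDataToAvroConverters`) and the read path hardcode UTC, only the read-side `read.utc-timezone` flag can compensate; writing with a local-zone assumption and reading in UTC produces the apparent -8 hour shift.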
[GitHub] [hudi] danny0405 commented on issue #9344: [SUPPORT] Getting error when writing to different HUDI tables in different threads in same job
danny0405 commented on issue #9344: URL: https://github.com/apache/hudi/issues/9344#issuecomment-1674158246 I'm assuming you are using the MDT, did you check the existence of the missing file: ```xml ... 1 more Caused by: java.io.FileNotFoundException: No such file or directory: s3a://***/hudi_parallel_process/assets/asset_group/c9a7b1d3-c065-4902-a605-0fc114f33b2c-0_0-370-76132_20230801080422725.parquet ```
[hudi] branch master updated: [MINOR] Unify class name of Spark Procedure (#9414)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new d288d97fb40 [MINOR] Unify class name of Spark Procedure (#9414) d288d97fb40 is described below commit d288d97fb4031e71afce6ee3cfe7c286f3204e76 Author: Kunni AuthorDate: Fri Aug 11 10:57:48 2023 +0800 [MINOR] Unify class name of Spark Procedure (#9414) --- .../{CopyToTempView.scala => CopyToTempViewProcedure.scala} | 8 .../spark/sql/hudi/command/procedures/HoodieProcedures.scala | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala similarity index 95% rename from hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala rename to hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala index 89c00dac6e4..a23eea1363e 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempView.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/CopyToTempViewProcedure.scala @@ -24,7 +24,7 @@ import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} import java.util.function.Supplier -class CopyToTempView extends BaseProcedure with ProcedureBuilder with Logging { +class CopyToTempViewProcedure extends BaseProcedure with ProcedureBuilder with Logging { private val PARAMETERS = Array[ProcedureParameter]( ProcedureParameter.required(0, "table", DataTypes.StringType), @@ -102,13 +102,13 @@ class CopyToTempView extends BaseProcedure with 
ProcedureBuilder with Logging { Seq(Row(0)) } - override def build = new CopyToTempView() + override def build = new CopyToTempViewProcedure() } -object CopyToTempView { +object CopyToTempViewProcedure { val NAME = "copy_to_temp_view" def builder: Supplier[ProcedureBuilder] = new Supplier[ProcedureBuilder] { -override def get() = new CopyToTempView() +override def get() = new CopyToTempViewProcedure() } } diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala index d54c9811925..ad63ddbb29e 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/HoodieProcedures.scala @@ -84,7 +84,7 @@ object HoodieProcedures { ,(ValidateHoodieSyncProcedure.NAME, ValidateHoodieSyncProcedure.builder) ,(ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder) ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder) - ,(CopyToTempView.NAME, CopyToTempView.builder) + ,(CopyToTempViewProcedure.NAME, CopyToTempViewProcedure.builder) ,(ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder) ,(ShowTablePropertiesProcedure.NAME, ShowTablePropertiesProcedure.builder) ,(HelpProcedure.NAME, HelpProcedure.builder)
[GitHub] [hudi] danny0405 merged pull request #9414: [MINOR] Unify class name of Spark Procedure
danny0405 merged PR #9414: URL: https://github.com/apache/hudi/pull/9414
[GitHub] [hudi] danny0405 closed issue #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit
danny0405 closed issue #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit URL: https://github.com/apache/hudi/issues/9420
[GitHub] [hudi] danny0405 commented on issue #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit
danny0405 commented on issue #9420: URL: https://github.com/apache/hudi/issues/9420#issuecomment-1674155705 Fixed in https://github.com/apache/hudi/pull/9421.
[hudi] branch master updated (e6d1e419c99 -> 6a8f00a1820)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from e6d1e419c99 [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423) add 6a8f00a1820 [HUDI-6680] Fixing the info log to fetch column value by name instead of index (#9421) No new revisions were added by this update. Summary of changes: .../org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[GitHub] [hudi] danny0405 merged pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
danny0405 merged PR #9421: URL: https://github.com/apache/hudi/pull/9421
[GitHub] [hudi] danny0405 commented on a diff in pull request #9413: [HUDI-6675] Fix Clean action will delete the whole table
danny0405 commented on code in PR #9413: URL: https://github.com/apache/hudi/pull/9413#discussion_r1290848015 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java: ## @@ -147,7 +148,9 @@ List clean(HoodieEngineContext context, HoodieCleanerPlan clean List partitionsToBeDeleted = cleanerPlan.getPartitionsToBeDeleted() != null ? cleanerPlan.getPartitionsToBeDeleted() : new ArrayList<>(); partitionsToBeDeleted.forEach(entry -> { try { -deleteFileAndGetResult(table.getMetaClient().getFs(), table.getMetaClient().getBasePath() + "/" + entry); +if (!StringUtils.isNullOrEmpty(entry)) { + deleteFileAndGetResult(table.getMetaClient().getFs(), table.getMetaClient().getBasePath() + "/" + entry); Review Comment: Kind of think the `cleanerPlan.getPartitionsToBeDeleted()` should be fixed, can we write a test case for it.
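The guard discussed above matters because an empty partition entry concatenates to `basePath + "/" + ""`, which resolves to the table root itself, so the cleaner would delete the whole table. A simplified sketch of the guarded path resolution (a hypothetical helper, not the actual `CleanActionExecutor` code):

```java
public class CleanPathGuardSketch {
    // Sketch of the bug fixed in HUDI-6675: an empty entry in partitionsToBeDeleted
    // would resolve to the table base path, deleting the entire table. The guard
    // mirrors the PR's StringUtils.isNullOrEmpty check.
    static String pathToDelete(String basePath, String partition) {
        if (partition == null || partition.isEmpty()) {
            return null; // skip: deleting basePath + "/" + "" would wipe the table root
        }
        return basePath + "/" + partition;
    }

    public static void main(String[] args) {
        String basePath = "/warehouse/hudi_table";
        if (pathToDelete(basePath, "") != null) {
            throw new AssertionError("empty partition must be skipped");
        }
        if (!"/warehouse/hudi_table/2023/08/10".equals(pathToDelete(basePath, "2023/08/10"))) {
            throw new AssertionError("valid partition must resolve under the base path");
        }
        System.out.println("guard ok");
    }
}
```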
[jira] [Created] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
Danny Chen created HUDI-6683: Summary: Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource Key: HUDI-6683 URL: https://issues.apache.org/jira/browse/HUDI-6683 Project: Apache Hudi Issue Type: New Feature Components: deltastreamer Reporter: Danny Chen Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new e6d1e419c99 [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423) e6d1e419c99 is described below commit e6d1e419c99f8226c831b6ccbcd22b07510f0fbc Author: Sagar Sumit AuthorDate: Fri Aug 11 08:12:38 2023 +0530 [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423) --- azure-pipelines-20230430.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml index 75c231b74dc..2da5ab0d4f9 100644 --- a/azure-pipelines-20230430.yml +++ b/azure-pipelines-20230430.yml @@ -188,7 +188,7 @@ stages: displayName: Top 100 long-running testcases - job: UT_FT_4 displayName: UT FT other modules -timeoutInMinutes: '180' +timeoutInMinutes: '240' steps: - task: Maven@4 displayName: maven install
[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
danny0405 commented on code in PR #9412: URL: https://github.com/apache/hudi/pull/9412#discussion_r1290844141 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala: ## @@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase { } } + test("Test create table like") { +if (HoodieSparkUtils.gteqSpark3_1) { + // 1. Test create table from an existing HUDI table + withTempDir { tmp => Review Comment: We should avoid misusage of possible, or make it clear on the document.
[GitHub] [hudi] nsivabalan merged pull request #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours
nsivabalan merged PR #9423: URL: https://github.com/apache/hudi/pull/9423
[GitHub] [hudi] danny0405 commented on pull request #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours
danny0405 commented on PR #9423: URL: https://github.com/apache/hudi/pull/9423#issuecomment-1674148873 4 hours is quite long, not sure we should do this.
[hudi] branch master updated: [MINOR] asyncService log prompt incomplete (#9407)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 28d43f3a4f9 [MINOR] asyncService log prompt incomplete (#9407) 28d43f3a4f9 is described below commit 28d43f3a4f92c4996712cdb5abc13e0b2b7897e8 Author: empcl <1515827...@qq.com> AuthorDate: Fri Aug 11 10:38:10 2023 +0800 [MINOR] asyncService log prompt incomplete (#9407) --- .../src/main/java/org/apache/hudi/async/HoodieAsyncService.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java index 4c1dddf265e..f022e710456 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java @@ -196,11 +196,11 @@ public abstract class HoodieAsyncService implements Serializable { } /** - * Enqueues new pending clustering instant. + * Enqueues new pending table service instant. * @param instant {@link HoodieInstant} to enqueue. */ public void enqueuePendingAsyncServiceInstant(HoodieInstant instant) { -LOG.info("Enqueuing new pending clustering instant: " + instant.getTimestamp()); +LOG.info("Enqueuing new pending table service instant: " + instant.getTimestamp()); pendingInstants.add(instant); }
[GitHub] [hudi] danny0405 merged pull request #9407: asyncService log prompt incomplete
danny0405 merged PR #9407: URL: https://github.com/apache/hudi/pull/9407
[jira] [Created] (HUDI-6682) Redistribute Azure CI test modules to reduce overall time for UT FT other module
Sagar Sumit created HUDI-6682: - Summary: Redistribute Azure CI test modules to reduce overall time for UT FT other module Key: HUDI-6682 URL: https://issues.apache.org/jira/browse/HUDI-6682 Project: Apache Hudi Issue Type: Task Reporter: Sagar Sumit
[GitHub] [hudi] codope opened a new pull request, #9423: [MINOR] Increase CI timeout for UT FT other modules to 4 hours
codope opened a new pull request, #9423: URL: https://github.com/apache/hudi/pull/9423 ### Change Logs UT FT other modules consistently taking more than 3 hours. HUDI-6682 to track better redistribution of tests. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] danny0405 commented on issue #9384: [SUPPORT] TransactionParticipant not getting created
danny0405 commented on issue #9384: URL: https://github.com/apache/hudi/issues/9384#issuecomment-1674140682 Not quite sure, but the jar you used seems to require TLS authentication.
[GitHub] [hudi] voonhous commented on issue #8843: Memory leak caused by hudi if got exception when constructing record reader
voonhous commented on issue #8843: URL: https://github.com/apache/hudi/issues/8843#issuecomment-1674139125 Refer to stack trace here: https://github.com/apache/hudi/pull/8839#issuecomment-1674138771
[hudi] branch master updated: [HUDI-6679] Fix initialization of metadata table partitions upon failure (#9419)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 8ccd7da2936 [HUDI-6679] Fix initialization of metadata table partitions upon failure (#9419)

8ccd7da2936 is described below

commit 8ccd7da293620ee94fb08035c04ddc595651332f
Author: Y Ethan Guo
AuthorDate: Thu Aug 10 19:17:07 2023 -0700

    [HUDI-6679] Fix initialization of metadata table partitions upon failure (#9419)
---
 .../hudi/client/BaseHoodieTableServiceClient.java |   8 +-
 .../metadata/HoodieBackedTableMetadataWriter.java |   7 +-
 .../functional/TestHoodieBackedMetadata.java      | 123 -
 3 files changed, 128 insertions(+), 10 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index e55fb045e1e..7e78bddd875 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -57,7 +57,6 @@ import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieLogCompactException;
 import org.apache.hudi.exception.HoodieRollbackException;
-import org.apache.hudi.metadata.HoodieTableMetadata;
 import org.apache.hudi.metadata.HoodieTableMetadataWriter;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.action.HoodieWriteMetadata;
@@ -88,6 +87,7 @@ import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION
 import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
 import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
+import static org.apache.hudi.metadata.HoodieTableMetadata.isMetadataTable;
 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit;

 /**
@@ -932,8 +932,10 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
     LinkedHashMap> reverseSortedRollbackInstants = instantsToRollback.entrySet()
         .stream().sorted((i1, i2) -> i2.getKey().compareTo(i1.getKey()))
         .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
+    boolean isMetadataTable = isMetadataTable(basePath);
     for (Map.Entry> entry : reverseSortedRollbackInstants.entrySet()) {
-      if (HoodieTimeline.compareTimestamps(entry.getKey(), HoodieTimeline.LESSER_THAN_OR_EQUALS,
+      if (!isMetadataTable
+          && HoodieTimeline.compareTimestamps(entry.getKey(), HoodieTimeline.LESSER_THAN_OR_EQUALS,
           HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS)) {
         // do we need to handle failed rollback of a bootstrap
         rollbackFailedBootstrap();
@@ -954,7 +956,7 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
     // from the async indexer (`HoodieIndexer`).
     // TODO(HUDI-5733): This should be cleaned up once the proper fix of rollbacks in the
     // metadata table is landed.
-    if (HoodieTableMetadata.isMetadataTable(metaClient.getBasePathV2().toString())) {
+    if (isMetadataTable(metaClient.getBasePathV2().toString())) {
       return inflightInstantsStream.map(HoodieInstant::getTimestamp).filter(entry -> {
         if (curInstantTime.isPresent()) {
           return !entry.equals(curInstantTime.get());
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 4f965e587cb..74d8ae16176 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -112,7 +112,6 @@ import static org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deseri
 import static org.apache.hudi.metadata.HoodieTableMetadata.METADATA_TABLE_NAME_SUFFIX;
 import static org.apache.hudi.metadata.HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP;
 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.createRollbackTimestamp;
-import static org.apache.hudi.metadata.HoodieTableMetadataUtil.getInflightAndCompletedMetadataPartitions;
 import static org.apache.hudi.metadata.HoodieTableMetadataUtil.getInflightMetadataPartitions;

 /**
@@ -257,10 +256,10 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
     // check if any of the enabl
[GitHub] [hudi] voonhous commented on pull request #8839: [HUDI-6287] Fix Memory Leak in RealtimeCompactedRecordReader
voonhous commented on PR #8839: URL: https://github.com/apache/hudi/pull/8839#issuecomment-1674138771

```text
2023-08-11T00:17:44.546+0800 WARN 20230810_161734_00541_uhtxz.1.104.0-48-1048 org.apache.hadoop.hdfs.client.impl.BlockReaderFactory I/O error constructing remote block reader.
java.net.SocketException: Connection reset
    at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
    at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
    at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:547)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:407)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:649)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:580)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:762)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:834)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:686)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:82)
    at org.apache.parquet.io.DelegatingSeekableInputStream.read(DelegatingSeekableInputStream.java:61)
    at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:548)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:528)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:522)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:470)
    at org.apache.hudi.common.table.TableSchemaResolver.readSchemaFromParquetBaseFile(TableSchemaResolver.java:349)
    at org.apache.hudi.common.table.TableSchemaResolver.readSchemaFromBaseFile(TableSchemaResolver.java:549)
    at org.apache.hudi.common.table.TableSchemaResolver.fetchSchemaFromFiles(TableSchemaResolver.java:541)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableParquetSchemaFromDataFile(TableSchemaResolver.java:266)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:119)
    at org.apache.hudi.common.table.TableSchemaResolver.hasOperationField(TableSchemaResolver.java:472)
    at org.apache.hudi.util.Lazy.get(Lazy.java:53)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableSchemaFromLatestCommitMetadata(TableSchemaResolver.java:223)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaInternal(TableSchemaResolver.java:191)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchema(TableSchemaResolver.java:140)
    at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchema(TableSchemaResolver.java:129)
    at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:144)
    at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:96)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:64)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
    at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:81)
    at io.trino.plugin.hudi.HudiRecordCursors.createRecordReader(HudiRecordCursors.java:109)
    at io.trino.plugin.hudi.HudiRecordCursors.la
```
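The leak this PR addresses arises when a record reader's constructor throws (here, while resolving the table schema) after underlying resources have already been opened, so nothing ever closes them. The usual fix pattern is to close the partially-initialized resource before rethrowing. A minimal sketch in plain Java — the class names below are hypothetical stand-ins, not Hudi's actual reader classes:

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for an underlying reader that holds a socket/file handle.
class BaseReader implements Closeable {
    boolean closed = false;
    @Override
    public void close() { closed = true; }
}

// Hypothetical wrapper whose constructor can fail *after* the base reader is open.
class WrappingReader implements Closeable {
    private final BaseReader base;

    WrappingReader(BaseReader base, boolean failDuringInit) throws IOException {
        this.base = base;
        try {
            init(failDuringInit);
        } catch (IOException e) {
            // Without this close(), the base reader (and its connection) would leak.
            base.close();
            throw e;
        }
    }

    private void init(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated schema-resolution failure");
        }
    }

    @Override
    public void close() { base.close(); }
}
```

If construction succeeds, ownership of `base` passes to the wrapper and its `close()` releases it; if construction fails, the constructor itself releases it, so the caller never holds a half-built, unclosable object.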
[GitHub] [hudi] codope merged pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
codope merged PR #9419: URL: https://github.com/apache/hudi/pull/9419
[GitHub] [hudi] codope commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
codope commented on PR #9419: URL: https://github.com/apache/hudi/pull/9419#issuecomment-1674139030 I am landing it to save CI cycles. There are no failures in UT FT other modules. It's just timing out.
[GitHub] [hudi] boneanxs commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
boneanxs commented on code in PR #9412: URL: https://github.com/apache/hudi/pull/9412#discussion_r1290833633

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala:

## @@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase { } } + test("Test create table like") { +if (HoodieSparkUtils.gteqSpark3_1) { + // 1. Test create table from an existing HUDI table + withTempDir { tmp =>

Review Comment: Spark2 will use Spark's own `CreateTableLikeCommand`; we can't throw an error here since we can't distinguish whether the user wants to create a hudi table or not.

```scala
 * The syntax of using this command in SQL is (it doesn't support passing the targetTable's provider):
 * {{{
 *   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
 *   LIKE [other_db_name.]existing_table_name [locationSpec]
 * }}}
```

Spark2 is being deprecated, maybe only supporting spark3+ is enough?
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674135920 ## CI report: * f43453d4e334097d34f4606137247d217fdd253c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19254) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674127768 ## CI report: * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255) * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19256)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9415: [HUDI-6677] Make HoodieRecordIndexInfo schema compatible with older versions
danny0405 commented on code in PR #9415: URL: https://github.com/apache/hudi/pull/9415#discussion_r1290822177 ## hudi-common/src/main/avro/HoodieMetadata.avsc: ## @@ -369,7 +369,7 @@ "name": "HoodieRecordIndexInfo", "fields": [ { -"name": "partitionName", +"name": "partition", "type": [ Review Comment: Can we write a compatibility test for this class?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
danny0405 commented on code in PR #9412: URL: https://github.com/apache/hudi/pull/9412#discussion_r1290821705 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala: ## @@ -405,6 +405,145 @@ class TestCreateTable extends HoodieSparkSqlTestBase { } } + test("Test create table like") { +if (HoodieSparkUtils.gteqSpark3_1) { + // 1. Test create table from an existing HUDI table + withTempDir { tmp => Review Comment: So spark2 will throw an exception?
[GitHub] [hudi] danny0405 commented on a diff in pull request #8542: [HUDI-6123] Auto adjust lock configs only for single writer
danny0405 commented on code in PR #8542: URL: https://github.com/apache/hudi/pull/8542#discussion_r1290819723 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2483,8 +2483,15 @@ public boolean areReleaseResourceEnabled() { /** * Returns whether the explicit guard of lock is required. */ - public boolean needsLockGuard() { -return isMetadataTableEnabled() || getWriteConcurrencyMode().supportsOptimisticConcurrencyControl(); + public boolean isLockRequired() { +return !isDefaultLockProvider() || getWriteConcurrencyMode().supportsOptimisticConcurrencyControl(); Review Comment: Yeah, I was expecting the user to set up optimistic concurrency control explicitly.
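The rule in the diff above reduces to a single boolean: a lock guard is needed either when the user has explicitly configured a non-default lock provider or when the write concurrency mode supports optimistic concurrency control. A simplified sketch of that decision — the field and method names here are illustrative, not the real `HoodieWriteConfig` API:

```java
// Hypothetical, minimal model of the lock-guard decision discussed above.
class LockGuardSketch {
    private final boolean defaultLockProvider; // true if the user did not override the provider
    private final boolean occEnabled;          // optimistic concurrency control on/off

    LockGuardSketch(boolean defaultLockProvider, boolean occEnabled) {
        this.defaultLockProvider = defaultLockProvider;
        this.occEnabled = occEnabled;
    }

    // A lock is required for an explicit (non-default) provider or for OCC writers.
    boolean isLockRequired() {
        return !defaultLockProvider || occEnabled;
    }
}
```

Under this rule, a single writer on the default provider takes no lock, which is exactly the behavior change the thread debates.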
[GitHub] [hudi] danny0405 commented on a diff in pull request #8111: [HUDI-5887] Should not mark the concurrency mode as OCC by default when MDT is enabled
danny0405 commented on code in PR #8111: URL: https://github.com/apache/hudi/pull/8111#discussion_r1290818989 ## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java: ## @@ -140,8 +140,10 @@ public void testAutoConcurrencyConfigAdjustmentWithTableServices(HoodieTableType put(ASYNC_CLEAN.key(), "false"); put(HoodieWriteConfig.AUTO_ADJUST_LOCK_CONFIGS.key(), "true"); } -}), true, true, true, WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL, -HoodieFailedWritesCleaningPolicy.LAZY, inProcessLockProviderClassName); +}), true, true, true, Review Comment: If the metadata table is enabled, the lock should take effect; because the default lock provider class is `ZookeeperBasedLockProvider`, at least the in-process lock should work. See `HoodieWriteConfig.isLockRequired`.
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674105473 ## CI report: * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255) * 5b6ebb1c3008db7f8b41ee8371358e21652b02fa UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
hudi-bot commented on PR #9419: URL: https://github.com/apache/hudi/pull/9419#issuecomment-1674072360 ## CI report: * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19252)
[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
hudi-bot commented on PR #9421: URL: https://github.com/apache/hudi/pull/9421#issuecomment-1674067153 ## CI report: * 7cd01addabe76c50feb22f32c652a30be4902643 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19253)
[GitHub] [hudi] empcl closed pull request #9417: Database not found exception when resolving Spark synchronization hive
empcl closed pull request #9417: Database not found exception when resolving Spark synchronization hive URL: https://github.com/apache/hudi/pull/9417
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674038853 ## CI report: * 4b7280a248b923a107a71d7a741b971f140731e4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19255)
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674038750 ## CI report: * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230) * f43453d4e334097d34f4606137247d217fdd253c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19254)
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1674033378 ## CI report: * 4b7280a248b923a107a71d7a741b971f140731e4 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
hudi-bot commented on PR #9409: URL: https://github.com/apache/hudi/pull/9409#issuecomment-1674033270 ## CI report: * d567d80ea610ed8eca248901d310bd40ae4bf8e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19230) * f43453d4e334097d34f4606137247d217fdd253c UNKNOWN
[GitHub] [hudi] jonvex commented on a diff in pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
jonvex commented on code in PR #9409: URL: https://github.com/apache/hudi/pull/9409#discussion_r1290738278 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -148,7 +148,7 @@ case class HoodieFileIndex(spark: SparkSession, override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = { val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, partitionFilters).map { case (partitionOpt, fileSlices) => -if (shouldBroadcast) { +if (shouldEmbedFileSlices) { Review Comment: No it should not
[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly
[ https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Vexler updated HUDI-6681:

Status: Patch Available (was: In Progress)

> Ensure MOR Column Stats Index skips reading filegroups correctly
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
> Issue Type: Test
> Components: metadata, spark
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables
[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly
[ https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Vexler updated HUDI-6681:

Status: In Progress (was: Open)

> Ensure MOR Column Stats Index skips reading filegroups correctly
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
> Issue Type: Test
> Components: metadata, spark
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables
[jira] [Updated] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly
[ https://issues.apache.org/jira/browse/HUDI-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6681:

Labels: pull-request-available (was: )

> Ensure MOR Column Stats Index skips reading filegroups correctly
>
> Key: HUDI-6681
> URL: https://issues.apache.org/jira/browse/HUDI-6681
> Project: Apache Hudi
> Issue Type: Test
> Components: metadata, spark
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Write tests to ensure Column Stats Index functions as expected for MOR tables
[GitHub] [hudi] jonvex opened a new pull request, #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
jonvex opened a new pull request, #9422: URL: https://github.com/apache/hudi/pull/9422

### Change Logs

Create tests for MOR col stats index to ensure that filegroups are read as expected

### Impact

Verification that the feature works

### Risk level (write none, low medium or high below)

none

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-6681) Ensure MOR Column Stats Index skips reading filegroups correctly
Jonathan Vexler created HUDI-6681:

Summary: Ensure MOR Column Stats Index skips reading filegroups correctly
Key: HUDI-6681
URL: https://issues.apache.org/jira/browse/HUDI-6681
Project: Apache Hudi
Issue Type: Test
Components: metadata, spark
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler

Write tests to ensure Column Stats Index functions as expected for MOR tables
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1673969724 ## CI report: * 533117e9428e103df8d8d94dad393c1961df4152 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19251) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on pull request #8542: [HUDI-6123] Auto adjust lock configs only for single writer
yihua commented on PR #8542: URL: https://github.com/apache/hudi/pull/8542#issuecomment-1673934722 > For multiple streaming writers with no explicit lock provider set up, InProcessLockProvider should not be used. In this case, user should explicitly set the lock provider as mentioned in the [docs](https://hudi.apache.org/docs/metadata#deployment-model-c-multi-writer). Auto config adjustment does not intend to solve this problem. Also, we need to update the docs. This PR brings breaking changes to how configs work for the metadata table.
[GitHub] [hudi] lokesh-lingarajan-0310 commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
lokesh-lingarajan-0310 commented on code in PR #9421: URL: https://github.com/apache/hudi/pull/9421#discussion_r1290689371 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java: ## @@ -217,7 +217,7 @@ public static Pair> filterAndGenerateChe row = collectedRows.select(queryInfo.getOrderColumn(), queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy( col(queryInfo.getOrderColumn()).desc(), col(queryInfo.getKeyColumn()).desc()).first(); } -LOG.info("Processed batch size: " + row.getLong(2) + " bytes"); +LOG.info("Processed batch size: " + row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes"); Review Comment: we hit class cast exception in some cases where spark inferred this field as double
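To make the diff above concrete, here is a small self-contained sketch (not Spark code; the `Row` class below is a hypothetical stand-in for Spark SQL's `Row`, with `field_index`, `get`, and `get_long` mirroring `fieldIndex(String)`, `get(int)`, and `getLong(int)`). It shows why the hard-coded `row.getLong(2)` fails once Spark infers the cumulative column as double, while looking the column up by name and reading it generically does not:

```python
class Row:
    """Hypothetical stand-in for org.apache.spark.sql.Row."""

    def __init__(self, names, values):
        self._names = list(names)
        self._values = list(values)

    def field_index(self, name):
        # Analogue of Row.fieldIndex(String): resolve a column name to a position.
        return self._names.index(name)

    def get(self, i):
        # Analogue of Row.get(int): untyped access, works for any column type.
        return self._values[i]

    def get_long(self, i):
        # Analogue of Row.getLong(int): typed access that rejects non-integers,
        # much like the JVM's ClassCastException when the value is a Double.
        v = self._values[i]
        if not isinstance(v, int):
            raise TypeError(f"cannot read {type(v).__name__} value as long")
        return v


CUMULATIVE_COLUMN_NAME = "cumulativeSize"
# Suppose Spark inferred the cumulative column as double here, not long.
row = Row(["orderCol", "keyCol", CUMULATIVE_COLUMN_NAME], ["o1", "k1", 1024.0])

# Old code path: hard-coded position + typed accessor -> fails on double.
try:
    row.get_long(2)
except TypeError as e:
    print("old code path:", e)

# Fixed code path: resolve the position by name, read the value untyped.
size = row.get(row.field_index(CUMULATIVE_COLUMN_NAME))
print("Processed batch size:", size, "bytes")
```

The by-name lookup also keeps the log statement correct if the column order in the preceding `select` ever changes.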
[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
hudi-bot commented on PR #9421: URL: https://github.com/apache/hudi/pull/9421#issuecomment-1673926017 ## CI report: * 7cd01addabe76c50feb22f32c652a30be4902643 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19253) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
hudi-bot commented on PR #9419: URL: https://github.com/apache/hudi/pull/9419#issuecomment-1673925975 ## CI report: * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19252) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
hudi-bot commented on PR #9419: URL: https://github.com/apache/hudi/pull/9419#issuecomment-1673916580 ## CI report: * 060ce5fe9068a6b38382735d7aa60f3cd40c7e16 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
hudi-bot commented on PR #9421: URL: https://github.com/apache/hudi/pull/9421#issuecomment-1673916696 ## CI report: * 7cd01addabe76c50feb22f32c652a30be4902643 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on a diff in pull request #8111: [HUDI-5887] Should not mark the concurrency mode as OCC by default when MDT is enabled
yihua commented on code in PR #8111: URL: https://github.com/apache/hudi/pull/8111#discussion_r1290679214 ## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java: ## @@ -140,8 +140,10 @@ public void testAutoConcurrencyConfigAdjustmentWithTableServices(HoodieTableType put(ASYNC_CLEAN.key(), "false"); put(HoodieWriteConfig.AUTO_ADJUST_LOCK_CONFIGS.key(), "true"); } -}), true, true, true, WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL, -HoodieFailedWritesCleaningPolicy.LAZY, inProcessLockProviderClassName); +}), true, true, true, Review Comment: We should revert the changes in this PR to a degree that the auto-adjustment of the lock configs still works for single-writer with async table services. Right now, auto-adjustment of the lock configs does not work for Deltastreamer with async table services when the metadata table is enabled.
[GitHub] [hudi] hudi-bot commented on pull request #9417: Database not found exception when resolving Spark synchronization hive
hudi-bot commented on PR #9417: URL: https://github.com/apache/hudi/pull/9417#issuecomment-1673905253 ## CI report: * 331d018c7d8b69232742aaee3a16062f692226ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19249) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-6680) Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit
[ https://issues.apache.org/jira/browse/HUDI-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6680: Labels: pull-request-available (was: )
> Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit
> Key: HUDI-6680
> URL: https://issues.apache.org/jira/browse/HUDI-6680
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Lokesh Lingarajan
> Priority: Major
> Labels: pull-request-available
>
> Sometimes the Spark inference engine identifies the cumulative column as type double, which causes a ClassCastException when fetching it as Long. Reference - https://github.com/apache/hudi/blob/dcf466fa48c2d54e490255bcb27f58adba7c1583/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L220
[GitHub] [hudi] yihua commented on a diff in pull request #9421: [HUDI-6680] - Fixing the info log to fetch column value by name instead of index
yihua commented on code in PR #9421: URL: https://github.com/apache/hudi/pull/9421#discussion_r1290663391 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java: ## @@ -217,7 +217,7 @@ public static Pair> filterAndGenerateChe row = collectedRows.select(queryInfo.getOrderColumn(), queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy( col(queryInfo.getOrderColumn()).desc(), col(queryInfo.getKeyColumn()).desc()).first(); } -LOG.info("Processed batch size: " + row.getLong(2) + " bytes"); +LOG.info("Processed batch size: " + row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes"); Review Comment: I think the logic was correct before; just that we should not hard-code the column position.
[jira] [Created] (HUDI-6680) Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit
Lokesh Lingarajan created HUDI-6680: Summary: Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLimit Key: HUDI-6680 URL: https://issues.apache.org/jira/browse/HUDI-6680 Project: Apache Hudi Issue Type: Task Reporter: Lokesh Lingarajan Sometimes the Spark inference engine identifies the cumulative column as type double, which causes a ClassCastException when fetching it as Long. Reference - https://github.com/apache/hudi/blob/dcf466fa48c2d54e490255bcb27f58adba7c1583/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L220
[GitHub] [hudi] lokesh-lingarajan-0310 opened a new pull request, #9421: [9420] - Fixing the info log to fetch column value by name instead of index
lokesh-lingarajan-0310 opened a new pull request, #9421: URL: https://github.com/apache/hudi/pull/9421 ### Change Logs Fixing the log statement to fetch column value by name instead of index ### Impact low ### Risk level (write none, low medium or high below) low ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] CI passed
[GitHub] [hudi] lokesh-lingarajan-0310 opened a new issue, #9420: [SUPPORT] - Fixing the info log to fetch column value by name instead of index in function filterAndGenerateCheckpointBasedOnSourceLim
lokesh-lingarajan-0310 opened a new issue, #9420: URL: https://github.com/apache/hudi/issues/9420 Sometimes the Spark inference engine identifies the cumulative column as type double, which causes a ClassCastException when fetching it as Long.
[GitHub] [hudi] yihua commented on a diff in pull request #9409: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices
yihua commented on code in PR #9409: URL: https://github.com/apache/hudi/pull/9409#discussion_r1290641231 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -148,7 +148,7 @@ case class HoodieFileIndex(spark: SparkSession, override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = { val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, partitionFilters).map { case (partitionOpt, fileSlices) => -if (shouldBroadcast) { +if (shouldEmbedFileSlices) { Review Comment: A side question: can `shouldEmbedFileSlices` be `true` for legacy file format as well?
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6679: Labels: pull-request-available (was: )
> Fix initialization of metadata table partitions upon failure
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.14.0
>
> When both files and record_index partitions are enabled, for the first commit in the data table, the transaction fails when initializing the second partition in the MDT, leaving the timelines as below. When the pipeline is restarted, the rollback triggers irrelevant bootstrap rollback logic, corrupting the MDT and not properly re-initializing the record_index partition.
> DT
> {code:java}
> .commit.requested
> .commit.inflight {code}
> MDT
> {code:java}
> 00010.deltacommit.requested
> 00010.deltacommit.inflight
> 00010.deltacommit
> 00011.deltacommit.requested
> 00011.deltacommit.inflight{code}
> Afterwards
> {code:java}
> No. | Instant              | Action (notes)                                 | State     | Requested      | Inflight       | Completed      | MT Action   | MT State  | MT Requested   | MT Inflight    | MT Completed
> 0   | 20230807063905364    | rollback (rolls back 20230807063647472)       | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38 | -           | -         | -              | -              | -
> 1   | 20230807063905364010 | -                                              | -         | -              | -              | -              | deltacommit | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
> 2   | 20230807064006967    | deltacommit (rolled back by 20230807064227290) | REQUESTED | 08-06 23:40:39 | -              | -              | -           | -         | -              | -              | -
> 3   | 20230807064041714    | -                                              | -         | -              | -              | -              | restore     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
> 4   | 20230807064227290    | rollback (rolls back 20230807064006967)        | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | -              | -           | -         | -              | -              | - {code}
[GitHub] [hudi] yihua opened a new pull request, #9419: [HUDI-6679] Fix initialization of metadata table partitions upon failure
yihua opened a new pull request, #9419: URL: https://github.com/apache/hudi/pull/9419 ### Change Logs This PR fixes initialization of metadata table partitions upon failure: - In `BaseHoodieTableServiceClient.rollbackFailedWrites`, the fix avoids bootstrap rollback logic for the MDT, as the MDT is never a bootstrap table and such logic can be accidentally triggered, since the MDT initial commits, e.g., `00010`, `00011`, are smaller than `FULL_BOOTSTRAP_INSTANT_TS` (`02`). - In `HoodieBackedTableMetadataWriter.initializeIfNeeded`, when async metadata indexing is disabled, if a partition is inflight, it means that the partition is not fully initialized, so the initialization should be triggered again. This scenario fails before the fix: when both files and record_index partitions are enabled, for the first commit in the data table, the transaction fails when initializing the second partition in the MDT, leaving the timelines as below. When the pipeline is restarted, the rollback triggers irrelevant bootstrap rollback logic, corrupting the MDT and not properly re-initializing the record_index partition. DT
```
.commit.requested
.commit.inflight
```
MDT
```
00010.deltacommit.requested
00010.deltacommit.inflight
00010.deltacommit
00011.deltacommit.requested
00011.deltacommit.inflight
```
Afterwards, `00010` is rolled back and bootstrap rollback logic adding restore kicks in, which is unexpected:
```
No. | Instant              | Action (notes)                                 | State     | Requested      | Inflight       | Completed      | MT Action   | MT State  | MT Requested   | MT Inflight    | MT Completed
0   | 20230807063905364    | rollback (rolls back 20230807063647472)        | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38 | -           | -         | -              | -              | -
1   | 20230807063905364010 | -                                              | -         | -              | -              | -              | deltacommit | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
2   | 20230807064006967    | deltacommit (rolled back by 20230807064227290) | REQUESTED | 08-06 23:40:39 | -              | -              | -           | -         | -              | -              | -
3   | 20230807064041714    | -                                              | -         | -              | -              | -              | restore     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
4   | 20230807064227290    | rollback (rolls back 20230807064006967)        | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | -              | -           | -         | -              | -              | -
```
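The first fix above hinges on how Hudi compares instant times: as strings, lexicographically. A minimal sketch of the misclassification (the shortened values `00010`, `00011`, and `02` are taken directly from the PR description; the padding in the actual codebase may differ):

```python
# Instant times are compared as strings, so a suffixed MDT init instant such as
# "00010" sorts BELOW a short sentinel like "02", even though 10 > 2 numerically.
FULL_BOOTSTRAP_INSTANT_TS = "02"           # sentinel named in the PR description
mdt_initial_instants = ["00010", "00011"]  # MDT partition-initialization commits

for instant in mdt_initial_instants:
    # Lexicographic comparison: '0' < '2' at index 1 decides the result.
    assert instant < FULL_BOOTSTRAP_INSTANT_TS
    # Numerically the opposite holds, which is why a naive string check can
    # misclassify these instants as bootstrap-era and trigger bootstrap
    # rollback logic for a table that was never bootstrapped.
    assert int(instant) > int(FULL_BOOTSTRAP_INSTANT_TS)
print("MDT init instants compare below FULL_BOOTSTRAP_INSTANT_TS")
```

This is why the fix skips the bootstrap rollback path for the MDT entirely rather than relying on the timestamp comparison.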
[GitHub] [hudi] hudi-bot commented on pull request #9403: [MINOR] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
hudi-bot commented on PR #9403: URL: https://github.com/apache/hudi/pull/9403#issuecomment-1673822139 ## CI report: * b5846de9f43070cf38acd5bd90ae990cad1c2999 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19247) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6679: Description: When both files and record_index partitions are enabled, for the first commit in the data table, the transaction fails when initializing the second partition in the MDT, leaving the timelines as below. When the pipeline is restarted, the rollback triggers irrelevant bootstrap rollback logic, corrupting the MDT and not properly re-initializing the record_index partition.
DT
{code:java}
.commit.requested
.commit.inflight {code}
MDT
{code:java}
00010.deltacommit.requested
00010.deltacommit.inflight
00010.deltacommit
00011.deltacommit.requested
00011.deltacommit.inflight{code}
Afterwards
{code:java}
No. | Instant              | Action (notes)                                 | State     | Requested      | Inflight       | Completed      | MT Action   | MT State  | MT Requested   | MT Inflight    | MT Completed
0   | 20230807063905364    | rollback (rolls back 20230807063647472)        | COMPLETED | 08-06 23:39:06 | 08-06 23:39:07 | 08-06 23:40:38 | -           | -         | -              | -              | -
1   | 20230807063905364010 | -                                              | -         | -              | -              | -              | deltacommit | COMPLETED | 08-06 23:40:49 | 08-06 23:40:49 | 08-06 23:40:51
2   | 20230807064006967    | deltacommit (rolled back by 20230807064227290) | REQUESTED | 08-06 23:40:39 | -              | -              | -           | -         | -              | -              | -
3   | 20230807064041714    | -                                              | -         | -              | -              | -              | restore     | COMPLETED | 08-06 23:40:43 | 08-06 23:40:43 | 08-06 23:40:48
4   | 20230807064227290    | rollback (rolls back 20230807064006967)        | INFLIGHT  | 08-06 23:42:28 | 08-06 23:42:29 | -              | -           | -         | -              | -              | - {code}
{code:java}
org.apache.hudi.exception.HoodieR
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6679: Description: When both files and record_index partitions are enabled, for the first commit in the data table, the transaction fails when initializing the second partition in the MDT, leaving the timelines as below. When the pipeline is restarted, the rollback triggers irrelevant bootstrap rollback logic, causing the MDT to be corrupted.
DT
{code:java}
.commit.requested
.commit.inflight {code}
MDT
{code:java}
00010.deltacommit.requested
00010.deltacommit.inflight
00010.deltacommit
00011.deltacommit.requested
00011.deltacommit.inflight{code}
{code:java}
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3a:///hoodie_table commits 20230807064006967
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
  at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
  at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
  at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
  at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
  at org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
  at org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
  at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
  at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
  at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be >0
  at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
  at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
  at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
  at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
  at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
  at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
  at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
  at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
  at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
  at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
  at org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
  at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
  ... 16 more {code}
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6679: Priority: Blocker (was: Major)
> Fix initialization of metadata table partitions upon failure
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Blocker
>
> {code:java}
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3a:///hoodie_table commits 20230807064006967
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
>   at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
>   at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
>   at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
>   at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
>   at org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
>   at org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
>   at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
>   at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be >0
>   at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
>   at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
>   at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
>   at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
>   at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
>   at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
>   at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
>   at org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
>   at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
>   ... 16 more {code}
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6679:
    Fix Version/s: 0.14.0

> Fix initialization of metadata table partitions upon failure
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Blocker
> Fix For: 0.14.0

{code:java}
org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3a:///hoodie_table commits 20230807064006967
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:918)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:865)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:739)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:723)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:718)
	at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:928)
	at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
	at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:927)
	at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:920)
	at org.apache.hudi.utilities.streamer.StreamSync.startCommit(StreamSync.java:890)
	at org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:767)
	at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:445)
	at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:767)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be >0
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1098)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commitInternal(SparkHoodieBackedTableMetadataWriter.java:135)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:122)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:837)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:1013)
	at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
	at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77)
	at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:264)
	at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:120)
	at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
	at org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218)
	at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:901)
	... 16 more
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
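The failure originates in a precondition check: `ValidationUtils.checkArgument` throws `IllegalArgumentException` when the metadata table (MDT) partition reports zero file groups, which can happen if partition initialization did not complete before the rollback ran. A minimal sketch of that guard pattern, assuming only the `checkArgument` semantics visible in the trace (the `FileGroupCountCheck` class and `prepRecords` signature here are illustrative, not Hudi's actual source):

```java
// Illustrative sketch: a checkArgument-style guard that reproduces the
// "FileGroup count for MDT partition ... should be >0" failure mode.
public class FileGroupCountCheck {

    // Mirrors ValidationUtils.checkArgument semantics: throw on a false condition.
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Hypothetical stand-in for the prepRecords step in the trace: a partition
    // that was never fully initialized can report a file group count of 0.
    static void prepRecords(int fileGroupCount, String partitionName) {
        checkArgument(fileGroupCount > 0,
            "FileGroup count for MDT partition " + partitionName + " should be >0");
    }

    public static void main(String[] args) {
        prepRecords(3, "files"); // an initialized partition passes the guard
        try {
            prepRecords(0, "files"); // an uninitialized partition trips it
        } catch (IllegalArgumentException e) {
            // prints: caught: FileGroup count for MDT partition files should be >0
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The point of the guard is to fail fast rather than write metadata records into a partition with no file groups to receive them; the bug tracked here is that the guard fires during rollback after a partially failed initialization.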
[jira] [Assigned] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6679:
    Assignee: Ethan Guo

> Fix initialization of metadata table partitions upon failure
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.14.0
[jira] [Updated] (HUDI-6679) Fix initialization of metadata table partitions upon failure
[ https://issues.apache.org/jira/browse/HUDI-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6679:
    Description: set to the HoodieRollbackException stack trace (caused by "java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be >0")

> Fix initialization of metadata table partitions upon failure
>
> Key: HUDI-6679
> URL: https://issues.apache.org/jira/browse/HUDI-6679
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
[jira] [Created] (HUDI-6679) Fix initialization of metadata table partitions upon failure
Ethan Guo created HUDI-6679:
    Summary: Fix initialization of metadata table partitions upon failure
    Key: HUDI-6679
    URL: https://issues.apache.org/jira/browse/HUDI-6679
    Project: Apache Hudi
    Issue Type: Bug
    Reporter: Ethan Guo