[GitHub] [hudi] hudi-bot commented on pull request #9362: [HUDI-6644] Flink append mode use auto key generator
hudi-bot commented on PR #9362: URL: https://github.com/apache/hudi/pull/9362#issuecomment-1665035180

## CI report:

* be8d22a885fedffd0baa991470d0e04862b3c380 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19060)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1665035137

## CI report:

* 7308946c4344ca04736b2c83b505d3a159146541 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19050)
[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1665035080

## CI report:

* 5c1391571fbed3ce391399b7848f33b629455941 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
* c24825896bd7893fe303f240b523bd20ae1e7567 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19059)
[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
hudi-bot commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1665034674

## CI report:

* 884a71af797b71a7f5818472884e45f39f758328 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19049)
[GitHub] [hudi] hudi-bot commented on pull request #9362: [HUDI-6644] Flink append mode use auto key generator
hudi-bot commented on PR #9362: URL: https://github.com/apache/hudi/pull/9362#issuecomment-1664960288

## CI report:

* be8d22a885fedffd0baa991470d0e04862b3c380 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0
hudi-bot commented on PR #9360: URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664960255

## CI report:

* 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
* c275941c84095c11d2290d3d9f24f33c91938844 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19058)
[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664960199

## CI report:

* 5c1391571fbed3ce391399b7848f33b629455941 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
* c24825896bd7893fe303f240b523bd20ae1e7567 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0
hudi-bot commented on PR #9360: URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664955794

## CI report:

* 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
* c275941c84095c11d2290d3d9f24f33c91938844 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664955745

## CI report:

* 5c1391571fbed3ce391399b7848f33b629455941 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
[jira] [Updated] (HUDI-6644) Flink append mode use auto key generator
[ https://issues.apache.org/jira/browse/HUDI-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6644:
    Labels: pull-request-available  (was: )

> Flink append mode use auto key generator
> Key: HUDI-6644
> URL: https://issues.apache.org/jira/browse/HUDI-6644
> Project: Apache Hudi
> Issue Type: Task
> Reporter: HBG
> Priority: Major
> Labels: pull-request-available

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hbgstc123 opened a new pull request, #9362: [HUDI-6644] Flink append mode use auto key generator
hbgstc123 opened a new pull request, #9362: URL: https://github.com/apache/hudi/pull/9362

### Change Logs

Support using the auto key generator in Flink append mode when the user does not provide a primary key, to align with Spark. Add a new class, AutoRowDataKeyGen, to be used in AppendWriteFunction to support the pk-less case.

### Impact

none

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
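The pk-less flow the PR describes can be sketched as follows. `AutoRowDataKeyGen` here is only a toy stand-in for the class named above, and the key format (commit instant, task partition id, per-task row counter) is an assumption for illustration, not Hudi's actual scheme:

```python
# Hedged sketch of an auto key generator for pk-less append writes: each record
# gets a synthetic, unique key derived from the commit instant, the writing
# task's partition id, and a per-task row counter. Names and format are
# illustrative only.
class AutoRowDataKeyGen:
    def __init__(self, instant_time: str, task_partition_id: int):
        self.instant_time = instant_time
        self.task_partition_id = task_partition_id
        self.row_id = 0

    def next_key(self) -> str:
        # Combining the three components guarantees uniqueness across tasks
        # and across rows within a task, without any user-supplied primary key.
        key = f"{self.instant_time}_{self.task_partition_id}_{self.row_id}"
        self.row_id += 1
        return key

gen = AutoRowDataKeyGen("20230804120000", 3)
keys = [gen.next_key() for _ in range(2)]
```

The point is only that append mode can synthesize record keys deterministically per task, so no shuffle or index lookup is needed.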
[GitHub] [hudi] kepplertreet opened a new issue, #9361: [SUPPORT] Hudi Merge On Read Tables don't write Delta Log Files
kepplertreet opened a new issue, #9361: URL: https://github.com/apache/hudi/issues/9361

Hi. I'm using a Spark Structured Streaming application running on EMR-6.11.0 to write into a Hudi MOR table.

Hudi Version : 0.13.0
Spark Version : 3.3.2

```
'hoodie.table.name': ,
'hoodie.datasource.write.recordkey.field': ,
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.table.type': "MERGE_ON_READ",
'hoodie.datasource.write.partitionpath.field': ,
'hoodie.datasource.write.table.name': ,
'hoodie.datasource.write.precombine.field': ,
"hoodie.table.version": 5,
"hoodie.datasource.write.commitmeta.key.prefix": "_",
"hoodie.datasource.write.hive_style_partitioning": 'true',
"hoodie.datasource.meta.sync.enable": 'false',
"hoodie.datasource.hive_sync.enable": 'true',
"hoodie.datasource.hive_sync.auto_create_database": 'true',
"hoodie.datasource.hive_sync.skip_ro_suffix": 'true',
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.parquet.small.file.limit": 125217728,
"hoodie.parquet.max.file.size": 134217728,
# Compaction Configs
"hoodie.compact.inline": "false",
"hoodie.compact.schedule.inline": "false",
"hoodie.datasource.compaction.async.enable": "true",
"hoodie.compact.inline.trigger.strategy": "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits": 3,
# --- Cleaner Configs
"hoodie.clean.automatic": 'true',
"hoodie.clean.async": 'true',
"hoodie.cleaner.policy.failed.writes": "LAZY",
"hoodie.clean.trigger.strategy": "NUM_COMMITS",
"hoodie.clean.max.commits": 7,
"hoodie.cleaner.commits.retained": 3,
"hoodie.cleaner.fileversions.retained": 1,
"hoodie.cleaner.hours.retained": 1,
"hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
"hoodie.parquet.compression.codec": "snappy",
"hoodie.embed.timeline.server": 'true',
"hoodie.embed.timeline.server.async": 'false',
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
"hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
"hoodie.index.type": "BLOOM",
"hoodie.datasource.write.streaming.checkpoint.identifier": ,
# Metadata Configs
"hoodie.metadata.enable": 'true',
"hoodie.bloom.index.use.metadata": 'true',
"hoodie.metadata.index.async": 'false',
"hoodie.metadata.clean.async": 'true',
"hoodie.metadata.index.bloom.filter.enable": 'true',
"hoodie.metadata.index.column.stats.enable": 'true',
"hoodie.metadata.index.bloom.filter.column.list": ,
"hoodie.metadata.index.column.stats.column.list": ,
"hoodie.metadata.metrics.enable": 'true',
"hoodie.keep.max.commits": 50,
"hoodie.archive.async": 'true',
"hoodie.archive.merge.enable": 'false',
"hoodie.archive.beyond.savepoint": 'true',
"hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
"hoodie.cleaner.hours.retained": 1
```

Issues faced: As the configs show, we have OCC and the metadata table enabled. My only concern for now is that I never see log files being written into the main table, and hence a compaction is never scheduled nor triggered for it, i.e. all incoming data is written directly into parquet files, whereas the metadata timeline shows scheduling and execution of compaction, and hence a **commit** is reflected in the timeline.

Is this normal, expected behaviour? Is Hudi internally weighing the cost of writing log files and then compacting them against the cost of writing the data directly to parquet, and choosing whichever turns out less expensive? Is there some defined threshold for ingress batches beyond which Hudi writes data into log files?

Thanks
[GitHub] [hudi] mansipp commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
mansipp commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664944883

@hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0
hudi-bot commented on PR #9360: URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664931647

## CI report:

* 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible
hudi-bot commented on PR #9337: URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664931559

## CI report:

* 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
* dbd715a0c9d98e2bbb69823025a6e314338bc17a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19055)
[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] Fix hudi-integ-test-bundle dependency on jackson jsk310 package
hudi-bot commented on PR #9324: URL: https://github.com/apache/hudi/pull/9324#issuecomment-1664931493

## CI report:

* 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
* 5f4efeb0992b1fcce899787ddd1182850d51698d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19054)
[GitHub] [hudi] stream2000 commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer
stream2000 commented on code in PR #9199: URL: https://github.com/apache/hudi/pull/9199#discussion_r1283899831

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java:

@@ -65,7 +71,6 @@ public void write(InternalRow row) throws IOException {
     int bucketId = BucketIdentifier.getBucketId(String.valueOf(recordKey), indexKeyFields, bucketNum);
     Pair<String, Integer> fileId = Pair.of(partitionPath, bucketId);
     if (lastFileId == null || !lastFileId.equals(fileId)) {
-      LOG.info("Creating new file for partition path " + partitionPath);

Review Comment: We will not always create a new file handle if one has already been created for this partition, so I deleted this log line. I will add it back in `getBucketRowCreateHandle`, where we do need to create a new handle.

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkConsistentBucketClusteringExecutionStrategy.java:

@@ -79,15 +94,19 @@ public HoodieData performClusteringWithRecordsRDD(HoodieData
     partitioner = new RDDConsistentBucketBulkInsertPartitioner<>(getHoodieTable(), strategyParams, preserveHoodieMetadata);
+    addHashingChildNodes(partitioner, extraMetadata);
+
+    return (HoodieData) SparkBulkInsertHelper.newInstance()
+        .bulkInsert(inputRecords, instantTime, getHoodieTable(), newConfig, false, partitioner, true, numOutputGroups);
+  }
+
+  private void addHashingChildNodes(ConsistentHashingBucketInsertPartitioner partitioner, Map<String, String> extraMetadata) {
     try {
       List<ConsistentHashingNode> nodes = ConsistentHashingNode.fromJsonString(extraMetadata.get(BaseConsistentHashingBucketClusteringPlanStrategy.METADATA_CHILD_NODE_KEY));

Review Comment: Will check that the nodes list is non-empty in `addHashingChildrenNodes`.

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:

@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
+import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils;
+import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
+import org.apache.hudi.keygen.BuiltinKeyGenerator;
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.ConsistentHashingBucketInsertPartitioner;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Tuple2;
+
+/**
+ * Bulk_insert partitioner of Spark row using consistent hashing bucket index.
+ */
+public class ConsistentBucketIndexBulkInsertPartitionerWithRows
+    implements BulkInsertPartitioner<Dataset<Row>>, ConsistentHashingBucketInsertPartitioner {
+
+  private final HoodieTable table;
+
+  private final String indexKeyFields;
+
+  private final List<String> fileIdPfxList = new ArrayList<>();
+  private final Map<String, List<ConsistentHashingNode>> hashingChildrenNodes;
+
+  private Map<String, ConsistentBucketIdentifier> partitionToIdentifier;
+
+  private final Option<BuiltinKeyGenerator> keyGeneratorOpt;
+
+  private Map<String, Map<String, Integer>> partitionToFileIdPfxIdxMap;
+
+  private final RowRecordKeyExtractor extractor;
+
+  public ConsistentBucketIndexBulkInsertPartitionerWithRows(HoodieTable table, boolean populateMetaFields) {
+    this.indexKeyFields = table.getConfig().getBucketIndexHashField();
+    this.table = table;
+    this.hashingChildrenNodes = new HashMap<>();
+    if (!populateMetaFields) {
+      this.keyGeneratorOpt =
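For context on what the consistent-hashing bucket partitioner above is doing, here is a minimal Python sketch of the ring lookup; it is not Hudi's implementation, and the MD5 hash, node ring values, and `fg-*` file-group names are illustrative assumptions:

```python
import hashlib
from bisect import bisect_left

# Minimal consistent-hashing bucket lookup: hash the record key onto a ring and
# route it to the first node clockwise from the hash, wrapping to the first node.
class ConsistentBucketRing:
    def __init__(self, nodes):
        # nodes: list of (ring_value, file_id_prefix), kept sorted by ring_value
        self.nodes = sorted(nodes)
        self.values = [v for v, _ in self.nodes]

    @staticmethod
    def _hash(key: str, modulus: int = 0xFFFFFFFF) -> int:
        # Stable hash of the record key onto the ring's value space.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % modulus

    def bucket_for(self, record_key: str) -> str:
        h = self._hash(record_key)
        idx = bisect_left(self.values, h)   # first node at or after the hash
        if idx == len(self.nodes):          # past the last node: wrap around
            idx = 0
        return self.nodes[idx][1]

ring = ConsistentBucketRing([(1000, "fg-0"), (2_000_000_000, "fg-1"), (4_000_000_000, "fg-2")])
bucket = ring.bucket_for("record-42")
```

The appeal over plain modulo bucketing is that splitting or merging a node (the "hashing children nodes" in the diff) only remaps keys adjacent to that node on the ring, rather than reshuffling every bucket.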
[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0
hudi-bot commented on PR #9360: URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664927344

## CI report:

* 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] Fix hudi-integ-test-bundle dependency on jackson jsk310 package
hudi-bot commented on PR #9324: URL: https://github.com/apache/hudi/pull/9324#issuecomment-1664927190

## CI report:

* 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
* 5f4efeb0992b1fcce899787ddd1182850d51698d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
hudi-bot commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664927305

## CI report:

* 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19021)
[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible
hudi-bot commented on PR #9337: URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664927277

## CI report:

* 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
* dbd715a0c9d98e2bbb69823025a6e314338bc17a UNKNOWN
[GitHub] [hudi] xushiyan commented on a diff in pull request #9278: [HUDI-6312] Rename enum values of `HollowCommitHandling`
xushiyan commented on code in PR #9278: URL: https://github.com/apache/hudi/pull/9278#discussion_r1283941770

## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:

@@ -78,18 +78,18 @@ public class HoodieCommonConfig extends HoodieConfig {
   public static final ConfigProperty INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT = ConfigProperty
       .key("hoodie.datasource.read.handle.hollow.commit")
-      .defaultValue(HollowCommitHandling.EXCEPTION.name())
+      .defaultValue(HollowCommitHandling.FAIL.name())

Review Comment: @nsivabalan is there a consensus on the name? I'll update it quickly, then we can merge this.
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664922934

## CI report:

* 7e7efc78003b0e8ef5c2d809276796b7b987a35b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19047)
[GitHub] [hudi] xushiyan commented on a diff in pull request #9278: [HUDI-6312] Rename enum values of `HollowCommitHandling`
xushiyan commented on code in PR #9278: URL: https://github.com/apache/hudi/pull/9278#discussion_r1283941125

## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:

@@ -78,18 +78,18 @@ public class HoodieCommonConfig extends HoodieConfig {
   public static final ConfigProperty INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT = ConfigProperty
       .key("hoodie.datasource.read.handle.hollow.commit")
-      .defaultValue(HollowCommitHandling.EXCEPTION.name())
+      .defaultValue(HollowCommitHandling.FAIL.name())
       .sinceVersion("0.14.0")
       .markAdvanced()
       .withValidValues(enumNames(HollowCommitHandling.class))
       .withDocumentation("When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest)"

Review Comment: The time-travel-query-related change is in a follow-up PR; this one is meant only for renaming.
[jira] [Created] (HUDI-6644) Flink append mode use auto key generator
HBG created HUDI-6644:
    Summary: Flink append mode use auto key generator
    Key: HUDI-6644
    URL: https://issues.apache.org/jira/browse/HUDI-6644
    Project: Apache Hudi
    Issue Type: Task
    Reporter: HBG
[GitHub] [hudi] xuzifu666 commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
xuzifu666 commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664917538

> @xuzifu666 why close this?

Sorry, I wanted to re-run CI. @xushiyan
[GitHub] [hudi] xushiyan commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
xushiyan commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664916791

@xuzifu666 why close this?
[GitHub] [hudi] wecharyu commented on a diff in pull request #9343: Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"
wecharyu commented on code in PR #9343: URL: https://github.com/apache/hudi/pull/9343#discussion_r1283930796 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -168,57 +167,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. 
- List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path, p -> { - try { -return fileSystem.isDirectory(p) && !p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME); - } catch (IOException e) { -// noop - } - return false; -})).map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. 
-Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. +// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { Review Comment: Got it. Let me investigate why the latency happened.
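The listing strategy under discussion — a breadth-first pass over the table's directory tree, treating any directory that carries partition metadata as a leaf partition and queueing all other subdirectories for the next round — can be sketched with plain `java.nio`. The marker file name and helper class below are illustrative stand-ins, not Hudi's actual implementation (Hudi checks `.hoodie_partition_metadata`, parallelizes each level via `engineContext`, and explicitly skips the `.hoodie` metafolder):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.stream.Stream;

public class PartitionLister {
    // Illustrative stand-in for Hudi's .hoodie_partition_metadata marker file.
    static final String MARKER = ".partition_metadata";

    // BFS: a directory containing the marker is a leaf partition; any other
    // directory has its child directories queued for the next listing round.
    // (A metafolder without the marker simply yields no partitions here.)
    static List<String> listPartitions(Path base) throws IOException {
        List<String> partitions = new ArrayList<>();
        Deque<Path> toList = new ArrayDeque<>();
        toList.add(base);
        while (!toList.isEmpty()) {
            Path dir = toList.poll();
            if (Files.exists(dir.resolve(MARKER))) {
                partitions.add(base.relativize(dir).toString());
            } else {
                try (Stream<Path> children = Files.list(dir)) {
                    children.filter(Files::isDirectory).forEach(toList::add);
                }
            }
        }
        return partitions;
    }
}
```

In the real code each level of this traversal is a distributed `flatMap`/`map` over the engine context, which is why the parallelism is re-derived from the queue size on every round.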
[GitHub] [hudi] eric9204 opened a new pull request, #9360: [MINOR] Upgrade thrift's version to 0.13.0
eric9204 opened a new pull request, #9360: URL: https://github.com/apache/hudi/pull/9360 ### Change Logs Upgrade thrift's version from 0.12.0 to 0.13.0. There is an issue in thrift-0.12.0: https://issues.apache.org/jira/browse/THRIFT-4805 It causes the following problem: ![error](https://github.com/apache/hudi/assets/90449228/3e6fa891-4bc5-4385-b5a7-d4fed4a26131) This error message has nothing to do with the program itself, but if a program is connected to the Hoodie metaserver, this error log message is output continuously. ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update None - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
hudi-bot commented on PR #9357: URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664898540 ## CI report: * d8e159b823f516f584802bd3dacdaa782f185854 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038) * c5e0ade2fbd9a9f5d0d7336c138de99b697b07cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19053) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
hudi-bot commented on PR #9357: URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664893935 ## CI report: * d8e159b823f516f584802bd3dacdaa782f185854 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038) * c5e0ade2fbd9a9f5d0d7336c138de99b697b07cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
hudi-bot commented on PR #9330: URL: https://github.com/apache/hudi/pull/9330#issuecomment-1664893863 ## CI report: * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN * d6d32a693c455830a31b883915e9940fa309c77f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026) * df00f708b0143c2a3ce5dec86ebcd69c26b91a1b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19051) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9308: [HUDI-6606] Use record level index with SQL equality queries
nsivabalan commented on code in PR #9308: URL: https://github.com/apache/hudi/pull/9308#discussion_r1283913819 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndexWithSQL.scala: ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions} +import org.apache.spark.sql.SaveMode +import org.junit.jupiter.api.Tag +import org.junit.jupiter.params.ParameterizedTest +import org.junit.jupiter.params.provider.ValueSource + +@Tag("functional") +class TestRecordLevelIndexWithSQL extends RecordLevelIndexTestBase { + val sqlTempTable = "tbl" + + @ParameterizedTest + @ValueSource(strings = Array("COPY_ON_WRITE")) + def testRLIWithSQL(tableType: String): Unit = { +var hudiOpts = commonOpts +hudiOpts = hudiOpts + ( + DataSourceWriteOptions.TABLE_TYPE.key -> tableType, + DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true") + +doWriteAndValidateDataAndRecordIndex(hudiOpts, + operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Overwrite, + validate = false) +doWriteAndValidateDataAndRecordIndex(hudiOpts, + operation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Append, + validate = false) + +createTempTable(hudiOpts) +val reckey = mergedDfList.last.limit(1).collect()(0).getAs("_row_key").toString +spark.sql("select * from " + sqlTempTable + " where '" + reckey + "' = _row_key").show(false) Review Comment: This test does not catch anything. Can you write tests directly against RLI support? Let's also think about how we can write an end-to-end test and validate that RLI is invoked.
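The ask above — verifying end-to-end that the record-level index is actually consulted, not just that the query returns the right rows — is usually done with a counting test double around the index lookup. The interface and class names below are illustrative, not Hudi's real API:

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class RliInvocationCheck {
    // Illustrative lookup interface standing in for a record-level index.
    interface RecordIndex {
        Set<String> filesFor(String recordKey);
    }

    // Spy wrapper that counts lookups so a test can assert the index was used.
    static class CountingIndex implements RecordIndex {
        final RecordIndex delegate;
        final AtomicInteger lookups = new AtomicInteger();

        CountingIndex(RecordIndex delegate) {
            this.delegate = delegate;
        }

        @Override
        public Set<String> filesFor(String key) {
            lookups.incrementAndGet();
            return delegate.filesFor(key);
        }
    }

    // Query-planning path under test: for a key-equality filter it should
    // consult the record-level index exactly once.
    static Set<String> planQuery(CountingIndex index, String recordKey) {
        return index.filesFor(recordKey);
    }
}
```

A test then asserts both the returned candidate files and that `lookups` advanced, which catches the "query works but the index was silently bypassed" failure mode the plain `show(false)` call cannot.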
[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
hudi-bot commented on PR #9330: URL: https://github.com/apache/hudi/pull/9330#issuecomment-1664888349 ## CI report: * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN * d6d32a693c455830a31b883915e9940fa309c77f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026) * df00f708b0143c2a3ce5dec86ebcd69c26b91a1b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9308: [HUDI-6606] Use record level index with SQL equality queries
nsivabalan commented on code in PR #9308: URL: https://github.com/apache/hudi/pull/9308#discussion_r1283907863 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -223,10 +225,13 @@ case class HoodieFileIndex(spark: SparkSession, // nothing CSI in particular could be applied for) lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema) -if (!isMetadataTableEnabled || !isDataSkippingEnabled) { validateConfig() Option.empty -} else if (queryFilters.isEmpty || queryReferencedColumns.isEmpty) { +} else if (recordLevelIndex.isIndexApplicable(queryFilters)) { + Option.apply(recordLevelIndex.getCandidateFiles(allFiles, queryFilters)) +} else if (!columnStatsIndex.isIndexAvailable || queryFilters.isEmpty || queryReferencedColumns.isEmpty) { + validateConfig() Review Comment: Let's at least fix the filtering so that record-key predicates are looked up in RLI while predicates on other columns are routed to column stats, and then combine both results. Currently, this seems very limited.
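The routing nsivabalan suggests — record-key predicates served by the record-level index, other predicates by column stats, with the candidate file sets intersected — can be sketched as follows. The predicate model, the `_row_key` routing rule, and the `Index` interface are illustrative assumptions, not Hudi's actual types:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class CandidateFilePruner {
    // Illustrative equality predicate: column name = literal value.
    static final class Predicate {
        final String column;
        final Object value;

        Predicate(String column, Object value) {
            this.column = column;
            this.value = value;
        }
    }

    // An index returns the files that may contain matches, or empty()
    // when it cannot serve the predicate at all (then nothing is pruned).
    interface Index {
        Optional<Set<String>> candidateFiles(Predicate p);
    }

    // Route record-key predicates to the record-level index and everything
    // else to column stats, then intersect all resulting candidate sets.
    static Set<String> prune(Set<String> allFiles, List<Predicate> predicates,
                             Index recordLevelIndex, Index columnStatsIndex) {
        Set<String> candidates = new HashSet<>(allFiles);
        for (Predicate p : predicates) {
            Index index = "_row_key".equals(p.column) ? recordLevelIndex : columnStatsIndex;
            index.candidateFiles(p).ifPresent(candidates::retainAll);
        }
        return candidates;
    }
}
```

Intersecting per-predicate candidate sets is safe for conjunctive filters because each index only ever over-approximates the matching files; a predicate neither index can serve simply leaves the candidate set untouched.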
[jira] [Assigned] (HUDI-6640) Non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen reassigned HUDI-6640: Assignee: Jing Zhang > Non-blocking concurrency control > > > Key: HUDI-6640 > URL: https://issues.apache.org/jira/browse/HUDI-6640 > Project: Apache Hudi > Issue Type: Epic > Components: core >Reporter: Danny Chen >Assignee: Jing Zhang >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6643) Make the compaction non-serial (plan schedule and execution)
Danny Chen created HUDI-6643: Summary: Make the compaction non-serial (plan schedule and execution) Key: HUDI-6643 URL: https://issues.apache.org/jira/browse/HUDI-6643 Project: Apache Hudi Issue Type: Sub-task Components: core Reporter: Danny Chen Fix For: 1.0.0
[jira] [Updated] (HUDI-6480) Flink lockless multi-writer
[ https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6480: - Parent: HUDI-6640 Issue Type: Sub-task (was: New Feature) > Flink lockless multi-writer > --- > > Key: HUDI-6480 > URL: https://issues.apache.org/jira/browse/HUDI-6480 > Project: Apache Hudi > Issue Type: Sub-task > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > >
[jira] [Created] (HUDI-6642) Use completion time for file slicing
Danny Chen created HUDI-6642: Summary: Use completion time for file slicing Key: HUDI-6642 URL: https://issues.apache.org/jira/browse/HUDI-6642 Project: Apache Hudi Issue Type: Sub-task Components: core Reporter: Danny Chen Fix For: 1.0.0
[jira] [Created] (HUDI-6641) Remove the log append and always uses the current instant time in file name
Danny Chen created HUDI-6641: Summary: Remove the log append and always uses the current instant time in file name Key: HUDI-6641 URL: https://issues.apache.org/jira/browse/HUDI-6641 Project: Apache Hudi Issue Type: Sub-task Components: core Reporter: Danny Chen Fix For: 1.0.0
[jira] [Created] (HUDI-6640) Non-blocking concurrency control
Danny Chen created HUDI-6640: Summary: Non-blocking concurrency control Key: HUDI-6640 URL: https://issues.apache.org/jira/browse/HUDI-6640 Project: Apache Hudi Issue Type: Epic Components: core Reporter: Danny Chen Fix For: 1.0.0
[GitHub] [hudi] voonhous commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
voonhous commented on code in PR #9357: URL: https://github.com/apache/hudi/pull/9357#discussion_r1283900588 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java: ## @@ -448,7 +448,8 @@ private boolean flushBucket(DataBucket bucket) { } @SuppressWarnings("unchecked, rawtypes") - private void flushRemaining(boolean endInput) { + @VisibleForTesting + protected void flushRemaining(boolean endInput) { Review Comment: Yeap, so we can override this function in `BucketStreamWriteTestFunction`. ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunctionWithFailOverTest.java: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.sink.bucket; + +import org.apache.hudi.client.common.HoodieFlinkEngineContext; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsInference; +import org.apache.hudi.configuration.OptionsResolver; +import org.apache.hudi.index.bucket.BucketIdentifier; +import org.apache.hudi.sink.partitioner.BucketIndexPartitioner; +import org.apache.hudi.sink.transform.Transformer; +import org.apache.hudi.sink.utils.Pipelines; +import org.apache.hudi.table.HoodieFlinkTable; +import org.apache.hudi.util.AvroSchemaConverter; +import org.apache.hudi.util.JsonDeserializationFunction; +import org.apache.hudi.util.StreamerUtil; +import org.apache.hudi.utils.FlinkMiniCluster; +import org.apache.hudi.utils.TestConfigurations; +import org.apache.hudi.utils.source.ContinuousFileSource; + +import org.apache.flink.api.common.JobStatus; +import org.apache.flink.api.common.io.FilePathFilter; +import org.apache.flink.api.common.restartstrategy.RestartStrategies; +import org.apache.flink.api.common.typeinfo.BasicTypeInfo; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.java.io.TextInputFormat; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.core.execution.JobClient; +import org.apache.flink.core.fs.Path; +import org.apache.flink.streaming.api.CheckpointingMode; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.source.FileProcessingMode; +import org.apache.flink.table.data.RowData; +import 
org.apache.flink.table.types.logical.RowType; +import org.apache.flink.util.TestLogger; +import org.apache.hadoop.fs.FileStatus; +import org.junit.jupiter.api.extension.ExtendWith; +import org.junit.jupiter.api.io.TempDir; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.io.File; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.stream.Collectors; + +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Integration test for Hudi on Flink with failover when performing writes using the bucket index. + */ +@ExtendWith(FlinkMiniCluster.class) +public class BucketStreamWriteFunctionWithFailOverTest extends TestLogger { + + @TempDir Review Comment: I read through the tests and I feel they cover different scenarios: the IT fails on the first write instead of after the first write, and it exercises the bucket index.
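The testing pattern being discussed — widening `flushRemaining` from `private` to `protected` so a test subclass can intercept the flush and simulate a TaskManager failover — reduces to ordinary method overriding. The classes below are illustrative stand-ins for the real `StreamWriteFunction` and `BucketStreamWriteTestFunction`, not Hudi's actual code:

```java
public class WidenForTesting {
    static class StreamWriteFunction {
        private int flushCount = 0;

        public void endInput() {
            flushRemaining(true);
        }

        // Was private; widened to protected (with @VisibleForTesting in the
        // real code) so a test subclass can intercept the flush.
        protected void flushRemaining(boolean endInput) {
            flushCount++;
        }

        public int flushCount() {
            return flushCount;
        }
    }

    // Test double: simulate a TM failover by failing the very first flush,
    // then behaving normally on recovery.
    static class FailingOnFirstFlushFunction extends StreamWriteFunction {
        private boolean failedOnce = false;

        @Override
        protected void flushRemaining(boolean endInput) {
            if (!failedOnce) {
                failedOnce = true;
                throw new IllegalStateException("simulated failover");
            }
            super.flushRemaining(endInput);
        }
    }
}
```

The trade-off danny0405 is probing ("Do we need this?") is that widening visibility purely for tests leaks a test seam into the production class; the alternative is covering the failover path with an existing harness such as `TestWriteCopyOnWrite#testSubtaskFails`, as suggested later in the thread.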
[jira] [Reopened] (HUDI-2141) Integration flink metric in flink stream
[ https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen reopened HUDI-2141: -- > Integration flink metric in flink stream > > > Key: HUDI-2141 > URL: https://issues.apache.org/jira/browse/HUDI-2141 > Project: Apache Hudi > Issue Type: Improvement > Components: flink, metrics >Reporter: Zhaojing Yu >Assignee: Zhaojing Yu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Hoodie metrics currently don't work in Flink streaming because they were designed for batch processing; integrate Flink metrics into the streaming path.
[jira] [Closed] (HUDI-2141) Integration flink metric in flink stream
[ https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-2141. Resolution: Fixed Fixed via master branch: ade9d0bcb9d7ad7adabfaeb5ff2f42bc0585fdb1 > Integration flink metric in flink stream > > > Key: HUDI-2141 > URL: https://issues.apache.org/jira/browse/HUDI-2141 > Project: Apache Hudi > Issue Type: Improvement > Components: flink, metrics >Reporter: Zhaojing Yu >Assignee: Zhaojing Yu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Hoodie metrics currently don't work in Flink streaming because they were designed for batch processing; integrate Flink metrics into the streaming path.
[hudi] branch master updated (bc583b41586 -> ade9d0bcb9d)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from bc583b41586 [HUDI-6609] Reverting multi writer checkpointing with HoodieStreamer (#9312) add ade9d0bcb9d [HUDI-2141] Support flink read metrics (#9350) No new revisions were added by this update. Summary of changes: hudi-flink-datasource/hudi-flink/pom.xml | 6 ++ .../hudi/metrics/FlinkStreamReadMetrics.java | 96 ++ .../apache/hudi/metrics/HoodieFlinkMetrics.java} | 29 +++ .../hudi/source/StreamReadMonitoringFunction.java | 13 +++ .../org/apache/hudi/source/StreamReadOperator.java | 14 packaging/hudi-flink-bundle/pom.xml| 6 ++ 6 files changed, 146 insertions(+), 18 deletions(-) create mode 100644 hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkStreamReadMetrics.java copy hudi-flink-datasource/hudi-flink/src/{test/java/org/apache/hudi/sink/utils/ScalaCollector.java => main/java/org/apache/hudi/metrics/HoodieFlinkMetrics.java} (62%)
[GitHub] [hudi] danny0405 merged pull request #9350: [HUDI-2141] Support flink read metrics
danny0405 merged PR #9350: URL: https://github.com/apache/hudi/pull/9350
[GitHub] [hudi] danny0405 commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
danny0405 commented on code in PR #9357: URL: https://github.com/apache/hudi/pull/9357#discussion_r1283890740 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunctionWithFailOverTest.java: ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.sink.bucket; + +import org.apache.hudi.client.common.HoodieFlinkEngineContext; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsInference; +import org.apache.hudi.configuration.OptionsResolver; +import org.apache.hudi.index.bucket.BucketIdentifier; +import org.apache.hudi.sink.partitioner.BucketIndexPartitioner; +import org.apache.hudi.sink.transform.Transformer; +import org.apache.hudi.sink.utils.Pipelines; +import org.apache.hudi.table.HoodieFlinkTable; +import org.apache.hudi.util.AvroSchemaConverter; +import org.apache.hudi.util.JsonDeserializationFunction; +import org.apache.hudi.util.StreamerUtil; +import org.apache.hudi.utils.FlinkMiniCluster; +import org.apache.hudi.utils.TestConfigurations; +import org.apache.hudi.utils.source.ContinuousFileSource; + +import org.apache.flink.api.common.JobStatus; +import org.apache.flink.api.common.io.FilePathFilter; +import org.apache.flink.api.common.restartstrategy.RestartStrategies; +import org.apache.flink.api.common.typeinfo.BasicTypeInfo; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.java.io.TextInputFormat; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.core.execution.JobClient; +import org.apache.flink.core.fs.Path; +import org.apache.flink.streaming.api.CheckpointingMode; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.source.FileProcessingMode; +import org.apache.flink.table.data.RowData; +import 
org.apache.flink.table.types.logical.RowType; +import org.apache.flink.util.TestLogger; +import org.apache.hadoop.fs.FileStatus; +import org.junit.jupiter.api.extension.ExtendWith; +import org.junit.jupiter.api.io.TempDir; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.io.File; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.stream.Collectors; + +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Integration test for Hudi on Flink with failover when performing writes using the bucket index. + */ +@ExtendWith(FlinkMiniCluster.class) +public class BucketStreamWriteFunctionWithFailOverTest extends TestLogger { + + @TempDir Review Comment: There is no need to add the IT, I think; `TestWriteCopyOnWrite#testSubtaskFails` can cover this.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
danny0405 commented on code in PR #9357: URL: https://github.com/apache/hudi/pull/9357#discussion_r1283889835 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java: ## @@ -448,7 +448,8 @@ private boolean flushBucket(DataBucket bucket) { } @SuppressWarnings("unchecked, rawtypes") - private void flushRemaining(boolean endInput) { + @VisibleForTesting + protected void flushRemaining(boolean endInput) { Review Comment: Do we need this?
[GitHub] [hudi] danny0405 commented on a diff in pull request #9343: Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"
danny0405 commented on code in PR #9343: URL: https://github.com/apache/hudi/pull/9343#discussion_r1283888081 ## hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java: ## @@ -168,57 +167,66 @@ private List getPartitionPathWithPathPrefixUsingFilterExpression(String // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel: - // if current dictionary contains PartitionMetadata, add it to result - // if current dictionary does not contain PartitionMetadata, add its subdirectory to queue to be processed. + // List all directories in parallel engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. - // and second entry holds optionally a directory path to be processed further. 
- List, Option>> result = engineContext.flatMap(pathsToList, path -> { + List dirToFileListing = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { - return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); -} -return Arrays.stream(fileSystem.listStatus(path, p -> { - try { -return fileSystem.isDirectory(p) && !p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME); - } catch (IOException e) { -// noop - } - return false; -})).map(status -> Pair.of(Option.empty(), Option.of(status.getPath(; +return Arrays.stream(fileSystem.listStatus(path)); }, listingParallelism); pathsToList.clear(); - partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) - .map(entry -> entry.getKey().get()) - .filter(relativePartitionPath -> fullBoundExpr instanceof Predicates.TrueExpression - || (Boolean) fullBoundExpr.eval( - extractPartitionValues(partitionFields, relativePartitionPath, urlEncodePartitioningEnabled))) - .collect(Collectors.toList())); - - Expression partialBoundExpr; - // If partitionPaths is nonEmpty, we're already at the last path level, and all paths - // are filtered already. - if (needPushDownExpressions && partitionPaths.isEmpty()) { -// Here we assume the path level matches the number of partition columns, so we'll rebuild -// new schema based on current path level. -// e.g. partition columns are , if we're listing the second level, then -// currentSchema would be -// `PartialBindVisitor` will bind reference if it can be found from `currentSchema`, otherwise -// will change the expression to `alwaysTrue`. Can see `PartialBindVisitor` for details. 
-Types.RecordType currentSchema = Types.RecordType.get(partitionFields.fields().subList(0, ++currentPartitionLevel)); -PartialBindVisitor partialBindVisitor = new PartialBindVisitor(currentSchema, caseSensitive); -partialBoundExpr = pushedExpr.accept(partialBindVisitor); - } else { -partialBoundExpr = Predicates.alwaysTrue(); - } + // if current dictionary contains PartitionMetadata, add it to result + // if current dictionary does not contain PartitionMetadata, add it to queue to be processed. + int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); + if (!dirToFileListing.isEmpty()) { +// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. +// and second entry holds optionally a directory path to be processed further. +engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); +List, Option>> result = engineContext.map(dirToFileListing, fileStatus -> { + FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); + if (fileStatus.isDirectory()) { +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { Review Comment: The revert makes sense because the code freeze is very near, but we still need to give a clarification for it.
[GitHub] [hudi] danny0405 commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
danny0405 commented on PR #9327: URL: https://github.com/apache/hudi/pull/9327#issuecomment-1664854084 Tests have passed: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19024=results
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664852578 ## CI report: * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048) * 7308946c4344ca04736b2c83b505d3a159146541 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19050)
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664847379 ## CI report: * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045) * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048) * 7308946c4344ca04736b2c83b505d3a159146541 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664847144 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * ef8eaadd4f817aa08253b938e19ab3fa61d27b5c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19046)
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664846409 ## CI report: * 214bc9b6f9c0d522f4196cca12daf4756dc96439 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19044)
[GitHub] [hudi] mansipp commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
mansipp commented on code in PR #9347: URL: https://github.com/apache/hudi/pull/9347#discussion_r1283879950 ## hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java: ## @@ -143,31 +142,32 @@ public void addPartitionsToTable(String tableName, List partitionsToAdd) LOG.info("Adding " + partitionsToAdd.size() + " partition(s) in table " + tableId(databaseName, tableName)); try { Table table = getTable(awsGlue, databaseName, tableName); - StorageDescriptor sd = table.getStorageDescriptor(); + StorageDescriptor sd = table.storageDescriptor(); List partitionInputs = partitionsToAdd.stream().map(partition -> { -StorageDescriptor partitionSd = sd.clone(); String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString(); List partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition); -partitionSd.setLocation(fullPartitionPath); -return new PartitionInput().withValues(partitionValues).withStorageDescriptor(partitionSd); +StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath)); +return PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build(); }).collect(Collectors.toList()); - List> futures = new ArrayList<>(); + List> futures = new ArrayList<>(); for (List batch : CollectionUtils.batches(partitionInputs, MAX_PARTITIONS_PER_REQUEST)) { -BatchCreatePartitionRequest request = new BatchCreatePartitionRequest(); - request.withDatabaseName(databaseName).withTableName(tableName).withPartitionInputList(batch); -futures.add(awsGlue.batchCreatePartitionAsync(request)); +BatchCreatePartitionRequest request = BatchCreatePartitionRequest.builder() + .databaseName(databaseName).tableName(tableName).partitionInputList(batch).build(); +futures.add(awsGlue.batchCreatePartition(request)); } - for (Future future : futures) { -BatchCreatePartitionResult result = future.get(); -if (CollectionUtils.nonEmpty(result.getErrors())) { - if 
(result.getErrors().stream().allMatch((error) -> "AlreadyExistsException".equals(error.getErrorDetail().getErrorCode()))) { -LOG.warn("Partitions already exist in glue: " + result.getErrors()); + for (CompletableFuture future : futures) { +BatchCreatePartitionResponse response = future.get(); +if (CollectionUtils.nonEmpty(response.errors())) { + if (response.errors().stream() + .allMatch( + (error) -> "AlreadyExistsException".equals(error.errorDetail().errorCode()))) { Review Comment: We haven't changed anything here other than making it compatible with sdk v2. Haven't run any test to specifically test this error case.
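The error-handling pattern under discussion (tolerate a batch-create response only when every error is an already-exists error) can be sketched in plain Java, independent of the Glue SDK; `PartitionError` below is a hypothetical stand-in for the SDK's per-partition error entry:

```java
import java.util.Arrays;
import java.util.List;

public class BatchErrorCheck {
    // Hypothetical stand-in for the SDK's per-partition error object.
    static final class PartitionError {
        final String errorCode;
        PartitionError(String errorCode) { this.errorCode = errorCode; }
    }

    // Treat the batch as benign (warn-and-continue in the real code) only
    // when every error code is "AlreadyExistsException"; any other code
    // should surface as a failure.
    static boolean onlyAlreadyExists(List<PartitionError> errors) {
        return errors.stream()
                .allMatch(e -> "AlreadyExistsException".equals(e.errorCode));
    }

    public static void main(String[] args) {
        List<PartitionError> benign = Arrays.asList(
                new PartitionError("AlreadyExistsException"));
        List<PartitionError> fatal = Arrays.asList(
                new PartitionError("AlreadyExistsException"),
                new PartitionError("AccessDeniedException"));
        System.out.println(onlyAlreadyExists(benign)); // true
        System.out.println(onlyAlreadyExists(fatal));  // false
    }
}
```

Note that `allMatch` on an empty stream returns true, which is why the real code guards with `CollectionUtils.nonEmpty(response.errors())` first.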
[GitHub] [hudi] mansipp commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
mansipp commented on code in PR #9347: URL: https://github.com/apache/hudi/pull/9347#discussion_r1283877492 ## hudi-aws/src/main/java/org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.java: ## @@ -153,45 +153,53 @@ public LockItem getLock() { return lock; } - private AmazonDynamoDB getDynamoDBClient() { + private DynamoDbClient getDynamoDBClient() { String region = this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_LOCK_REGION); String endpointURL = this.dynamoDBLockConfiguration.contains(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL.key()) - ? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL) - : RegionUtils.getRegion(region).getServiceEndpoint(AmazonDynamoDB.ENDPOINT_PREFIX); -AwsClientBuilder.EndpointConfiguration dynamodbEndpoint = -new AwsClientBuilder.EndpointConfiguration(endpointURL, region); -return AmazonDynamoDBClientBuilder.standard() -.withEndpointConfiguration(dynamodbEndpoint) - .withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(dynamoDBLockConfiguration.getProps())) +? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL) +: DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString(); + +if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) { Review Comment: We are getting this error if we don't specify http:// or https:// in sdk v2. Based on this doc, DynamoDB supports both http and https, so added a check for both. https://docs.aws.amazon.com/general/latest/gr/ddb.html ``` Caused by: java.lang.NullPointerException: The URI scheme of endpointOverride must not be null. 
at org.apache.hudi.software.amazon.awssdk.utils.Validate.paramNotNull(Validate.java:156) at org.apache.hudi.software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.endpointOverride(SdkDefaultClientBuilder.java:445) at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.getDynamoDBClient(DynamoDBBasedLockProvider.java:163) at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:87) at org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:77) ... 65 more ```
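A minimal sketch (hypothetical helper, not the PR's code) of the guard being described: SDK v2's `endpointOverride` rejects a URI with no scheme, so one must be prepended when absent. Note that detecting a *missing* scheme needs `&&` rather than `||`, since `!https || !http` is true for every string:

```java
public class EndpointGuard {
    // Prepend a scheme only when the endpoint has neither an "http://" nor
    // an "https://" prefix; DynamoDB endpoints accept both, per AWS docs.
    static String withScheme(String endpointURL) {
        if (!endpointURL.startsWith("https://") && !endpointURL.startsWith("http://")) {
            return "https://" + endpointURL; // assumption: default to TLS
        }
        return endpointURL;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("dynamodb.us-east-1.amazonaws.com"));
        System.out.println(withScheme("http://localhost:8000"));
    }
}
```

The normalized string can then be passed to `endpointOverride(URI.create(...))` without tripping the null-scheme validation above.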
[GitHub] [hudi] yihua commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
yihua commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664836797 @mansipp could you also check the Azure CI failure?
[GitHub] [hudi] yihua commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2
yihua commented on code in PR #9347: URL: https://github.com/apache/hudi/pull/9347#discussion_r1283849039 ## pom.xml: ## @@ -130,6 +130,8 @@ 1.5.6 0.16 0.8.0 +4.5.13 +4.4.13 Review Comment: Any reason we pick different versions here? ## hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java: ## @@ -143,31 +142,32 @@ public void addPartitionsToTable(String tableName, List partitionsToAdd) LOG.info("Adding " + partitionsToAdd.size() + " partition(s) in table " + tableId(databaseName, tableName)); try { Table table = getTable(awsGlue, databaseName, tableName); - StorageDescriptor sd = table.getStorageDescriptor(); + StorageDescriptor sd = table.storageDescriptor(); List partitionInputs = partitionsToAdd.stream().map(partition -> { -StorageDescriptor partitionSd = sd.clone(); String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), partition).toString(); List partitionValues = partitionValueExtractor.extractPartitionValuesInPath(partition); -partitionSd.setLocation(fullPartitionPath); -return new PartitionInput().withValues(partitionValues).withStorageDescriptor(partitionSd); +StorageDescriptor partitionSD = sd.copy(copySd -> copySd.location(fullPartitionPath)); +return PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build(); }).collect(Collectors.toList()); - List> futures = new ArrayList<>(); + List> futures = new ArrayList<>(); for (List batch : CollectionUtils.batches(partitionInputs, MAX_PARTITIONS_PER_REQUEST)) { -BatchCreatePartitionRequest request = new BatchCreatePartitionRequest(); - request.withDatabaseName(databaseName).withTableName(tableName).withPartitionInputList(batch); -futures.add(awsGlue.batchCreatePartitionAsync(request)); +BatchCreatePartitionRequest request = BatchCreatePartitionRequest.builder() + .databaseName(databaseName).tableName(tableName).partitionInputList(batch).build(); +futures.add(awsGlue.batchCreatePartition(request)); } - for (Future future : futures) { 
-BatchCreatePartitionResult result = future.get(); -if (CollectionUtils.nonEmpty(result.getErrors())) { - if (result.getErrors().stream().allMatch((error) -> "AlreadyExistsException".equals(error.getErrorDetail().getErrorCode()))) { -LOG.warn("Partitions already exist in glue: " + result.getErrors()); + for (CompletableFuture future : futures) { +BatchCreatePartitionResponse response = future.get(); +if (CollectionUtils.nonEmpty(response.errors())) { + if (response.errors().stream() + .allMatch( + (error) -> "AlreadyExistsException".equals(error.errorDetail().errorCode()))) { Review Comment: I assume the error name is the same here and this should still work. Is the error case tested? ## hudi-utilities/pom.xml: ## @@ -450,6 +450,14 @@ test + + + + software.amazon.awssdk + sqs + ${aws.sdk.version} + Review Comment: Same question as before: https://github.com/apache/hudi/pull/8441/files#r1164644285 Should this be moved to `hudi-aws` module (`hudi-utilities` module has already relied on `hudi-aws` module)? ## hudi-aws/src/main/java/org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.java: ## @@ -153,45 +153,53 @@ public LockItem getLock() { return lock; } - private AmazonDynamoDB getDynamoDBClient() { + private DynamoDbClient getDynamoDBClient() { String region = this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_LOCK_REGION); String endpointURL = this.dynamoDBLockConfiguration.contains(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL.key()) - ? 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL) - : RegionUtils.getRegion(region).getServiceEndpoint(AmazonDynamoDB.ENDPOINT_PREFIX); -AwsClientBuilder.EndpointConfiguration dynamodbEndpoint = -new AwsClientBuilder.EndpointConfiguration(endpointURL, region); -return AmazonDynamoDBClientBuilder.standard() -.withEndpointConfiguration(dynamodbEndpoint) - .withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(dynamoDBLockConfiguration.getProps())) +? this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL) +: DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString(); + +if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) { Review Comment: could the endpoint URL start without the HTTP prefix?
[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer
leesf commented on code in PR #9199: URL: https://github.com/apache/hudi/pull/9199#discussion_r1283870818 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/ConsistentBucketBulkInsertDataInternalWriterHelper.java: ## @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.table.action.commit; + +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.ConsistentHashingNode; +import org.apache.hudi.common.model.HoodieConsistentHashingMetadata; +import org.apache.hudi.common.model.HoodieFileGroupId; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.index.bucket.ConsistentBucketIdentifier; +import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils; +import org.apache.hudi.io.storage.row.HoodieRowCreateHandle; +import org.apache.hudi.table.HoodieTable; +import org.apache.hudi.table.action.cluster.util.ConsistentHashingUpdateStrategyUtils; + +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.types.StructType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.Collections; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * Helper class for native row writer for bulk_insert with consistent hashing bucket index. + */ +public class ConsistentBucketBulkInsertDataInternalWriterHelper extends BucketBulkInsertDataInternalWriterHelper { + + private static final Logger LOG = LoggerFactory.getLogger(ConsistentBucketBulkInsertDataInternalWriterHelper.class); + + public ConsistentBucketBulkInsertDataInternalWriterHelper(HoodieTable hoodieTable, HoodieWriteConfig writeConfig, Review Comment: the constructor has too many args, which can wrap in a class? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer
leesf commented on code in PR #9199: URL: https://github.com/apache/hudi/pull/9199#discussion_r1283870066 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java: ## @@ -65,7 +71,6 @@ public void write(InternalRow row) throws IOException { int bucketId = BucketIdentifier.getBucketId(String.valueOf(recordKey), indexKeyFields, bucketNum); Pair fileId = Pair.of(partitionPath, bucketId); if (lastFileId == null || !lastFileId.equals(fileId)) { -LOG.info("Creating new file for partition path " + partitionPath); Review Comment: this log is useless then?
[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer
leesf commented on code in PR #9199: URL: https://github.com/apache/hudi/pull/9199#discussion_r1283867267 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java: ## @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.execution.bulkinsert; + +import org.apache.hudi.common.model.ConsistentHashingNode; +import org.apache.hudi.common.model.HoodieConsistentHashingMetadata; +import org.apache.hudi.common.model.HoodieTableType; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.index.bucket.ConsistentBucketIdentifier; +import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils; +import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex; +import org.apache.hudi.keygen.BuiltinKeyGenerator; +import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory; +import org.apache.hudi.table.BulkInsertPartitioner; +import org.apache.hudi.table.ConsistentHashingBucketInsertPartitioner; +import org.apache.hudi.table.HoodieTable; + +import org.apache.spark.Partitioner; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +import scala.Tuple2; + +/** + * Bulk_insert partitioner of Spark row using consistent hashing bucket index. 
+ */ +public class ConsistentBucketIndexBulkInsertPartitionerWithRows +implements BulkInsertPartitioner>, ConsistentHashingBucketInsertPartitioner { + + private final HoodieTable table; + + private final String indexKeyFields; + + private final List fileIdPfxList = new ArrayList<>(); + private final Map> hashingChildrenNodes; + + private Map partitionToIdentifier; + + private final Option keyGeneratorOpt; + + private Map> partitionToFileIdPfxIdxMap; + + private final RowRecordKeyExtractor extractor; + + public ConsistentBucketIndexBulkInsertPartitionerWithRows(HoodieTable table, boolean populateMetaFields) { +this.indexKeyFields = table.getConfig().getBucketIndexHashField(); +this.table = table; +this.hashingChildrenNodes = new HashMap<>(); +if (!populateMetaFields) { + this.keyGeneratorOpt = HoodieSparkKeyGeneratorFactory.getKeyGenerator(table.getConfig().getProps()); +} else { + this.keyGeneratorOpt = Option.empty(); +} +this.extractor = RowRecordKeyExtractor.getRowRecordKeyExtractor(populateMetaFields, keyGeneratorOpt); + ValidationUtils.checkArgument(table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ), +"Consistent hash bucket index doesn't support CoW table"); + } + + private ConsistentBucketIdentifier getBucketIdentifier(String partition) { +HoodieSparkConsistentBucketIndex index = (HoodieSparkConsistentBucketIndex) table.getIndex(); +HoodieConsistentHashingMetadata metadata = ConsistentBucketIndexUtils.loadOrCreateMetadata(this.table, partition, index.getNumBuckets()); +if (hashingChildrenNodes.containsKey(partition)) { + metadata.setChildrenNodes(hashingChildrenNodes.get(partition)); +} +return new ConsistentBucketIdentifier(metadata); + } + + @Override + public Dataset repartitionRecords(Dataset rows, int outputPartitions) { +JavaRDD rowJavaRDD = rows.toJavaRDD(); +prepareRepartition(rowJavaRDD); +Partitioner partitioner = new Partitioner() { + @Override + public int getPartition(Object key) { +return (int) key; + } + + @Override + 
public int numPartitions() { +return fileIdPfxList.size(); + } +}; + +return rows.sparkSession().createDataFrame(rowJavaRDD +.mapToPair(row -> new Tuple2<>(getBucketId(row), row)) +.partitionBy(partitioner) +.values(), rows.schema()); + } + + /** + * Prepare consistent hashing metadata for repartition + * + * @param rows input records + */ + private void prepareRepartition(JavaRDD rows) { +this.partitionToIdentifier =
[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer
leesf commented on code in PR #9199: URL: https://github.com/apache/hudi/pull/9199#discussion_r1283865112 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkConsistentBucketClusteringExecutionStrategy.java: ## @@ -79,15 +94,19 @@ public HoodieData performClusteringWithRecordsRDD(HoodieData partitioner = new RDDConsistentBucketBulkInsertPartitioner<>(getHoodieTable(), strategyParams, preserveHoodieMetadata); +addHashingChildNodes(partitioner, extraMetadata); + +return (HoodieData) SparkBulkInsertHelper.newInstance() .bulkInsert(inputRecords, instantTime, getHoodieTable(), newConfig, false, partitioner, true, numOutputGroups); + } + + private void addHashingChildNodes(ConsistentHashingBucketInsertPartitioner partitioner, Map extraMetadata) { try { List nodes = ConsistentHashingNode.fromJsonString(extraMetadata.get(BaseConsistentHashingBucketClusteringPlanStrategy.METADATA_CHILD_NODE_KEY)); Review Comment: will the nodes be null and need a check in addHashingChildrenNodes?
[GitHub] [hudi] amrishlal commented on a diff in pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
amrishlal commented on code in PR #9359: URL: https://github.com/apache/hudi/pull/9359#discussion_r1283857986 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -194,9 +194,9 @@ trait ProvidesHoodieConfig extends Logging { val insertMode = InsertMode.of(combinedOpts.getOrElse(DataSourceWriteOptions.SQL_INSERT_MODE.key, DataSourceWriteOptions.SQL_INSERT_MODE.defaultValue())) val insertModeSet = combinedOpts.contains(SQL_INSERT_MODE.key) -val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key()) +val sqlWriteOperationOpt = combinedOpts.get(SPARK_SQL_INSERT_INTO_OPERATION.key()) val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty -val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue()) +val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SPARK_SQL_INSERT_INTO_OPERATION.defaultValue()) Review Comment: Fixed.
[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
hudi-bot commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1664818603 ## CI report: * abf8721378220e0d669aee49599a231ceff34e19 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19030) * 884a71af797b71a7f5818472884e45f39f758328 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19049)
[GitHub] [hudi] nsivabalan commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
nsivabalan commented on PR #9327: URL: https://github.com/apache/hudi/pull/9327#issuecomment-1664806952 @prashantwason can you review?
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664806676 ## CI report: * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045) * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048)
[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
hudi-bot commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1664805693 ## CI report: * abf8721378220e0d669aee49599a231ceff34e19 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19030) * 884a71af797b71a7f5818472884e45f39f758328 UNKNOWN
[GitHub] [hudi] yihua commented on a diff in pull request #8441: Upgrade aws java sdk to v2
yihua commented on code in PR #8441: URL: https://github.com/apache/hudi/pull/8441#discussion_r1283850979

## hudi-aws/src/main/java/org/apache/hudi/aws/credentials/HoodieAWSCredentialsProviderFactory.java:

```diff
@@ -30,16 +30,23 @@
  * Factory class for Hoodie AWSCredentialsProvider.
  */
 public class HoodieAWSCredentialsProviderFactory {
-  public static AWSCredentialsProvider getAwsCredentialsProvider(Properties props) {
+  public static AwsCredentialsProvider getAwsCredentialsProvider(Properties props) {
     return getAwsCredentialsProviderChain(props);
   }

-  private static AWSCredentialsProvider getAwsCredentialsProviderChain(Properties props) {
-    List providers = new ArrayList<>();
-    providers.add(new HoodieConfigAWSCredentialsProvider(props));
-    providers.add(new DefaultAWSCredentialsProviderChain());
-    AWSCredentialsProviderChain providerChain = new AWSCredentialsProviderChain(providers);
-    providerChain.setReuseLastProvider(true);
+  private static AwsCredentialsProvider getAwsCredentialsProviderChain(Properties props) {
+    List providers = new ArrayList<>();
+    HoodieConfigAWSCredentialsProvider hoodieConfigAWSCredentialsProvider = new HoodieConfigAWSCredentialsProvider(props);
+    if (hoodieConfigAWSCredentialsProvider.resolveCredentials() != null) {
+      providers.add(hoodieConfigAWSCredentialsProvider);
```

Review Comment: Got it. Then we can keep it this way.
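The net effect of this v2 migration is a chain-of-responsibility: the Hudi-config-backed provider is only added to the chain when it can actually resolve credentials, and the SDK's default chain remains the fallback. Below is a dependency-free Java sketch of that selection logic; the `Provider` interface, `fromProps`, and the `hoodie.aws.access.key` property name are illustrative stand-ins, not the real SDK or Hudi types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class CredentialsChainSketch {
  // Stand-in for the SDK's AwsCredentialsProvider interface.
  interface Provider {
    String resolveCredentials(); // returns null when this source has nothing to offer
  }

  // Stand-in for HoodieConfigAWSCredentialsProvider: reads keys from Hudi props.
  static Provider fromProps(Properties props) {
    return () -> props.getProperty("hoodie.aws.access.key");
  }

  // Stand-in for the default chain (env vars, instance profile, ...).
  static Provider defaultChain() {
    return () -> "credentials-from-environment";
  }

  // Mirrors the factory: only add the config-backed provider if it resolves.
  static List<Provider> buildChain(Properties props) {
    List<Provider> providers = new ArrayList<>();
    Provider configProvider = fromProps(props);
    if (configProvider.resolveCredentials() != null) {
      providers.add(configProvider);
    }
    providers.add(defaultChain());
    return providers;
  }

  // First provider in the chain that yields credentials wins.
  static String resolve(List<Provider> chain) {
    for (Provider p : chain) {
      String creds = p.resolveCredentials();
      if (creds != null) {
        return creds;
      }
    }
    throw new IllegalStateException("no credentials found");
  }

  public static void main(String[] args) {
    Properties withKeys = new Properties();
    withKeys.setProperty("hoodie.aws.access.key", "from-hudi-config");
    System.out.println(resolve(buildChain(withKeys)));          // from-hudi-config
    System.out.println(resolve(buildChain(new Properties())))); // never reaches here? no:
  }
}
```

The real `software.amazon.awssdk.auth.credentials.AwsCredentialsProviderChain` behaves the same way, returning credentials from the first provider in the chain that can supply them.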
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664798077

## CI report:

* e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
* 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9308: [HUDI-6606] Use record level index with SQL equality queries
hudi-bot commented on PR #9308: URL: https://github.com/apache/hudi/pull/9308#issuecomment-1664797937

## CI report:

* ae0002b81c71f77e0c19aeb3e5872b0eb399ddb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19043)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
nsivabalan commented on code in PR #9359: URL: https://github.com/apache/hudi/pull/9359#discussion_r1283833724

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:

```diff
@@ -194,9 +194,9 @@ trait ProvidesHoodieConfig extends Logging {
     val insertMode = InsertMode.of(combinedOpts.getOrElse(DataSourceWriteOptions.SQL_INSERT_MODE.key,
       DataSourceWriteOptions.SQL_INSERT_MODE.defaultValue()))
     val insertModeSet = combinedOpts.contains(SQL_INSERT_MODE.key)
-    val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key())
+    val sqlWriteOperationOpt = combinedOpts.get(SPARK_SQL_INSERT_INTO_OPERATION.key())
     val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty
-    val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue())
+    val sqlWriteOperation = sqlWriteOperationOpt.getOrElse(SPARK_SQL_INSERT_INTO_OPERATION.defaultValue())
```

Review Comment: Minor: can we rename the method `deduceSqlWriteOperation` to `deduceSparkSqlInsertIntoWriteOperation`?
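Config renames like this commonly ship with a deprecation window in which the old key keeps working. A hedged Java sketch of that lookup order follows; only the two key names come from the PR title, while the fallback behavior and the `insert` default are assumptions for illustration, not necessarily what this PR implements.

```java
import java.util.Map;
import java.util.Optional;

public class ConfigRenameSketch {
  static final String NEW_KEY = "hoodie.spark.sql.insert.into.operation";
  static final String OLD_KEY = "hoodie.sql.write.operation"; // deprecated alias (assumed)
  static final String DEFAULT_OPERATION = "insert";           // assumed default

  // Prefer the new key, fall back to the deprecated key, then the default.
  static String deduceWriteOperation(Map<String, String> opts) {
    return Optional.ofNullable(opts.get(NEW_KEY))
        .or(() -> Optional.ofNullable(opts.get(OLD_KEY)))
        .orElse(DEFAULT_OPERATION);
  }

  public static void main(String[] args) {
    System.out.println(deduceWriteOperation(Map.of())); // insert
  }
}
```

With this ordering, a job that still sets only the old key behaves as before, while the new key wins whenever both are present.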
[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsr310 package.
amrishlal commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1283805598

## pom.xml:

```diff
@@ -98,8 +98,6 @@
 ${fasterxml.spark3.version}
 ${fasterxml.spark3.version}
 ${fasterxml.spark3.version}
-
-
```

Review Comment: @xushiyan Fixed. Also verified that the correct version of the JSR-310 package (matching what is included in the Spark 3.x runtime) is being included.
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664733692

## CI report:

* 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
* 7e7efc78003b0e8ef5c2d809276796b7b987a35b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19047)
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664728409

## CI report:

* 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
* 7e7efc78003b0e8ef5c2d809276796b7b987a35b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664723220

## CI report:

* 5c1391571fbed3ce391399b7848f33b629455941 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041)
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664686919

## CI report:

* 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
* 1875a19fd05f413373eb1f2400f390706d62725e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
* ef8eaadd4f817aa08253b938e19ab3fa61d27b5c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19046)
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664679688

## CI report:

* 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
* af768285ff85e68a361482806a5de1e0b9c272a9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19035)
* 1875a19fd05f413373eb1f2400f390706d62725e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
* ef8eaadd4f817aa08253b938e19ab3fa61d27b5c UNKNOWN
[GitHub] [hudi] mansipp commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to v2
mansipp commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664649658

Manually tested the s3a path using an EMR cluster.

```shell
spark-shell \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --jars /usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar
```

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.hudi.hive.HiveSyncConfig
import org.apache.hudi.sync.common.HoodieSyncConfig

// Create a DataFrame
var tableName = "mansi_s3a_hudi_test"
var tablePath = "s3a:///tables/" + tableName

val inputDF = Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
  ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
  ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
  ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")

// Specify common DataSourceWriteOptions in the single hudiOptions variable
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> tableName,
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)

// Write the DataFrame as a Hudi dataset
(inputDF.write
  .format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save(tablePath))
```

![Image](https://github.com/apache/hudi/assets/107222979/672d5057-cbe5-43f7-b927-838510b1220c)
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664628275

## CI report:

* e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
[GitHub] [hudi] jonvex commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
jonvex commented on code in PR #9276: URL: https://github.com/apache/hudi/pull/9276#discussion_r1283702970

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapMORRelation.scala:

```diff
@@ -58,10 +59,13 @@ case class HoodieBootstrapMORRelation(override val sqlContext: SQLContext,
                                       override val optParams: Map[String, String],
                                       private val prunedDataSchema: Option[StructType] = None)
   extends BaseHoodieBootstrapRelation(sqlContext, userSchema, globPaths, metaClient,
-    optParams, prunedDataSchema) {
+    optParams, prunedDataSchema) with SparkAdapterSupport {

   override type Relation = HoodieBootstrapMORRelation

+  protected val mergeType: String = optParams.getOrElse(DataSourceReadOptions.REALTIME_MERGE.key,
+    DataSourceReadOptions.REALTIME_MERGE.defaultValue)
+
```

Review Comment: Nope. Removed.
[GitHub] [hudi] xushiyan commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsr310 package.
xushiyan commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1283702886

## pom.xml:

```diff
@@ -98,8 +98,6 @@
 ${fasterxml.spark3.version}
 ${fasterxml.spark3.version}
 ${fasterxml.spark3.version}
-
-
```

Review Comment: As this variable is removed, can you do a global search and make sure all refs are updated?
[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
hudi-bot commented on PR #9359: URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664617802

## CI report:

* e8163de397fcdbf1b3194bcf696435be9d28c171 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery
hudi-bot commented on PR #9357: URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664608150

## CI report:

* d8e159b823f516f584802bd3dacdaa782f185854 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038)
[GitHub] [hudi] jonvex commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
jonvex commented on code in PR #9276: URL: https://github.com/apache/hudi/pull/9276#discussion_r1283684686

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

```diff
@@ -247,11 +247,23 @@ object DefaultSource {
       Option(schema)
     }
+
+
+
     if (metaClient.getCommitsTimeline.filterCompletedInstants.countInstants() == 0) {
       new EmptyRelation(sqlContext, resolveSchema(metaClient, parameters, Some(schema)))
     } else if (isCdcQuery) {
       CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
     } else {
+      lazy val newHudiFileFormatUtils = if (!parameters.getOrElse(LEGACY_HUDI_PARQUET_FILE_FORMAT.key,
+        LEGACY_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean && (globPaths == null || globPaths.isEmpty)
+        && parameters.getOrElse(REALTIME_MERGE.key(), REALTIME_MERGE.defaultValue())
+        .equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)) {
```

Review Comment: Yes. I wasn't able to get it to work correctly before the code freeze.
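The gate in this diff bundles three conditions: the legacy parquet file format flag is off, no glob paths were supplied, and the merge type is payload-combine. A small Java sketch of that deduction follows; the two config key strings are assumptions spelled out for illustration, while the Scala source uses the `LEGACY_HUDI_PARQUET_FILE_FORMAT` and `REALTIME_MERGE` config objects instead.

```java
import java.util.List;
import java.util.Map;

public class NewFileFormatGate {
  // Mirrors the condition in DefaultSource: legacy format disabled,
  // no glob paths supplied, and REALTIME_PAYLOAD_COMBINE merge semantics.
  static boolean useNewFileFormat(Map<String, String> params, List<String> globPaths) {
    boolean legacyFormat = Boolean.parseBoolean(
        params.getOrDefault("hoodie.datasource.read.use.legacy.parquet.file.format", "true"));
    boolean noGlobs = globPaths == null || globPaths.isEmpty();
    String mergeType = params.getOrDefault("hoodie.datasource.merge.type", "payload_combine");
    return !legacyFormat && noGlobs && "payload_combine".equalsIgnoreCase(mergeType);
  }

  public static void main(String[] args) {
    // Defaults alone leave the legacy path in place.
    System.out.println(useNewFileFormat(Map.of(), List.of())); // false
  }
}
```

Declaring `newHudiFileFormatUtils` as `lazy val` behind this gate means the utils object is only ever constructed on the branches that actually consult it.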
[jira] [Updated] (HUDI-6639) Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
[ https://issues.apache.org/jira/browse/HUDI-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6639: - Labels: pull-request-available (was: ) > Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation > --- > > Key: HUDI-6639 > URL: https://issues.apache.org/jira/browse/HUDI-6639 > Project: Apache Hudi > Issue Type: Bug >Reporter: Amrish Lal >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] amrishlal opened a new pull request, #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
amrishlal opened a new pull request, #9359: URL: https://github.com/apache/hudi/pull/9359

Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

### Change Logs

Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation.

### Impact

Rename config to better describe what it is for.

### Risk level (write none, low, medium, or high below)

None

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [X] Change Logs and Impact were stated clearly
- [X] Adequate tests were added if applicable
- [X] CI passed
[jira] [Created] (HUDI-6639) Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
Amrish Lal created HUDI-6639: Summary: Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation Key: HUDI-6639 URL: https://issues.apache.org/jira/browse/HUDI-6639 Project: Apache Hudi Issue Type: Bug Reporter: Amrish Lal
[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750879#comment-17750879 ] Krishen Bhan edited comment on HUDI-6596 at 8/3/23 8:15 PM:

Thanks for the reply!

> The table lock could become a bottleneck, potentially leading to performance issues as other operations might be blocked too. It might be useful to consider how frequently you expect concurrent rollbacks to occur and whether this might create a performance problem.

Based on our use case at least, I would expect attempted concurrent rollbacks to be uncommon / an edge case for now, and more rare as my organization internally makes more fixes/changes to our orchestration process. But you bring up a good concern: because this locks around the scheduleRollback call, it will block other hudi writers from progressing while it generates the rollback plan, which might be relatively time-consuming (since in spark HUDI 0.10 it can lead to launching a new spark stage), at least compared to other places where HUDI holds the table lock. If this becomes a concern for other HUDI users (and we want HUDI OCC locking to avoid holding the table lock while reading/creating an arbitrary/unbounded number of instants/data files in parallel), one solution I can think of would be to refactor the HoodieTable scheduleRollback call such that we can first create the rollback plan before acquiring the table lock (in my proposed approach), and then later, if we actually call HoodieTable scheduleRollback, we just pass in this existing rollback plan. This way the actual "work" needed to generate the rollback plan is done before locking, and we only check and update instant file(s) while under lock (and since in my proposed approach we anyway only schedule a rollback if none has ever been scheduled before, I think we shouldn't have to worry about the rollback plan we created becoming invalid/stale).
Though I'm not sure how feasible this is, since it may affect public APIs.

> Can we ensure rollbacks are idempotent in case of repeated failures or retries?

Yes, making sure rollbacks are idempotent (in the sense that a pending rollback can keep being retried until success) is a must. Both the current/original implementation and the proposed approach should address this. Both approaches/implementations handle the case where the rollback is pending but the instant to roll back is gone (where we need to re-materialize the instant info and tell the rollback execution to not actually delete instants). Also, in the proposed approach, in step 3 we directly use the pending rollback instant that we observe in the timeline, if one exists. Unfortunately the logic and my phrasing for step 3 are a bit awkward: because the caller can pass in a pendingRollbackInfo that it expects to be executed, I decided that we couldn't just ignore it, but rather needed to validate that this pendingRollbackInfo is the same as the pending rollback instant we just saw in the timeline, and abort the rollback if not.

> Worth considering edge cases where heartbeats could become stale or be missed (e.g., if a job crashes without properly closing its heartbeat). Handling these scenarios gracefully will help ensure that rollbacks can still proceed when needed.

Yeah, if a failed rollback job doesn't clean up the heartbeat after failure, any rollback attempt right after (within the interval of `[heartbeat timeout * (allowed heartbeat misses + 1)]`, I think) will fail. The alternative (that I can think of) would be to simply allow for the chance of 2+ rollback jobs working on the same rollback instant. The issue, though, is that even if the HoodieTable executeRollback implementation prevents the dataset from being corrupted and just produces a retry-able failure, I thought it would be noisy/tricky for a user to understand/debug.
So I decided to add Step 4, since from my perspective/understanding it was a tradeoff between "always failing with an easy to understand retry-able exception" versus "rarely failing with a hard to understand retry-able exception".
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750879#comment-17750879 ] Krishen Bhan commented on HUDI-6596:

Thanks for the reply!

> The table lock could become a bottleneck, potentially leading to performance issues as other operations might be blocked too. It might be useful to consider how frequently you expect concurrent rollbacks to occur and whether this might create a performance problem.

Based on our use case at least, I would expect attempted concurrent rollbacks to be uncommon / an edge case for now, and more rare as my organization internally makes more fixes/changes to our orchestration process. But you bring up a good concern: because this locks around the scheduleRollback call, it will block other hudi writers from progressing while it generates the rollback plan, which might be relatively time-consuming (since in spark HUDI 0.10 it can lead to launching a new spark stage), at least compared to other places where HUDI holds the table lock. If this becomes a concern for other HUDI users (and we want HUDI OCC locking to avoid holding the table lock while reading/creating an arbitrary/unbounded number of instants/data files in parallel), one solution I can think of would be to refactor the HoodieTable scheduleRollback call such that we can first create the rollback plan before acquiring the table lock (in my proposed approach), and then later, if we actually call HoodieTable scheduleRollback, we just pass in this existing rollback plan. This way the actual "work" needed to generate the rollback plan is done before locking, and we only check and update instant file(s) while under lock (and since in my proposed approach we anyway only schedule a rollback if none has ever been scheduled before, I think we shouldn't have to worry about the rollback plan we created becoming invalid/stale). Though I'm not sure how feasible this is, since it may affect public APIs.
> Can we ensure rollbacks are idempotent in case of repeated failures or retries?

Yes, making sure rollbacks are idempotent (in the sense that a pending rollback can keep being retried until success) is a must. Both the current/original implementation and the proposed approach should address this. Both approaches/implementations handle the case where the rollback is pending but the instant to roll back is gone (where we need to re-materialize the instant info and tell the rollback execution to not actually delete instants). Also, in the proposed approach, in step 3 we directly use the pending rollback instant that we observe in the timeline, if one exists. Unfortunately the logic and my phrasing for step 3 are a bit awkward: because the caller can pass in a pendingRollbackInfo that it expects to be executed, I decided that we couldn't just ignore it, but rather needed to validate that this pendingRollbackInfo is the same as the pending rollback instant we just saw in the timeline, and abort the rollback if not.

> Worth considering edge cases where heartbeats could become stale or be missed (e.g., if a job crashes without properly closing its heartbeat). Handling these scenarios gracefully will help ensure that rollbacks can still proceed when needed.

Yeah, if a failed rollback job doesn't clean up the heartbeat after failure, any rollback attempt right after (within the interval of `[heartbeat timeout * (allowed heartbeat misses + 1)]`, I think) will fail. The alternative (that I can think of) would be to simply allow for the chance of 2+ rollback jobs working on the same rollback instant. The issue, though, is that even if the HoodieTable executeRollback implementation prevents the dataset from being corrupted and just produces a retry-able failure, I thought it would be noisy/tricky for a user to understand/debug.
So I decided to add Step 4, since from my perspective/understanding it was a tradeoff between "always failing with an easy to understand retry-able exception" versus "rarely failing with a hard to understand retry-able exception". Though I think the chance of a "stale heartbeat" can be lowered by updating the commit code path to clean the heartbeat if an exception is raised (this isn't done in 0.10 I think, but I'm not sure if it's still like this in 0.14). In fact my organization internally does this with our internal forked/modified version of clustering replacecommit (we have modified replacecommit to not have an immutable plan, similar to commit). Though of course this isn't a guarantee, since a HUDI spark job writer might fail with a lower-level runtime error.
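The sequencing described across these comments — build the rollback plan outside the lock, then validate and commit it under the table lock, refusing when a live heartbeat indicates a concurrent rollback job — can be sketched as follows. This is a simplified, self-contained Java illustration; the lock, the in-memory heartbeat store, and the timeout are stand-ins for Hudi's actual transaction manager and heartbeat client, not its real API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackGuardSketch {
  static final long HEARTBEAT_TIMEOUT_MS = 60_000; // assumed timeout
  final ReentrantLock tableLock = new ReentrantLock();
  // instantTime -> last heartbeat timestamp of the job rolling it back
  final Map<String, Long> heartbeats = new ConcurrentHashMap<>();

  // Step 1: the expensive plan generation happens *outside* the lock.
  String buildRollbackPlan(String instantToRollback) {
    return "plan-for-" + instantToRollback;
  }

  // Steps 2-4: under the table lock, refuse if another job's heartbeat is
  // still live, otherwise record our heartbeat and commit the pre-built plan.
  boolean tryScheduleRollback(String instant, long nowMs) {
    String plan = buildRollbackPlan(instant); // done before locking
    tableLock.lock();
    try {
      Long lastBeat = heartbeats.get(instant);
      if (lastBeat != null && nowMs - lastBeat < HEARTBEAT_TIMEOUT_MS) {
        return false; // a concurrent rollback job is still alive
      }
      heartbeats.put(instant, nowMs); // start our heartbeat
      commitPlan(plan);
      return true;
    } finally {
      tableLock.unlock();
    }
  }

  void commitPlan(String plan) { /* e.g., write a rollback.requested instant */ }

  public static void main(String[] args) {
    RollbackGuardSketch guard = new RollbackGuardSketch();
    System.out.println(guard.tryScheduleRollback("0051", 0L));     // true
    System.out.println(guard.tryScheduleRollback("0051", 1_000L)); // false: heartbeat fresh
  }
}
```

The point of the split is that only the cheap check-and-commit runs under the lock; a crashed job's heartbeat simply goes stale, after which a later attempt on the same instant is allowed through.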
[GitHub] [hudi] yihua commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
yihua commented on code in PR #9276: URL: https://github.com/apache/hudi/pull/9276#discussion_r1283625541 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ## @@ -247,11 +247,23 @@ object DefaultSource { Option(schema) } + + + Review Comment: nit: remove empty lines ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ## @@ -262,16 +274,28 @@ object DefaultSource { new IncrementalRelation(sqlContext, parameters, userSchema, metaClient) case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) => - new MergeOnReadSnapshotRelation(sqlContext, parameters, metaClient, globPaths, userSchema) + if (newHudiFileFormatUtils.isEmpty || newHudiFileFormatUtils.get.hasSchemaOnRead) { Review Comment: Instead of checking `newHudiFileFormatUtils.get.hasSchemaOnRead`, should `hoodie.schema.on.read.enable` config be checked when lazy creating `newHudiFileFormatUtils` and only check `newHudiFileFormatUtils.isEmpty` here for simplicity? ## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala: ## @@ -77,6 +78,9 @@ trait HoodieCatalystPlansUtils { */ def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan, Expression)] + + def applyNewHoodieParquetFileFormatProjection(plan: LogicalPlan): LogicalPlan Review Comment: nit: add docs to explain what this projection does, e.g. resolving the schema based on the resolver, and why this is needed for the new file format. 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ## @@ -262,16 +274,28 @@ object DefaultSource { new IncrementalRelation(sqlContext, parameters, userSchema, metaClient) case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) => - new MergeOnReadSnapshotRelation(sqlContext, parameters, metaClient, globPaths, userSchema) + if (newHudiFileFormatUtils.isEmpty || newHudiFileFormatUtils.get.hasSchemaOnRead) { +new MergeOnReadSnapshotRelation(sqlContext, parameters, metaClient, globPaths, userSchema) + } else { +newHudiFileFormatUtils.get.getHadoopFsRelation(isMOR = true, isBootstrap = false) + } case (MERGE_ON_READ, QUERY_TYPE_INCREMENTAL_OPT_VAL, _) => new MergeOnReadIncrementalRelation(sqlContext, parameters, metaClient, userSchema) case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, true) => - new HoodieBootstrapMORRelation(sqlContext, userSchema, globPaths, metaClient, parameters) + if (newHudiFileFormatUtils.isEmpty || newHudiFileFormatUtils.get.hasSchemaOnRead) { Review Comment: Similar here and below. 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala: ## @@ -87,6 +87,14 @@ object DataSourceReadOptions { s"payload implementation to merge (${REALTIME_PAYLOAD_COMBINE_OPT_VAL}) or skip merging altogether" + s"${REALTIME_SKIP_MERGE_OPT_VAL}") + val LEGACY_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty Review Comment: ```suggestion val USE_LEGACY_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty ``` ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ## @@ -247,11 +247,23 @@ object DefaultSource { Option(schema) } + + + if (metaClient.getCommitsTimeline.filterCompletedInstants.countInstants() == 0) { new EmptyRelation(sqlContext, resolveSchema(metaClient, parameters, Some(schema))) } else if (isCdcQuery) { CDCRelation.getCDCRelation(sqlContext, metaClient, parameters) } else { + lazy val newHudiFileFormatUtils = if (!parameters.getOrElse(LEGACY_HUDI_PARQUET_FILE_FORMAT.key, +LEGACY_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean && (globPaths == null || globPaths.isEmpty) +&& parameters.getOrElse(REALTIME_MERGE.key(), REALTIME_MERGE.defaultValue()) +.equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)) { Review Comment: Is there any issue with `REALTIME_SKIP_MERGE_OPT_VAL` merge type? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -144,22 +146,41 @@ case class HoodieFileIndex(spark: SparkSession, // Prune the partition path by the partition filters // NOTE: Non-partitioned tables are assumed to consist from a single partition // encompassing the whole table -val prunedPartitions = listMatchingPartitionPaths(partitionFilters) +val prunedPartitions = if (shouldBroadcast) { + listMatchingPartitionPaths(convertFilterForTimestampKeyGenerator(metaClient, partitionFilters)) +} else { +
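The reviewer's simplification suggestion — folding the `hoodie.schema.on.read.enable` check into the lazy creation itself, so call sites only ever test `isEmpty` — could be sketched as below. This is illustrative only: the `use.legacy.format` key and the `String` stand-in for the utils object are placeholders, not Hudi's real names.

```java
import java.util.Map;
import java.util.Optional;

public class NewFormatGating {

    /**
     * Fold every precondition (legacy-format flag, schema-on-read, ...) into
     * the Optional itself, so callers never need a second hasSchemaOnRead check.
     * "use.legacy.format" is a hypothetical placeholder key.
     */
    static Optional<String> newFileFormatUtils(Map<String, String> params) {
        boolean useLegacyFormat = Boolean.parseBoolean(
            params.getOrDefault("use.legacy.format", "true"));
        boolean schemaOnRead = Boolean.parseBoolean(
            params.getOrDefault("hoodie.schema.on.read.enable", "false"));
        if (useLegacyFormat || schemaOnRead) {
            return Optional.empty();
        }
        return Optional.of("new-file-format-utils"); // stand-in for the real utils object
    }

    public static void main(String[] args) {
        // Schema-on-read enabled -> empty, so the caller falls back to the legacy relation.
        System.out.println(newFileFormatUtils(Map.of(
            "use.legacy.format", "false",
            "hoodie.schema.on.read.enable", "true")).isEmpty());   // prints true
        System.out.println(newFileFormatUtils(Map.of(
            "use.legacy.format", "false")).isPresent());           // prints true
    }
}
```

With this shape, each `case` branch in `DefaultSource` collapses to a single `newHudiFileFormatUtils.isEmpty` test, which is the simplification the review comment asks about.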
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664552015 ## CI report: * 7a0e3786173a124177688946c37011577fe97478 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18751) * 214bc9b6f9c0d522f4196cca12daf4756dc96439 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19044) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases
hudi-bot commented on PR #8697: URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664533286 ## CI report: * 7a0e3786173a124177688946c37011577fe97478 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18751) * 214bc9b6f9c0d522f4196cca12daf4756dc96439 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible
hudi-bot commented on PR #9337: URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664517346 ## CI report: * 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664516647 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * af768285ff85e68a361482806a5de1e0b9c272a9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19035) * 1875a19fd05f413373eb1f2400f390706d62725e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
[GitHub] [hudi] hudi-bot commented on pull request #9308: [HUDI-6606] Use record level index with SQL equality queries
hudi-bot commented on PR #9308: URL: https://github.com/apache/hudi/pull/9308#issuecomment-1664516962 ## CI report: * 5195761aea8a8d9cfc046978601d10a246507da8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19028) * ae0002b81c71f77e0c19aeb3e5872b0eb399ddb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19043)
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664516511 ## CI report: * 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
[hudi] branch asf-site updated: [DOCS] Update bootstrap page (#9338)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 8cd33fa3ad8 [DOCS] Update bootstrap page (#9338) 8cd33fa3ad8 is described below commit 8cd33fa3ad842afd900b1aef00e93fdf1bbe6e7f Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Thu Aug 3 12:05:54 2023 -0700 [DOCS] Update bootstrap page (#9338) * [HUDI-6112] Fix bugs in Doc Generation tool - Add Config Param in Description - Styling changes to fix table size and toc on side for better navigation - Bug fix in basic configs page to merge spark datasource related read and write configs * [DOCS] Update bootstrap page with configs --- website/docs/migration_guide.md| 49 +++--- website/src/theme/DocPage/index.js | 2 +- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md index 31cbdbcb956..ce36f3a38f0 100644 --- a/website/docs/migration_guide.md +++ b/website/docs/migration_guide.md @@ -3,6 +3,9 @@ title: Bootstrapping keywords: [ hudi, migration, use case] summary: In this page, we will discuss some available tools for migrating your existing table into a Hudi table last_modified_at: 2019-12-30T15:59:57-04:00 +toc: true +toc_min_heading_level: 2 +toc_max_heading_level: 4 --- Hudi maintains metadata such as commit timeline and indexes to manage a table. The commit timelines helps to understand the actions happening on a table as well as the current state of a table. Indexes are used by Hudi to maintain a record key to file id mapping to efficiently locate a record. At the moment, Hudi supports writing only parquet columnar formats. @@ -35,12 +38,20 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi There are a few options when choosing this approach. 
-**Option 1** -Use the HoodieStreamer tool. HoodieStreamer supports bootstrap with --run-bootstrap command line option. There are two types of bootstrap, -METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. -FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table. + Using Hudi Streamer + +Use the [Hudi Streamer](/docs/hoodie_deltastreamer#hudi-streamer) tool. HoodieStreamer supports bootstrap with +--run-bootstrap command line option. There are two types of bootstrap, METADATA_ONLY and FULL_RECORD. METADATA_ONLY will +generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. FULL_RECORD will +perform a full copy/rewrite of the data as a Hudi table. Additionally, one can choose selective partitions using regex +patterns to apply one of the above bootstrap modes. + +Here is an example for running FULL_RECORD bootstrap on all partitions that match the regex pattern `.*` and keeping +hive style partition with HoodieStreamer. This example configures +[hoodie.bootstrap.mode.selector](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselector) to +`org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector` which allows applying `FULL_RECORD` bootstrap +mode to selective partitions based on the regex pattern [hoodie.bootstrap.mode.selector.regex](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselectorregex) -Here is an example for running FULL_RECORD bootstrap and keeping hive style partition with HoodieStreamer.
``` spark-submit --master local \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ @@ -54,13 +65,14 @@ spark-submit --master local \ --hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \ --hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FILED} \ --hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \ ---hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \ --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \ +--hoodie-conf hoodie.bootstrap.mode.selector.regex='.*' \ --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true ``` -**Option 2** + Using Spark Datasource Writer + For huge tables, this could be as simple as : ```java for partition in [list of partitions in source table] { @@ -69,7 +81,12 @@ for partition in [list of partitions in source table] { } ``` -**Option 3** + Using Spark SQL CALL Procedure + +Refer to [Bootstrap procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more details. +
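The regex-based selection described in the updated page — partitions whose path matches `hoodie.bootstrap.mode.selector.regex` get the mode from `hoodie.bootstrap.mode.selector.regex.mode`, the rest get the other mode — can be sketched roughly as follows. This is an illustrative sketch, not Hudi's `BootstrapRegexModeSelector` implementation; `selectModes` and its signature are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexModeSelectorSketch {

    /**
     * Assign a bootstrap mode per partition path: paths matching the regex get
     * matchMode, the rest get the opposite mode. Hypothetical helper, not Hudi's API.
     */
    static Map<String, String> selectModes(List<String> partitionPaths,
                                           String regex, String matchMode) {
        Pattern pattern = Pattern.compile(regex);
        String otherMode = "FULL_RECORD".equals(matchMode) ? "METADATA_ONLY" : "FULL_RECORD";
        Map<String, String> modes = new LinkedHashMap<>();
        for (String path : partitionPaths) {
            modes.put(path, pattern.matcher(path).matches() ? matchMode : otherMode);
        }
        return modes;
    }

    public static void main(String[] args) {
        // Regex "2023/.*" with FULL_RECORD: 2023 partitions are fully rewritten,
        // everything else gets skeleton files only.
        Map<String, String> modes = selectModes(
            List.of("2023/08/01", "2023/08/02", "legacy/2019"), "2023/.*", "FULL_RECORD");
        System.out.println(modes);
    }
}
```

With the regex `'.*'` from the spark-submit example above, every partition matches, so all of them are bootstrapped in `FULL_RECORD` mode.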
[GitHub] [hudi] bhasudha merged pull request #9338: [DOCS] Update bootstrap page
bhasudha merged PR #9338: URL: https://github.com/apache/hudi/pull/9338
[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
amrishlal commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1283598675 ## pom.xml: ## @@ -98,8 +98,6 @@ ${fasterxml.spark3.version} ${fasterxml.spark3.version} ${fasterxml.spark3.version} - - Review Comment: @danny0405 For now, we seem to be shading all the jackson packages and hence it seems like a good idea to shade jackson JSR package as well to keep the versions across jackson packages consistent (otherwise testing may become problematic). But yes, as a future item we could look into not shading any of the jackson packages at all for spark versions (>= 3.1 I think) where Jackson is already being included. We would also need to confirm on the Flink side (as I was seeing some errors in Flink validate bundles tests). For now though we would like to get our integration test setup up and running as that is blocked by this PR.
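Keeping all shaded Jackson packages on one consistent version, as discussed above, is typically done with a maven-shade-plugin relocation covering the whole `com.fasterxml.jackson.` prefix, so the JSR-310 datatype module is relocated along with core/databind. This is a generic sketch; the shaded prefix shown is illustrative and not necessarily the one Hudi's bundles actually use:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Relocating by package prefix catches every Jackson artifact on the
           classpath (core, databind, datatype-jsr310, ...), keeping all shaded
           Jackson classes on one consistent version. -->
      <relocation>
        <pattern>com.fasterxml.jackson.</pattern>
        <shadedPattern>org.apache.hudi.com.fasterxml.jackson.</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

Conversely, dropping the relocation entirely (the future item mentioned in the comment) would mean relying on the Jackson version Spark or Flink already provides, which is why both engines' bundle-validation tests would need to pass first.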
[GitHub] [hudi] bhasudha commented on pull request #9338: [DOCS] Update bootstrap page
bhasudha commented on PR #9338: URL: https://github.com/apache/hudi/pull/9338#issuecomment-1664460516 ![Screenshot 2023-08-03 at 11 40 19 AM](https://github.com/apache/hudi/assets/2179254/2ef881a4-6537-4465-9a03-9e4fe7cf99f8) ![Screenshot 2023-08-03 at 11 40 28 AM](https://github.com/apache/hudi/assets/2179254/06ce18e7-66fd-4534-9a18-ab188ab84de8) ![Screenshot 2023-08-03 at 11 40 36 AM](https://github.com/apache/hudi/assets/2179254/99d417e7-630c-4bdb-b36a-6615874df51b)