Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
hudi-bot commented on PR #9761: URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773677812

## CI report:

* d680ff4d9daad14fea165c1520ef906606e991a5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20426)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
hudi-bot commented on PR #9761: URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773646118

## CI report:

* 60875bdf9343f5f48a862393ff5845063cf549d7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20322)
* d680ff4d9daad14fea165c1520ef906606e991a5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20426)
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
hudi-bot commented on PR #9761: URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773642737

## CI report:

* 60875bdf9343f5f48a862393ff5845063cf549d7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20322)
* d680ff4d9daad14fea165c1520ef906606e991a5 UNKNOWN
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1367634939

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

@@ -255,40 +255,47 @@ object DefaultSource {
       Option.empty
     }

-    (tableType, queryType, isBootstrappedTable) match {
-      case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) |
-           (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) |
-           (MERGE_ON_READ, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) =>
+    val isMultipleBaseFileFormatsEnabled = metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled
+    (tableType, queryType, isBootstrappedTable, isMultipleBaseFileFormatsEnabled) match {
+      case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |
+           (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false, true) =>
+        new HoodieMultiFileFormatCOWRelation(sqlContext, metaClient, parameters, userSchema, globPaths)
+      case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |

Review Comment: done
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1367634339

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMultiFileFormatRelation.scala:

@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.projectReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieMultiFileFormatRelation.{createPartitionedFile, inferFileFormat}
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieLogFile}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * Base split for all Hoodie multi-file format relations.
+ */
+case class HoodieMultiFileFormatSplit(baseFile: Option[PartitionedFile],
+                                      logFiles: List[HoodieLogFile]) extends HoodieFileSplit
+
+/**
+ * Base relation to handle table with multiple base file formats.
+ */
+abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext,
+                                                 override val metaClient: HoodieTableMetaClient,

Review Comment: Yes, we are using the file extension to figure out the base file format. However, the usual `BaseFileOnlyRelation` converts to `HadoopFsRelation`, which cannot be done for multiple file formats because the file format is not known at the time of creating the relation; it is only known when the relation is collecting file splits. Hence, a separate relation to handle multiple file formats. I will add this to the scaladoc of the class. Plus, I think a separate relation keeps the code cleaner and more maintainable.
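For readers following along, here is a minimal sketch of the extension-based probe described above. This is illustrative only; the PR's actual `inferFileFormat` helper may differ, and the helper name and error handling here are assumptions:

```scala
import org.apache.hudi.common.model.HoodieFileFormat

// Illustrative sketch: infer the base file format of a base file from its file
// name extension (e.g. ".parquet", ".orc", ".hfile"). Assumes every base file
// carries the extension of exactly one registered HoodieFileFormat.
def inferFileFormat(fileName: String): HoodieFileFormat =
  HoodieFileFormat.values()
    .find(format => fileName.endsWith(format.getFileExtension))
    .getOrElse(throw new IllegalArgumentException(
      s"Cannot infer a base file format from file name: $fileName"))
```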
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1367633031

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:

@@ -516,6 +515,99 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
    */
   def updatePrunedDataSchema(prunedSchema: StructType): Relation

+  protected def createBaseFileReaders(tableSchema: HoodieTableSchema,
+                                      requiredSchema: HoodieTableSchema,
+                                      requestedColumns: Array[String],
+                                      requiredFilters: Seq[Filter],
+                                      optionalFilters: Seq[Filter] = Seq.empty,
+                                      baseFileFormat: HoodieFileFormat = tableConfig.getBaseFileFormat): HoodieMergeOnReadBaseFileReaders = {
+    val (partitionSchema, dataSchema, requiredDataSchema) =
+      tryPrunePartitionColumns(tableSchema, requiredSchema)
+
+    val fullSchemaReader = createBaseFileReader(
+      spark = sqlContext.sparkSession,
+      partitionSchema = partitionSchema,
+      dataSchema = dataSchema,
+      requiredDataSchema = dataSchema,
+      // This file-reader is used to read base file records, subsequently merging them with the records
+      // stored in delta-log files. As such, we have to read _all_ records from the base file, while avoiding
+      // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that
+      // we combine them correctly);
+      // As such only required filters could be pushed-down to such reader
+      filters = requiredFilters,
+      options = optParams,
+      // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+      // to configure Parquet reader appropriately
+      hadoopConf = embedInternalSchema(new Configuration(conf), internalSchemaOpt),
+      baseFileFormat = baseFileFormat
+    )
+
+    val requiredSchemaReader = createBaseFileReader(
+      spark = sqlContext.sparkSession,
+      partitionSchema = partitionSchema,
+      dataSchema = dataSchema,
+      requiredDataSchema = requiredDataSchema,
+      // This file-reader is used to read base file records, subsequently merging them with the records
+      // stored in delta-log files. As such, we have to read _all_ records from the base file, while avoiding
+      // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that
+      // we combine them correctly);
+      // As such only required filters could be pushed-down to such reader
+      filters = requiredFilters,
+      options = optParams,
+      // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+      // to configure Parquet reader appropriately
+      hadoopConf = embedInternalSchema(new Configuration(conf), requiredDataSchema.internalSchema),
+      baseFileFormat = baseFileFormat
+    )
+
+    // Check whether fields required for merging were also requested to be fetched
+    // by the query:
+    //    - In case they were, there's no optimization we could apply here (we will have
+    //      to fetch such fields)
+    //    - In case they were not, we will provide 2 separate file-readers
+    //        a) One which would be applied to file-groups w/ delta-logs (merging)
+    //        b) One which would be applied to file-groups w/ no delta-logs or
+    //           in case query-mode is skipping merging
+    val mandatoryColumns = mandatoryFields.map(HoodieAvroUtils.getRootLevelFieldName)
+    if (mandatoryColumns.forall(requestedColumns.contains)) {
+      HoodieMergeOnReadBaseFileReaders(
+        fullSchemaReader = fullSchemaReader,
+        requiredSchemaReader = requiredSchemaReader,
+        requiredSchemaReaderSkipMerging = requiredSchemaReader
+      )
+    } else {
+      val prunedRequiredSchema = {
+        val unusedMandatoryColumnNames = mandatoryColumns.filterNot(requestedColumns.contains)
+        val prunedStructSchema =
+          StructType(requiredDataSchema.structTypeSchema.fields
+            .filterNot(f => unusedMandatoryColumnNames.contains(f.name)))
+
+        HoodieTableSchema(prunedStructSchema, convertToAvroSchema(prunedStructSchema, tableName).toString)
+      }
+
+      val requiredSchemaReaderSkipMerging = createBaseFileReader(
+        spark = sqlContext.sparkSession,
+        partitionSchema = partitionSchema,
+        dataSchema = dataSchema,
+        requiredDataSchema = prunedRequiredSchema,
+        // This file-reader is only used in cases when no merging is performed, therefore it's safe to push
+        // down these filters to the base file readers
+        filters = requiredFilters ++ optionalFilters,
+        options = optParams,
+        // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+        // to configure Parquet reader appropriately
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1367623109

## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java:

@@ -553,7 +554,12 @@ public static void writeDataAsBatch(
    * Initializes a writing pipeline with given configuration.
    */
   public static TestFunctionWrapper getWritePipeline(String basePath, Configuration conf) throws Exception {
-    if (OptionsResolver.isAppendMode(conf)) {
+    if (OptionsResolver.isBulkInsertOperation(conf)) {
+      if (conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT)) {
+        throw new UnsupportedOperationException("Harness test does not support bulk insert sort yet");

Review Comment: Is the sort disabled by default? If so, we can remove this check.
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1367622937

## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java:

@@ -305,6 +305,34 @@ public TestHarness checkpointComplete(long checkpointId) {
       return this;
     }

+    /**
+     * Flush data using endInput. Asserts the commit would fail.
+     */
+    public void commitAsBatchThrows(Class cause, String msg) {
+      // this triggers the data write and event send
+      this.pipeline.endInput();
+      final OperatorEvent nextEvent = this.pipeline.getNextEvent();
+      this.pipeline.getCoordinator().handleEventFromOperator(0, nextEvent);
+      assertTrue(this.pipeline.getCoordinatorContext().isJobFailed(), "Job should have been failed");
+      Throwable throwable = this.pipeline.getCoordinatorContext().getJobFailureReason().getCause();
+      assertThat(throwable, instanceOf(cause));
+      assertThat(throwable.getMessage(), containsString(msg));
+    }
+
+    /**
+     * Flush data using endInput.
+     */
+    public TestHarness commitAsBatch() {
+      // this triggers the data write and event send
+      this.pipeline.endInput();
+      final OperatorEvent nextEvent = this.pipeline.getNextEvent();
+      this.pipeline.getCoordinator().handleEventFromOperator(0, nextEvent);
+      if (this.pipeline.getCoordinatorContext().isJobFailed()) {
+        System.err.println(this.pipeline.getCoordinatorContext().getJobFailureReason().getCause());

Review Comment: We do not print in tests.
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1367620324

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:

@@ -387,7 +387,7 @@ private void addEventToBuffer(WriteMetadataEvent event) {
   private void startInstant() {
     // refresh the last txn metadata
-    this.writeClient.preTxn(this.metaClient);
+    this.writeClient.preTxn(WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)), this.metaClient);
     // put the assignment in front of metadata generation,

Review Comment: The operation type is included in the `TableState`.
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1367619653

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
     return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }

-  public boolean isNonBlockingConcurrencyControl() {
-    return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-        && getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-        && isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+    // Skip conflict resolution for non-bulk_insert operations when using non-blocking concurrency control
+    if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+      if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && isSimpleBucketIndex()) {
+        return operationType == WriteOperationType.UNKNOWN || operationType == WriteOperationType.BULK_INSERT;
+      } else {
+        return true;

Review Comment: Can we simplify the logic as:

```java
return is_nb_cc ? operation == BULK_INSERT : true;
```
Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]
danny0405 commented on code in PR #9878: URL: https://github.com/apache/hudi/pull/9878#discussion_r1367616667

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:

@@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
       // bootstrap
       final DataStream hoodieRecordDataStream = Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite);
+      // write pipeline
       pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
-      // compaction
+
+      // insert cluster mode
+      if (OptionsResolver.isInsertClusterMode(conf)) {
+        return Pipelines.clean(conf, pipeline);
+      }

Review Comment: Can you check the Travis test failures?
[jira] [Closed] (HUDI-6965) Flink Quickstart Restructuring
[ https://issues.apache.org/jira/browse/HUDI-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-6965.
----------------------------
    Resolution: Fixed

Fixed via asf-site: 46445fa14aa4bfc9d364f228278753870e5c21dc

> Flink Quickstart Restructuring
> ------------------------------
>
>                 Key: HUDI-6965
>                 URL: https://issues.apache.org/jira/browse/HUDI-6965
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: docs
>            Reporter: Danny Chen
>            Priority: Major
>             Fix For: 1.0.0
>
[hudi] branch asf-site updated: [HUDI-6965][Docs] Flink Quickstart Revamp (#9897)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 46445fa14aa  [HUDI-6965][Docs] Flink Quickstart Revamp (#9897)
46445fa14aa is described below

commit 46445fa14aa4bfc9d364f228278753870e5c21dc
Author: Aditya Goenka <63430370+ad1happy...@users.noreply.github.com>
AuthorDate: Sat Oct 21 08:32:18 2023 +0530

    [HUDI-6965][Docs] Flink Quickstart Revamp (#9897)
---
 website/docs/flink-quick-start-guide.md | 313 ++--
 1 file changed, 176 insertions(+), 137 deletions(-)

diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index acec412e41f..cfba29ad8db 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -7,20 +7,9 @@ import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';

 This page introduces Flink-Hudi integration. We can feel the unique charm of how Flink brings in the power of streaming into Hudi.
-This guide helps you quickly start using Flink on Hudi, and learn different modes for reading/writing Hudi by Flink:
+This guide helps you quickly start using Flink on Hudi, and learn different modes for reading/writing Hudi by Flink.

-- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Global Configuration](/docs/next/flink_tuning#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](/docs/next/flink_tuning#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append Mode](/docs/hoodie_streaming_ingestion#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](/docs/querying_data#streaming-query) and [Incremental Query](/docs/querying_data#incremental-query).
-- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and [Write Rate Limit](/docs/next/flink_tuning#write-rate-limit).
-- **Optimization**: Offline compaction is supported [Offline Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive Query](/docs/syncing_metastore#flink-setup), [Presto Query](/docs/querying_data#prestodb).
-- **Catalog**: A Hudi specific catalog is supported: [Hudi Catalog](/docs/querying_data/#hudi-catalog).
-
-## Quick Start
-
-### Setup
+## Setup

 We use the [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sqlclient/) because it's a good quick start tool for SQL users.

-Step.1 download Flink jar
-
-Hudi works with both Flink 1.13, Flink 1.14, Flink 1.15, Flink 1.16 and Flink 1.17. You can follow the
-instructions [here](https://flink.apache.org/downloads) for setting up Flink. Then choose the desired Hudi-Flink bundle
-jar to work with different Flink and Scala versions:
-
-- `hudi-flink1.13-bundle`
-- `hudi-flink1.14-bundle`
-- `hudi-flink1.15-bundle`
-- `hudi-flink1.16-bundle`
-- `hudi-flink1.17-bundle`
+### Download Flink

-Step.2 start Flink cluster
-Start a standalone Flink cluster within hadoop environment.
-Before you start up the cluster, we suggest to config the cluster as follows:
+Hudi works with Flink 1.13, Flink 1.14, Flink 1.15, Flink 1.16 and Flink 1.17. You can follow the
+instructions [here](https://flink.apache.org/downloads) for setting up Flink.

-- in `$FLINK_HOME/conf/flink-conf.yaml`, add config option `taskmanager.numberOfTaskSlots: 4`
-- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations according to the characteristics of your task](/docs/next/flink_tuning#global-configurations)
-- in `$FLINK_HOME/conf/workers`, add item `localhost` as 4 lines so that there are 4 workers on the local cluster
-
-Now starts the cluster:
+### Start Flink cluster
+Start a standalone Flink cluster within hadoop environment. In case we are trying on local setup, then we could download hadoop binaries and set HADOOP_HOME.

 ```bash
 # HADOOP_HOME is your hadoop root directory after unpack the binary package.
@@ -63,21 +38,34 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

 # Start the Flink standalone cluster
 ./bin/start-cluster.sh
 ```
-Step.3 start Flink SQL client
+### Start Flink SQL client

 Hudi supports packaged bundle jar for Flink, which should be loaded in the Flink SQL Client when
[jira] [Created] (HUDI-6965) Flink Quickstart Restructuring
Danny Chen created HUDI-6965:
--------------------------------

             Summary: Flink Quickstart Restructuring
                 Key: HUDI-6965
                 URL: https://issues.apache.org/jira/browse/HUDI-6965
             Project: Apache Hudi
          Issue Type: Improvement
          Components: docs
            Reporter: Danny Chen
             Fix For: 1.0.0
Re: [PR] [Docs] Flink Quickstart Restructuring [hudi]
danny0405 merged PR #9897: URL: https://github.com/apache/hudi/pull/9897
Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p
danny0405 commented on issue #8614: URL: https://github.com/apache/hudi/issues/8614#issuecomment-1773622757

@ad1happy2go Would you mind reproducing with the given cmd from @pushpavanthar?
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1367610034

## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:

@@ -821,6 +820,8 @@ public static class PropertyBuilder {
     private String metadataPartitions;
     private String inflightMetadataPartitions;
     private String secondaryIndexesMetadata;
+    private Boolean multipleBaseFileFormatsEnabled;
+    private String baseFileFormats;

Review Comment: Removed; only keeping the boolean config.
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1367606971

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:

@@ -866,11 +866,11 @@ public void validateInsertSchema() throws HoodieInsertException {
   }

   public HoodieFileFormat getBaseFileFormat() {
-    return metaClient.getTableConfig().getBaseFileFormat();
-  }
-
-  public HoodieFileFormat getLogFileFormat() {
-    return metaClient.getTableConfig().getLogFileFormat();
+    HoodieTableConfig tableConfig = metaClient.getTableConfig();
+    if (tableConfig.contains(HoodieTableConfig.BASE_FILE_FORMAT)) {
+      return metaClient.getTableConfig().getBaseFileFormat();
+    }
+    return config.getBaseFileFormat();

Review Comment: Checking the table config first for backward compatibility.
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1773523545

## CI report:

* 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
* c580c03a82bb25a639e0659b779cdf10acfaf33e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20425)
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1773483151

## CI report:

* 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
* f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420)
* c580c03a82bb25a639e0659b779cdf10acfaf33e UNKNOWN
[jira] [Updated] (HUDI-6961) Deletes with custom delete field not working with DefaultHoodieRecordPayload
[ https://issues.apache.org/jira/browse/HUDI-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6961:
----------------------------
    Story Points: 5  (was: 3)

> Deletes with custom delete field not working with DefaultHoodieRecordPayload
> ----------------------------------------------------------------------------
>
>                 Key: HUDI-6961
>                 URL: https://issues.apache.org/jira/browse/HUDI-6961
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1
>
> When configuring a custom delete key and delete marker with DefaultHoodieRecordPayload, writing fails when there are deletes in the batch:
> {code:java}
> Error for key:HoodieKey { recordKey=0 partitionPath=} is java.util.NoSuchElementException: No value present in Option
>   at org.apache.hudi.common.util.Option.get(Option.java:89)
>   at org.apache.hudi.common.model.HoodieAvroRecord.prependMetaFields(HoodieAvroRecord.java:132)
>   at org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:144)
>   at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:180)
>   at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98)
>   at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42)
>   at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:69)
>   at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>   at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>   at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>   at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>   at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
>   at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
>   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
>   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}
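For reference, the failing setup the ticket describes is typically configured along these lines. This is a minimal sketch: `df`, the record key/precombine fields, the `op` delete column, its `"d"` marker value, and the base path are all placeholders, assuming the standard `hoodie.payload.delete.field`/`hoodie.payload.delete.marker` options of DefaultHoodieRecordPayload:

```scala
// Write a batch that mixes upserts and deletes; rows whose `op` column equals
// "d" are treated as deletes by DefaultHoodieRecordPayload.
df.write.format("hudi")
  .option("hoodie.table.name", "test_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
  .option("hoodie.payload.delete.field", "op")   // custom delete key
  .option("hoodie.payload.delete.marker", "d")   // custom delete marker
  .mode("append")
  .save("/tmp/hudi/test_table")
```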
Re: [PR] Row writer optimization for bulk insert [hudi]
hudi-bot commented on PR #9852: URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773365413

## CI report:

* 6e9970649492dd16f26a02d5842956d4212ec31d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20423)
[jira] [Closed] (HUDI-6952) Skip reading the uncommitted log files for log reader
[ https://issues.apache.org/jira/browse/HUDI-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-6952.
----------------------------
    Resolution: Fixed

Fixed via master branch: c9c7c1019f5ca903ad3782f79370663f5ae26cb9

> Skip reading the uncommitted log files for log reader
> -----------------------------------------------------
>
>                 Key: HUDI-6952
>                 URL: https://issues.apache.org/jira/browse/HUDI-6952
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: reader-core
>            Reporter: Danny Chen
>            Assignee: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
[hudi] branch master updated: [HUDI-6952] Skip reading the uncommitted log files for log reader (#9879)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c9c7c1019f5  [HUDI-6952] Skip reading the uncommitted log files for log reader (#9879)
c9c7c1019f5 is described below

commit c9c7c1019f5ca903ad3782f79370663f5ae26cb9
Author: Danny Chan
AuthorDate: Sat Oct 21 04:21:14 2023 +0800

    [HUDI-6952] Skip reading the uncommitted log files for log reader (#9879)

    This is to avoid potential exceptions when the reader is processing an uncommitted log file while the cleaning or rollback service removes the log file.
---
 .../hudi/client/TestCompactionAdminClient.java     | 14 ++--
 .../table/log/AbstractHoodieLogRecordReader.java   | 24 ++-
 .../table/timeline/CompletionTimeQueryView.java    | 19 +++--
 .../table/timeline/HoodieDefaultTimeline.java      | 57 +++
 .../table/view/AbstractTableFileSystemView.java    | 84 +-
 .../table/view/TestHoodieTableFileSystemView.java  | 23 +++---
 6 files changed, 129 insertions(+), 92 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
index 4177297a6ba..20677bf7c85 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
@@ -20,7 +20,6 @@ package org.apache.hudi.client;

 import org.apache.hudi.client.CompactionAdminClient.ValidationOpResult;
 import org.apache.hudi.common.model.CompactionOperation;
-import org.apache.hudi.common.model.HoodieFileGroup;
 import org.apache.hudi.common.model.HoodieLogFile;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
@@ -208,7 +207,6 @@ public class TestCompactionAdminClient extends HoodieClientTestBase {
     final HoodieTableFileSystemView newFsView =
         new HoodieTableFileSystemView(metaClient, metaClient.getCommitsAndCompactionTimeline());
     Set<String> commitsWithDataFile = CollectionUtils.createSet("000", "004");
-    Set<String> commitsWithLogAfterCompactionRequest = CollectionUtils.createSet("000", "002");
     // Expect each file-slice whose base-commit is same as compaction commit to contain no new Log files
     newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0], compactionInstant, true)
         .filter(fs -> fs.getBaseInstantTime().compareTo(compactionInstant) <= 0)
@@ -218,16 +216,12 @@ public class TestCompactionAdminClient extends HoodieClientTestBase {
           } else {
             assertFalse(fs.getBaseFile().isPresent(), "No Data file should be present");
           }
-          if (commitsWithLogAfterCompactionRequest.contains(fs.getBaseInstantTime())) {
-            assertEquals(4, fs.getLogFiles().count(), "Has Log Files");
-          } else {
-            assertEquals(2, fs.getLogFiles().count(), "Has Log Files");
-          }
+          assertEquals(2, fs.getLogFiles().count(), "Has Log Files");
         });

     // Ensure same number of log-files before and after renaming per fileId
     Map<String, Long> fileIdToCountsAfterRenaming =
-        newFsView.getAllFileGroups(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0]).flatMap(HoodieFileGroup::getAllFileSlices)
+        newFsView.getLatestMergedFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0], ingestionInstant)
            .filter(fs -> fs.getBaseInstantTime().equals(ingestionInstant))
            .map(fs -> Pair.of(fs.getFileId(), fs.getLogFiles().count()))
            .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
@@ -246,7 +240,7 @@ public class TestCompactionAdminClient extends HoodieClientTestBase {
     ensureValidCompactionPlan(compactionInstant);

     // Check suggested rename operations
-    metaClient = HoodieTableMetaClient.builder().setConf(metaClient.getHadoopConf()).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build();
+    metaClient.reloadActiveTimeline();

     // Log files belonging to file-slices created because of compaction request must be renamed

@@ -277,7 +271,7 @@ public class TestCompactionAdminClient extends HoodieClientTestBase {

     // Ensure same number of log-files before and after renaming per fileId
     Map<String, Long> fileIdToCountsAfterRenaming =
-        newFsView.getAllFileGroups(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0]).flatMap(HoodieFileGroup::getAllFileSlices)
+        newFsView.getLatestMergedFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0], ingestionInstant)
            .filter(fs -> fs.getBaseInstantTime().equals(ingestionInstant))
            .filter(fs -> fs.getFileId().equals(op.getF
Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
danny0405 merged PR #9879: URL: https://github.com/apache/hudi/pull/9879
Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
danny0405 commented on PR #9879: URL: https://github.com/apache/hudi/pull/9879#issuecomment-1773345831

Test has passed: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=20416&view=results
Re: [PR] Row writer optimization for bulk insert [hudi]
hudi-bot commented on PR #9852: URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773317660

## CI report:

* c9d09be26e354d8fb4efdc51e1e1e9a6da7d2031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20413)
* 6e9970649492dd16f26a02d5842956d4212ec31d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20423)
Re: [PR] Row writer optimization for bulk insert [hudi]
hudi-bot commented on PR #9852: URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773308489

## CI report:

* c9d09be26e354d8fb4efdc51e1e1e9a6da7d2031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20413)
* 6e9970649492dd16f26a02d5842956d4212ec31d UNKNOWN
Re: [PR] [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy [hudi]
ksmou closed pull request #9126: [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy URL: https://github.com/apache/hudi/pull/9126
Re: [PR] [HUDI-6565] Spark offline compaction add failed retry mechanism [hudi]
ksmou closed pull request #9229: [HUDI-6565] Spark offline compaction add failed retry mechanism URL: https://github.com/apache/hudi/pull/9229
Re: [I] [SUPPORT] Compaction error [hudi]
fearlsgroove commented on issue #9885: URL: https://github.com/apache/hudi/issues/9885#issuecomment-1773063344

Yep, I can read the data successfully. The schema shows the only decimal value as a (12,2) value, which is what I expect. Is there a way to try to read whichever records in the logs due for compaction might be causing this issue?
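One hedged way to narrow this down (a sketch, not from the thread): compare a read-optimized read, which skips the pending log files, against a snapshot read, which merges them in. If only the snapshot read fails, the offending records are in the log files awaiting compaction. `basePath` is a placeholder:

```scala
// Read-optimized: base files only, no log-file merge.
val roDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Snapshot: merges pending log files into the base files on read.
val snapshotDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)
```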
Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]
njalan commented on issue #9751: URL: https://github.com/apache/hudi/issues/9751#issuecomment-1773018752

@ad1happy2go May I know if there are any updates from you? If we can't reduce the object listing, can we cache this metadata on the driver?
Re: [PR] [DOCS] Remove /next reference from already published doc versions [hudi]
bhasudha merged PR #9893: URL: https://github.com/apache/hudi/pull/9893
[jira] [Updated] (HUDI-6964) Fix HoodieStreamer writing json value issue
[ https://issues.apache.org/jira/browse/HUDI-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianghu Wang updated HUDI-6964:
-------------------------------
    Summary: Fix HoodieStreamer writing json value issue  (was: Fix writing json value issue)

> Fix HoodieStreamer writing json value issue
> -------------------------------------------
>
>                 Key: HUDI-6964
>                 URL: https://issues.apache.org/jira/browse/HUDI-6964
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: deltastreamer
>            Reporter: Xianghu Wang
>            Assignee: Xianghu Wang
>            Priority: Major
>             Fix For: 0.15.0
>
> when value is a json string, HoodieDeltaStreamer will write it as a map string, which is inconsistent with the original data
[jira] [Updated] (HUDI-6964) Fix writing json value issue
[ https://issues.apache.org/jira/browse/HUDI-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianghu Wang updated HUDI-6964:
-------------------------------
    Summary: Fix writing json value issue  (was: Fix writing json format issue)

> Fix writing json value issue
> ----------------------------
>
>                 Key: HUDI-6964
>                 URL: https://issues.apache.org/jira/browse/HUDI-6964
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: deltastreamer
>            Reporter: Xianghu Wang
>            Assignee: Xianghu Wang
>            Priority: Major
>             Fix For: 0.15.0
>
> when value is a json string, HoodieDeltaStreamer will write it as a map string, which is inconsistent with the original data
[jira] [Created] (HUDI-6964) Fix writing json format issue
Xianghu Wang created HUDI-6964:
----------------------------------

             Summary: Fix writing json format issue
                 Key: HUDI-6964
                 URL: https://issues.apache.org/jira/browse/HUDI-6964
             Project: Apache Hudi
          Issue Type: Bug
          Components: deltastreamer
            Reporter: Xianghu Wang
            Assignee: Xianghu Wang
             Fix For: 0.15.0


when value is a json string, HoodieDeltaStreamer will write it as a map string, which is inconsistent with the original data
[PR] [Docs] Flink Quickstart Restructuring [hudi]
ad1happy2go opened a new pull request, #9897: URL: https://github.com/apache/hudi/pull/9897

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772748273

## CI report:

* 72489cb31221dfb8152f683cd9140b401134c9ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.blo… [hudi]
hudi-bot commented on PR #9886: URL: https://github.com/apache/hudi/pull/9886#issuecomment-1772665871

## CI report:

* 9456e57d5d87ce8e2db946c03b2075e2d45b6826 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20419)
Re: [I] [SUPPORT] KryoException while using spark-hudi [hudi]
ad1happy2go commented on issue #9891: URL: https://github.com/apache/hudi/issues/9891#issuecomment-1772606850

@akshayakp97 Did you try 0.12.3?
Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p
pushpavanthar commented on issue #8614: URL: https://github.com/apache/hudi/issues/8614#issuecomment-1772585088

@danny0405 we are facing the same issue with Hudi 0.13.1 and Spark 3.2.1 and 3.3.2. Below is the command we use to run; the same command used to work fine with Hudi 0.11.1.

`spark-submit --master yarn --packages org.apache.spark:spark-avro_2.12:3.2.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.13.1,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.1,org.apache.hudi:hudi-aws-bundle:0.13.1 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --conf spark.executor.cores=5 --conf spark.driver.memory=3200m --conf spark.driver.memoryOverhead=800m --conf spark.executor.memoryOverhead=1400m --conf spark.executor.memory=14600m --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=21 --conf spark.scheduler.mode=FAIR --conf spark.task.maxFailures=5 --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.yarn.max.executor.failures=5 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --conf spark.sql.catalogImplementation=hive --deploy-mode client s3://bucket_name/custom_jar-2.0.jar --hoodie-conf hoodie.parquet.compression.codec=snappy --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=100 --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.HoodieIncrSource --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://bucket_name/ml_attributes/features --hoodie-conf hoodie.metrics.on=true --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.in --hoodie-conf hoodie.metrics.pushgateway.port=443 --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false --hoodie-conf hoodie.metrics.pushgateway.job.name=hudi_transformed_features_accounts_hudi --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false --hoodie-conf hoodie.metadata.enable=true --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi --target-base-path s3://bucket_name_transformed/features_accounts --target-table features_accounts --enable-sync --hoodie-conf hoodie.datasource.hive_sync.database=hudi_transformed --hoodie-conf hoodie.datasource.hive_sync.table=features_accounts --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool --hoodie-conf hoodie.datasource.write.recordkey.field=id,pos --hoodie-conf hoodie.datasource.write.precombine.field=id --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator --hoodie-conf hoodie.datasource.write.partitionpath.field=created_at_dt --hoodie-conf hoodie.datasource.hive_sync.partition_fields=created_at_dt --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ss.SS'Z',yyyy-MM-dd' 'HH:mm:ss.SS,yyyy-MM-dd' 'HH:mm:ss,yyyy-MM-dd'T'HH:mm:ss'Z'" --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd --source-ordering-field id --hoodie-conf secret.key.name=some-secret --hoodie-conf transformer.decrypt.cols=features_json --hoodie-conf transformer.uncompress.cols=false --hoodie-conf transformer.jsonToStruct.column=features_json --hoodie-conf transformer.normalize.column=features_json.accounts --hoodie-conf transformer.copy.fields=created_at,created_at_dt --transformer-class com.custom.transform.DecryptTransformer,com.custom.transform.JsonToStructTypeTransformer,com.custom.transform.NormalizeArrayTransformer,com.custom.transform.FlatteningTransformer,com.custom.transform.CopyFieldTransformer`
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772576148

## CI report:

* 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
* 72489cb31221dfb8152f683cd9140b401134c9ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772497032

## CI report:

* f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
* 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
* 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
* 72489cb31221dfb8152f683cd9140b401134c9ee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772496955 ## CI report: * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]
KnightChess commented on issue #9615: URL: https://github.com/apache/hudi/issues/9615#issuecomment-1772439298 > 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: another attempt succeeded > 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: Stage finished @boneanxs yes, those `try to kill` log lines are present. There are two issues here: the iterator in compaction is uninterruptible, so the task cannot be killed immediately; but fixing that alone does not completely solve the problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
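A rough sketch of the first part of the fix being discussed, i.e. letting the compaction-side iterator observe Spark's kill flag instead of merging on uninterrupted (the wrapper class is illustrative, not the actual Hudi patch):

```java
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.TaskKilledException;

/** Iterator wrapper that honors Spark's task-kill signal between records. */
public class InterruptibleIterator<T> implements Iterator<T> {
  private final Iterator<T> delegate;

  public InterruptibleIterator(Iterator<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean hasNext() {
    TaskContext ctx = TaskContext.get();
    // Without a check like this, a speculative attempt keeps merging records
    // (and writing its parquet file) long after "trying to kill task" is logged.
    if (ctx != null && ctx.isInterrupted()) {
      throw new TaskKilledException();
    }
    return delegate.hasNext();
  }

  @Override
  public T next() {
    return delegate.next();
  }
}
```

As the comment above notes, this alone is not a complete fix: the partially written file from the killed attempt still has to be dealt with.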
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772421485 ## CI report: * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415) * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418) * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421) * 72489cb31221dfb8152f683cd9140b401134c9ee UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772408526 ## CI report: * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415) * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418) * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]
hudi-bot commented on PR #9879: URL: https://github.com/apache/hudi/pull/9879#issuecomment-1772408300 ## CI report: * 94301a15c9f75355c0ebf5bab3baf6226820ac42 UNKNOWN * ebc4df431089161ff933f125829bc7975bb04509 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20416) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772344024 ## CI report: * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415) * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418) * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772331421 ## CI report: * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415) * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418) * 7a44dd88d5a5c72d3b54fb66412b82113e803577 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
hudi-bot commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1772317834 ## CI report: * 6eb38c5b78e4f53a7d21c15f5bb63cd528bc5132 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20414) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772317758 ## CI report: * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417) * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
beyond1920 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366628134 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java: ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.sink.utils; + +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsResolver; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.sink.StreamWriteOperatorCoordinator; +import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper; +import org.apache.hudi.sink.bulk.BulkInsertWriteFunction; +import org.apache.hudi.sink.bulk.RowDataKeyGen; +import org.apache.hudi.sink.event.WriteMetadataEvent; +import org.apache.hudi.util.AvroSchemaConverter; +import org.apache.hudi.util.StreamerUtil; + +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.runtime.io.disk.iomanager.IOManager; +import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync; +import org.apache.flink.runtime.jobgraph.OperatorID; +import org.apache.flink.runtime.memory.MemoryManager; +import org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext; +import org.apache.flink.runtime.operators.coordination.OperatorEvent; +import org.apache.flink.runtime.operators.testutils.MockEnvironment; +import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder; +import org.apache.flink.streaming.api.graph.StreamConfig; +import org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway; +import org.apache.flink.streaming.runtime.tasks.StreamTask; +import org.apache.flink.streaming.util.MockStreamTaskBuilder; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.types.logical.RowType; + +import java.util.HashMap; +import java.util.Map; +import java.util.NoSuchElementException; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.TimeUnit; + +/** + * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance for testing. 
+ * + * @param <I> Input type + */ +public class BulkInsertFunctionWrapper<I> implements TestFunctionWrapper<I> { + private final Configuration conf; + private final RowType rowType; + private final RowType rowTypeWithFileId; + + private final MockStreamingRuntimeContext runtimeContext; + private final MockOperatorEventGateway gateway; + private final MockOperatorCoordinatorContext coordinatorContext; + private final StreamWriteOperatorCoordinator coordinator; + private final MockStateInitializationContext stateInitializationContext; + + private final boolean asyncClustering; + private ClusteringFunctionWrapper clusteringFunctionWrapper; + + /** + * Bulk insert write function. + */ + private BulkInsertWriteFunction<RowData> writeFunction; + private MapFunction<RowData, RowData> mapFunction; + private Map<String, String> bucketIdToFileId; + + public BulkInsertFunctionWrapper(String tablePath, Configuration conf) throws Exception { +IOManager ioManager = new IOManagerAsync(); +MockEnvironment environment = new MockEnvironmentBuilder() +.setTaskName("mockTask") +.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE) +.setIOManager(ioManager) +.build(); +this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, environment); +this.gateway = new MockOperatorEventGateway(); +this.conf = conf; +this.rowType = (RowType) AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType(); +this.rowTypeWithFileId = BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType); +// one function +this.coordinatorContext = new MockOperatorCoordinatorContext(new OperatorID(), 1); +this.coordinator = new StreamWriteOperatorCoordinator(conf, this.coordinatorContext); +this.stateInitializationContext = new MockStateInitializationContext(); +this.asyncClustering = OptionsResolver.needsAsyncClustering(conf); +StreamConfig streamConfi
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
beyond1920 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366622645 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } - public boolean isNonBlockingConcurrencyControl() { -return getTableType().equals(HoodieTableType.MERGE_ON_READ) -&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() -&& isSimpleBucketIndex(); + public boolean needResolveWriteConflict(WriteOperationType operationType) { +// Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && isSimpleBucketIndex()) { +return operationType == WriteOperationType.UNKNOWN || operationType == WriteOperationType.BULK_INSERT; + } else { +return true; Review Comment: `UPSERT` would return `false` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
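Spelling the new predicate out as a small self-contained truth table (my reading of the diff; assumes the OCC write-concurrency mode check has already passed):

```java
import org.apache.hudi.common.model.WriteOperationType;

public class NbccConflictRuleSketch {
  /** Mirrors the patched predicate, with the OCC-mode check factored out. */
  static boolean needResolveWriteConflict(boolean morWithSimpleBucketIndex,
                                          WriteOperationType op) {
    if (morWithSimpleBucketIndex) {
      // NB-CC path: only bulk_insert (or an unknown operation, conservatively)
      // still goes through conflict resolution.
      return op == WriteOperationType.UNKNOWN || op == WriteOperationType.BULK_INSERT;
    }
    return true; // plain OCC: always resolve conflicts
  }

  public static void main(String[] args) {
    System.out.println(needResolveWriteConflict(true, WriteOperationType.UPSERT));      // false
    System.out.println(needResolveWriteConflict(true, WriteOperationType.BULK_INSERT)); // true
    System.out.println(needResolveWriteConflict(false, WriteOperationType.UPSERT));     // true
  }
}
```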
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772257235 ## CI report: * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406) * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417) * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772245210 ## CI report: * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406) * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417) * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-6946) Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata
[ https://issues.apache.org/jira/browse/HUDI-6946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1063#comment-1063 ] xi chaomin edited comment on HUDI-6946 at 10/20/23 7:38 AM: This is caused by the format of the value in the record key and in the col stats. !WX20231019-094414.png! was (Author: xichaomin): This is caused by the format of the value in record key and col stats are different. !WX20231019-094414.png! > Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata > -- > > Key: HUDI-6946 > URL: https://issues.apache.org/jira/browse/HUDI-6946 > Project: Apache Hudi > Issue Type: Bug > Components: metadata, writer-core >Affects Versions: 0.13.1, 0.12.3, 0.14.0 >Reporter: Aditya Goenka >Assignee: xi chaomin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.4, 0.14.1, 0.13.2 > > Attachments: WX20231019-094414.png > > > Github Issue - > [https://github.com/apache/hudi/issues/9870] > > Code to Reproduce - > ``` > import uuid > from pyspark.sql.functions import col, concat, lit, max, min, substring, desc > COW_TABLE_NAME="table_duplicates" > PARTITION_FIELD = "year,month" > PRECOMBINE_FIELD = "timestamp" > COW_TABLE_LOCATION="file:///tmp/issue_9870_" + str(uuid.uuid4()) > hudi_options_opt = { > "hoodie.table.name": COW_TABLE_NAME, > "hoodie.table.type": "COPY_ON_WRITE", > "hoodie.index.type": "BLOOM", > "hoodie.datasource.write.recordkey.field": "id", > "hoodie.datasource.write.partitionpath.field": PARTITION_FIELD, > "hoodie.datasource.write.precombine.field": PRECOMBINE_FIELD, > "hoodie.datasource.write.hive_style_partitioning": "true", > "hoodie.metadata.enable": "true", > "hoodie.bloom.index.use.metadata": "true", > "hoodie.metadata.index.column.stats.enable": "true", > "hoodie.parquet.small.file.limit": -1 > } > COW_TABLE_LOCATION="file:///tmp/issue_9870_" + str(uuid.uuid4()) > inputDF = spark.createDataFrame( > [ > ('1', "1", '1',2020,1), > ('2', "1", '1',2020,1), > ('3', "1", '1',2020,1) > ], > ["id", "value", "timestamp","year","month"] > ) > (inputDF.write.format("org.apache.hudi") > .option("hoodie.datasource.write.operation", "upsert") > .options(**hudi_options_opt) > .mode("append") > .save(COW_TABLE_LOCATION)) > upsertDF = spark.createDataFrame( > [ > ('3', "2", '1',2020,1) > ], > ["id", "value", "timestamp","year","month"] > ) > (upsertDF.write.format("org.apache.hudi") > .option("hoodie.datasource.write.operation", "upsert") > .options(**hudi_options_opt) > .mode("append") > .save(COW_TABLE_LOCATION)) > spark.read.format('org.apache.hudi').load(COW_TABLE_LOCATION).groupBy("year","month","_hoodie_record_key").count().orderBy(desc("count")).show(100, > False) > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366594355 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java: ## @@ -67,8 +68,9 @@ public static Option<HoodieCommitMetadata> resolveWriteConflictIfAny( Option<HoodieInstant> lastCompletedTxnOwnerInstant, boolean reloadActiveTimeline, Set<String> pendingInstants) throws HoodieWriteConflictException { -// Skip to resolve conflict if using non-blocking concurrency control -if (config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() && !config.isNonBlockingConcurrencyControl()) { +// Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control +WriteOperationType operationType = thisCommitMetadata.orElse(new HoodieCommitMetadata()).getOperationType(); +if (config.needResolveWriteConflict(operationType)) { Review Comment: Just passing around an `Option<HoodieCommitMetadata>` should be fine? There is no need to new a `HoodieCommitMetadata()` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
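Concretely, the suggestion seems to amount to something like the following (a sketch; the exact call site may differ). An absent metadata maps to `UNKNOWN`, which should match what a freshly constructed `HoodieCommitMetadata` reports, without the throwaway allocation:

```java
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.model.WriteOperationType;
import org.apache.hudi.common.util.Option;

class OperationTypeSketch {
  // Derive the operation type directly from the Option instead of
  // materializing a default HoodieCommitMetadata just to call a getter.
  static WriteOperationType operationTypeOf(Option<HoodieCommitMetadata> metadata) {
    return metadata
        .map(HoodieCommitMetadata::getOperationType)
        .orElse(WriteOperationType.UNKNOWN);
  }
}
```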
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366592051 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java: ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.sink.utils; + +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsResolver; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.sink.StreamWriteOperatorCoordinator; +import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper; +import org.apache.hudi.sink.bulk.BulkInsertWriteFunction; +import org.apache.hudi.sink.bulk.RowDataKeyGen; +import org.apache.hudi.sink.event.WriteMetadataEvent; +import org.apache.hudi.util.AvroSchemaConverter; +import org.apache.hudi.util.StreamerUtil; + +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.runtime.io.disk.iomanager.IOManager; +import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync; +import org.apache.flink.runtime.jobgraph.OperatorID; +import org.apache.flink.runtime.memory.MemoryManager; +import org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext; +import org.apache.flink.runtime.operators.coordination.OperatorEvent; +import org.apache.flink.runtime.operators.testutils.MockEnvironment; +import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder; +import org.apache.flink.streaming.api.graph.StreamConfig; +import org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway; +import org.apache.flink.streaming.runtime.tasks.StreamTask; +import org.apache.flink.streaming.util.MockStreamTaskBuilder; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.types.logical.RowType; + +import java.util.HashMap; +import java.util.Map; +import java.util.NoSuchElementException; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.TimeUnit; + +/** + * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance for testing. 
+ * + * @param <I> Input type + */ +public class BulkInsertFunctionWrapper<I> implements TestFunctionWrapper<I> { + private final Configuration conf; + private final RowType rowType; + private final RowType rowTypeWithFileId; + + private final MockStreamingRuntimeContext runtimeContext; + private final MockOperatorEventGateway gateway; + private final MockOperatorCoordinatorContext coordinatorContext; + private final StreamWriteOperatorCoordinator coordinator; + private final MockStateInitializationContext stateInitializationContext; + + private final boolean asyncClustering; + private ClusteringFunctionWrapper clusteringFunctionWrapper; + + /** + * Bulk insert write function. + */ + private BulkInsertWriteFunction<RowData> writeFunction; + private MapFunction<RowData, RowData> mapFunction; + private Map<String, String> bucketIdToFileId; + + public BulkInsertFunctionWrapper(String tablePath, Configuration conf) throws Exception { +IOManager ioManager = new IOManagerAsync(); +MockEnvironment environment = new MockEnvironmentBuilder() +.setTaskName("mockTask") +.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE) +.setIOManager(ioManager) +.build(); +this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, environment); +this.gateway = new MockOperatorEventGateway(); +this.conf = conf; +this.rowType = (RowType) AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType(); +this.rowTypeWithFileId = BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType); +// one function +this.coordinatorContext = new MockOperatorCoordinatorContext(new OperatorID(), 1); +this.coordinator = new StreamWriteOperatorCoordinator(conf, this.coordinatorContext); +this.stateInitializationContext = new MockStateInitializationContext(); +this.asyncClustering = OptionsResolver.needsAsyncClustering(conf); +StreamConfig streamConfig
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366593166 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } - public boolean isNonBlockingConcurrencyControl() { -return getTableType().equals(HoodieTableType.MERGE_ON_READ) -&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() -&& isSimpleBucketIndex(); + public boolean needResolveWriteConflict(WriteOperationType operationType) { +// Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && isSimpleBucketIndex()) { +return operationType == WriteOperationType.UNKNOWN || operationType == WriteOperationType.BULK_INSERT; + } else { +return true; Review Comment: I'm confused, for UPSERT it returns true, and that means we need to resolve conflicts right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]
boneanxs commented on issue #9615: URL: https://github.com/apache/hudi/issues/9615#issuecomment-1772233812 @KnightChess Have you seen any logs like this on the executor side when this issue happens? ```java 23/03/23 02:28:45 INFO HoodieMergeHandle: MaxMemoryPerPartitionMerge => 1073741824 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: another attempt succeeded 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: Stage finished 23/03/23 02:28:47 INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of entries in BitCaskDiskMap => 0, Size of file spilled to disk => 0 23/03/23 02:28:47 INFO HoodieMergeHandle: partitionPath:grass_region=test, fileId to be merged:d3ee8406-4011-44a4-8913-8be0349a6686-0 ``` From our side, we actually see Hudi ignore this kill signal and continue writing (`Executor is trying to kill task`). So there are actually two issues: 1. The task should fail immediately when it receives the kill signal. 2. How to handle duplicate files if reconcile commits finishes while the task is still writing. I think if 1) is handled correctly, we can add an extra clean step (to clean up files already written) on the task side, as sketched below. The extra clean step should solve most cases, but may not solve all of them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
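One way such a task-side clean step could look, as a sketch: register a task-completion listener that checks the interrupt flag and deletes the attempt's in-progress file. The class, path handling, and registration point are all assumptions, not Hudi code, and whether completion listeners fire for every kill path would need verifying:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.TaskContext;
import org.apache.spark.util.TaskCompletionListener;

/** Best-effort cleanup of a partially written base file for a killed attempt. */
public class PartialFileCleaner {
  public static void registerCleanup(String inProgressFilePath, Configuration hadoopConf) {
    TaskContext ctx = TaskContext.get();
    if (ctx == null) {
      return; // not running inside a Spark task
    }
    ctx.addTaskCompletionListener((TaskCompletionListener) context -> {
      // Completion listeners also run for killed attempts, so check the flag.
      if (!context.isInterrupted()) {
        return;
      }
      try {
        Path path = new Path(inProgressFilePath);
        FileSystem fs = path.getFileSystem(hadoopConf);
        fs.delete(path, false); // drop the broken parquet from this attempt
      } catch (Exception e) {
        // best effort only; reconcile still has to tolerate leftovers (issue 2)
      }
    });
  }
}
```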
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
hudi-bot commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772233593 ## CI report: * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406) * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417) * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.blo… [hudi]
hudi-bot commented on PR #9886: URL: https://github.com/apache/hudi/pull/9886#issuecomment-1772233448 ## CI report: * c3e2763136ca7108adc32162d093abb58b143363 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20394) * 9456e57d5d87ce8e2db946c03b2075e2d45b6826 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20419) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
beyond1920 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366571723 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java: ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.sink.utils; + +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsResolver; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.sink.StreamWriteOperatorCoordinator; +import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper; +import org.apache.hudi.sink.bulk.BulkInsertWriteFunction; +import org.apache.hudi.sink.bulk.RowDataKeyGen; +import org.apache.hudi.sink.event.WriteMetadataEvent; +import org.apache.hudi.util.AvroSchemaConverter; +import org.apache.hudi.util.StreamerUtil; + +import org.apache.flink.api.common.ExecutionConfig; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.runtime.io.disk.iomanager.IOManager; +import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync; +import org.apache.flink.runtime.jobgraph.OperatorID; +import org.apache.flink.runtime.memory.MemoryManager; +import org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext; +import org.apache.flink.runtime.operators.coordination.OperatorEvent; +import org.apache.flink.runtime.operators.testutils.MockEnvironment; +import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder; +import org.apache.flink.streaming.api.graph.StreamConfig; +import org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway; +import org.apache.flink.streaming.runtime.tasks.StreamTask; +import org.apache.flink.streaming.util.MockStreamTaskBuilder; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.types.logical.RowType; + +import java.util.HashMap; +import java.util.Map; +import java.util.NoSuchElementException; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.TimeUnit; + +/** + * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance for testing. 
+ * + * @param <I> Input type + */ +public class BulkInsertFunctionWrapper<I> implements TestFunctionWrapper<I> { + private final Configuration conf; + private final RowType rowType; + private final RowType rowTypeWithFileId; + + private final MockStreamingRuntimeContext runtimeContext; + private final MockOperatorEventGateway gateway; + private final MockOperatorCoordinatorContext coordinatorContext; + private final StreamWriteOperatorCoordinator coordinator; + private final MockStateInitializationContext stateInitializationContext; + + private final boolean asyncClustering; + private ClusteringFunctionWrapper clusteringFunctionWrapper; + + /** + * Bulk insert write function. + */ + private BulkInsertWriteFunction<RowData> writeFunction; + private MapFunction<RowData, RowData> mapFunction; + private Map<String, String> bucketIdToFileId; + + public BulkInsertFunctionWrapper(String tablePath, Configuration conf) throws Exception { +IOManager ioManager = new IOManagerAsync(); +MockEnvironment environment = new MockEnvironmentBuilder() +.setTaskName("mockTask") +.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE) +.setIOManager(ioManager) +.build(); +this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, environment); +this.gateway = new MockOperatorEventGateway(); +this.conf = conf; +this.rowType = (RowType) AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType(); +this.rowTypeWithFileId = BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType); +// one function +this.coordinatorContext = new MockOperatorCoordinatorContext(new OperatorID(), 1); +this.coordinator = new StreamWriteOperatorCoordinator(conf, this.coordinatorContext); +this.stateInitializationContext = new MockStateInitializationContext(); +this.asyncClustering = OptionsResolver.needsAsyncClustering(conf); +StreamConfig streamConfi
Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]
beyond1920 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1366565797 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } - public boolean isNonBlockingConcurrencyControl() { -return getTableType().equals(HoodieTableType.MERGE_ON_READ) -&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() -&& isSimpleBucketIndex(); + public boolean needResolveWriteConflict(WriteOperationType operationType) { +// Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && isSimpleBucketIndex()) { +return operationType == WriteOperationType.UNKNOWN || operationType == WriteOperationType.BULK_INSERT; + } else { +return true; Review Comment: > So for `UPSERT` we also needs to resolve conflicts? No, `UPSERT` does not need conflict resolution. > Can we detect the conflict in such scenario? The first case: t1.commit would fail to commit because of the conflict. The second case: both t1 and t2 could commit successfully. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
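For intuition on why upsert can skip resolution under NB-CC: the simple bucket index routes a given record key deterministically to one bucket and hence one file group, so two concurrent upserts append to the same file group's log files and get merged on read, instead of racing to create conflicting base files. A simplified illustration of that routing (not Hudi's actual `BucketIdentifier` hashing):

```java
import java.util.Arrays;
import java.util.List;

/** Simplified sketch: the same record key always lands in the same bucket. */
public class BucketRoutingSketch {
  static int bucketId(String recordKey, int numBuckets) {
    // Hudi hashes the configured key fields; shown here as a plain
    // non-negative hash mod the bucket count.
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList("id:1", "id:2", "id:3");
    // Writers t1 and t2 computing the same routing is what makes their
    // concurrent upserts mergeable rather than conflicting.
    keys.forEach(k -> System.out.println(k + " -> bucket " + bucketId(k, 8)));
  }
}
```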
Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]
stream2000 commented on PR #9749: URL: https://github.com/apache/hudi/pull/9749#issuecomment-1772205039 > This new strategy looks better to me. I see that some multiwriter tests are failing in azure. The CI is green now~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org