Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9761:
URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773677812

   
   ## CI report:
   
   * d680ff4d9daad14fea165c1520ef906606e991a5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20426)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9761:
URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773646118

   
   ## CI report:
   
   * 60875bdf9343f5f48a862393ff5845063cf549d7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20322)
 
   * d680ff4d9daad14fea165c1520ef906606e991a5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20426)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9761:
URL: https://github.com/apache/hudi/pull/9761#issuecomment-1773642737

   
   ## CI report:
   
   * 60875bdf9343f5f48a862393ff5845063cf549d7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20322)
 
   * d680ff4d9daad14fea165c1520ef906606e991a5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1367634939


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -255,40 +255,47 @@ object DefaultSource {
 Option.empty
   }
 
-  (tableType, queryType, isBootstrappedTable) match {
-case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) |
- (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) |
- (MERGE_ON_READ, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) =>
+  val isMultipleBaseFileFormatsEnabled = metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled
+  (tableType, queryType, isBootstrappedTable, isMultipleBaseFileFormatsEnabled) match {
+case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |
+ (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false, true) =>
+  new HoodieMultiFileFormatCOWRelation(sqlContext, metaClient, parameters, userSchema, globPaths)
+case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1367634339


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMultiFileFormatRelation.scala:
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.projectReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieMultiFileFormatRelation.{createPartitionedFile, inferFileFormat}
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieLogFile}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * Base split for all Hoodie multi-file format relations.
+ */
+case class HoodieMultiFileFormatSplit(baseFile: Option[PartitionedFile],
+  logFiles: List[HoodieLogFile]) extends HoodieFileSplit
+
+/**
+ * Base relation to handle table with multiple base file formats.
+ */
+abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext,
+ override val metaClient: HoodieTableMetaClient,

Review Comment:
   Yes, we are using the file extension to figure out the base file format. However, the usual `BaseFileOnlyRelation` converts to `HadoopFsRelation`, which cannot be done for multiple file formats because the file format is not known at the time the relation is created. It is only known when the relation is collecting file splits. Hence, a separate relation is needed to handle multiple file formats. I will add this to the scaladoc of the class.
   Plus, I think a separate relation keeps the code clean and more maintainable.
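   As extra context for the point above, a minimal, self-contained sketch of extension-based format inference (illustrative only: the class and enum names below are stand-ins rather than the actual Hudi types, although the PR does import a helper named `inferFileFormat` from `HoodieMultiFileFormatRelation`):

```java
import java.util.Locale;

public class FileFormatInferenceSketch {
  // Stand-in for Hudi's base file format enum.
  enum BaseFileFormat { PARQUET, ORC, HFILE }

  // Infer the base file format purely from the file name's extension.
  static BaseFileFormat inferFileFormat(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".parquet")) {
      return BaseFileFormat.PARQUET;
    }
    if (lower.endsWith(".orc")) {
      return BaseFileFormat.ORC;
    }
    if (lower.endsWith(".hfile")) {
      return BaseFileFormat.HFILE;
    }
    throw new IllegalArgumentException("Unknown base file extension: " + fileName);
  }

  public static void main(String[] args) {
    // The format is only known once concrete file names are in hand,
    // i.e. while collecting file splits, not when the relation is created.
    System.out.println(inferFileFormat("fileid1_1-0-1_20231020.parquet")); // PARQUET
    System.out.println(inferFileFormat("fileid2_1-0-1_20231020.orc"));     // ORC
  }
}
```
   This is also why the relation cannot be lowered to a single `HadoopFsRelation` up front: each collected split may resolve to a different reader.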



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1367633031


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -516,6 +515,99 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
*/
   def updatePrunedDataSchema(prunedSchema: StructType): Relation
 
+  protected def createBaseFileReaders(tableSchema: HoodieTableSchema,
+  requiredSchema: HoodieTableSchema,
+  requestedColumns: Array[String],
+  requiredFilters: Seq[Filter],
+  optionalFilters: Seq[Filter] = Seq.empty,
+  baseFileFormat: HoodieFileFormat = tableConfig.getBaseFileFormat): HoodieMergeOnReadBaseFileReaders = {
+val (partitionSchema, dataSchema, requiredDataSchema) =
+  tryPrunePartitionColumns(tableSchema, requiredSchema)
+
+val fullSchemaReader = createBaseFileReader(
+  spark = sqlContext.sparkSession,
+  partitionSchema = partitionSchema,
+  dataSchema = dataSchema,
+  requiredDataSchema = dataSchema,
+  // This file-reader is used to read base file records, subsequently merging them with the records
+  // stored in delta-log files. As such, we have to read _all_ records from the base file, while avoiding
+  // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that
+  // we combine them correctly);
+  // As such only required filters could be pushed-down to such reader
+  filters = requiredFilters,
+  options = optParams,
+  // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+  //   to configure Parquet reader appropriately
+  hadoopConf = embedInternalSchema(new Configuration(conf), internalSchemaOpt),
+  baseFileFormat = baseFileFormat
+)
+
+val requiredSchemaReader = createBaseFileReader(
+  spark = sqlContext.sparkSession,
+  partitionSchema = partitionSchema,
+  dataSchema = dataSchema,
+  requiredDataSchema = requiredDataSchema,
+  // This file-reader is used to read base file records, subsequently merging them with the records
+  // stored in delta-log files. As such, we have to read _all_ records from the base file, while avoiding
+  // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that
+  // we combine them correctly);
+  // As such only required filters could be pushed-down to such reader
+  filters = requiredFilters,
+  options = optParams,
+  // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+  //   to configure Parquet reader appropriately
+  hadoopConf = embedInternalSchema(new Configuration(conf), requiredDataSchema.internalSchema),
+  baseFileFormat = baseFileFormat
+)
+
+// Check whether fields required for merging were also requested to be fetched
+// by the query:
+//- In case they were, there's no optimization we could apply here (we will have
+//to fetch such fields)
+//- In case they were not, we will provide 2 separate file-readers
+//a) One which would be applied to file-groups w/ delta-logs (merging)
+//b) One which would be applied to file-groups w/ no delta-logs or
+//   in case query-mode is skipping merging
+val mandatoryColumns = mandatoryFields.map(HoodieAvroUtils.getRootLevelFieldName)
+if (mandatoryColumns.forall(requestedColumns.contains)) {
+  HoodieMergeOnReadBaseFileReaders(
+fullSchemaReader = fullSchemaReader,
+requiredSchemaReader = requiredSchemaReader,
+requiredSchemaReaderSkipMerging = requiredSchemaReader
+  )
+} else {
+  val prunedRequiredSchema = {
+val unusedMandatoryColumnNames = mandatoryColumns.filterNot(requestedColumns.contains)
+val prunedStructSchema =
+  StructType(requiredDataSchema.structTypeSchema.fields
+.filterNot(f => unusedMandatoryColumnNames.contains(f.name)))
+
+HoodieTableSchema(prunedStructSchema, convertToAvroSchema(prunedStructSchema, tableName).toString)
+  }
+
+  val requiredSchemaReaderSkipMerging = createBaseFileReader(
+spark = sqlContext.sparkSession,
+partitionSchema = partitionSchema,
+dataSchema = dataSchema,
+requiredDataSchema = prunedRequiredSchema,
+// This file-reader is only used in cases when no merging is performed, therefore it's safe to push
+// down these filters to the base file readers
+filters = requiredFilters ++ optionalFilters,
+options = optParams,
+// NOTE: We have to fork the Hadoop Config here as Spark will be modifying it
+//   to configure Parquet reader appropriately

Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1367623109


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java:
##
@@ -553,7 +554,12 @@ public static void writeDataAsBatch(
* Initializes a writing pipeline with given configuration.
*/
   public static TestFunctionWrapper getWritePipeline(String basePath, Configuration conf) throws Exception {
-if (OptionsResolver.isAppendMode(conf)) {
+if (OptionsResolver.isBulkInsertOperation(conf)) {
+  if (conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT)) {
+throw new UnsupportedOperationException("Harness test does not support bulk insert sort yet");

Review Comment:
   The sort is disabled by default? Then we can remove this check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1367622937


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java:
##
@@ -305,6 +305,34 @@ public TestHarness checkpointComplete(long checkpointId) {
   return this;
 }
 
+/**
+ * Flush data using endInput. Asserts the commit would fail.
+ */
+public void commitAsBatchThrows(Class cause, String msg) {
+  // this triggers the data write and event send
+  this.pipeline.endInput();
+  final OperatorEvent nextEvent = this.pipeline.getNextEvent();
+  this.pipeline.getCoordinator().handleEventFromOperator(0, nextEvent);
+  assertTrue(this.pipeline.getCoordinatorContext().isJobFailed(), "Job should have been failed");
+  Throwable throwable = this.pipeline.getCoordinatorContext().getJobFailureReason().getCause();
+  assertThat(throwable, instanceOf(cause));
+  assertThat(throwable.getMessage(), containsString(msg));
+}
+
+/**
+ * Flush data using endInput. Asserts the commit would fail.
+ */
+public TestHarness commitAsBatch() {
+  // this triggers the data write and event send
+  this.pipeline.endInput();
+  final OperatorEvent nextEvent = this.pipeline.getNextEvent();
+  this.pipeline.getCoordinator().handleEventFromOperator(0, nextEvent);
+  if (this.pipeline.getCoordinatorContext().isJobFailed()) {
+System.err.println(this.pipeline.getCoordinatorContext().getJobFailureReason().getCause());

Review Comment:
   We do not print in tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1367620324


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
##
@@ -387,7 +387,7 @@ private void addEventToBuffer(WriteMetadataEvent event) {
 
   private void startInstant() {
 // refresh the last txn metadata
-this.writeClient.preTxn(this.metaClient);
+this.writeClient.preTxn(WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)), this.metaClient);
 // put the assignment in front of metadata generation,

Review Comment:
   The operation type is included in the `TableState`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1367619653


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
 return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
-  public boolean isNonBlockingConcurrencyControl() {
-return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-&& isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+// Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && isSimpleBucketIndex()) {
+return operationType == WriteOperationType.UNKNOWN || operationType == WriteOperationType.BULK_INSERT;
+  } else {
+return true;

Review Comment:
   Can we simplify the logic as:
   
   ```java
   return is_nb_cc ? operation == BULK_INSERT : true;
   ```
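   One possible standalone reading of that suggestion, sketched under the assumption that NB-CC here means OCC plus MERGE_ON_READ plus a simple bucket index (as in the diff above), and leaving out the `UNKNOWN` fallback from the original condition; illustrative only, not the merged change:

```java
public class ConflictResolutionSketch {
  // Simplified stand-in for Hudi's WriteOperationType.
  enum WriteOperationType { INSERT, UPSERT, BULK_INSERT }

  // Returns whether write-conflict resolution is still needed.
  static boolean needResolveWriteConflict(boolean occEnabled,
                                          boolean nonBlockingConcurrencyControl,
                                          WriteOperationType operationType) {
    if (!occEnabled) {
      return false; // nothing to resolve without optimistic concurrency control
    }
    // Under NB-CC only bulk_insert still needs explicit conflict resolution;
    // every other operation relies on NB-CC itself.
    return !nonBlockingConcurrencyControl
        || operationType == WriteOperationType.BULK_INSERT;
  }

  public static void main(String[] args) {
    System.out.println(needResolveWriteConflict(true, true, WriteOperationType.UPSERT));      // false
    System.out.println(needResolveWriteConflict(true, true, WriteOperationType.BULK_INSERT)); // true
    System.out.println(needResolveWriteConflict(true, false, WriteOperationType.UPSERT));     // true
  }
}
```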



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6953] Optimizing hudi sink operators generation [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9878:
URL: https://github.com/apache/hudi/pull/9878#discussion_r1367616667


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:
##
@@ -106,17 +106,26 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
   // bootstrap
   final DataStream hoodieRecordDataStream =
   Pipelines.bootstrap(conf, rowType, dataStream, context.isBounded(), overwrite);
+
   // write pipeline
   pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
-  // compaction
+
+  // insert cluster mode
+  if (OptionsResolver.isInsertClusterMode(conf)) {
+return Pipelines.clean(conf, pipeline);
+  }

Review Comment:
   Can you check the Travis test failures?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6965) Flink Quickstart Restructuring

2023-10-20 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6965.

Resolution: Fixed

Fixed via asf-site: 46445fa14aa4bfc9d364f228278753870e5c21dc

> Flink Quickstart Restructuring
> --
>
> Key: HUDI-6965
> URL: https://issues.apache.org/jira/browse/HUDI-6965
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch asf-site updated: [HUDI-6965][Docs] Flink Quickstart Revamp (#9897)

2023-10-20 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 46445fa14aa [HUDI-6965][Docs] Flink Quickstart Revamp (#9897)
46445fa14aa is described below

commit 46445fa14aa4bfc9d364f228278753870e5c21dc
Author: Aditya Goenka <63430370+ad1happy...@users.noreply.github.com>
AuthorDate: Sat Oct 21 08:32:18 2023 +0530

[HUDI-6965][Docs] Flink Quickstart Revamp (#9897)
---
 website/docs/flink-quick-start-guide.md | 313 ++--
 1 file changed, 176 insertions(+), 137 deletions(-)

diff --git a/website/docs/flink-quick-start-guide.md 
b/website/docs/flink-quick-start-guide.md
index acec412e41f..cfba29ad8db 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -7,20 +7,9 @@ import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
 This page introduces Flink-Hudi integration. We can feel the unique charm of 
how Flink brings in the power of streaming into Hudi.
-This guide helps you quickly start using Flink on Hudi, and learn different 
modes for reading/writing Hudi by Flink:
+This guide helps you quickly start using Flink on Hudi, and learn different 
modes for reading/writing Hudi by Flink.
 
-- **Quick Start** : Read [Quick Start](#quick-start) to get started quickly 
Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Global 
Configuration](/docs/next/flink_tuning#global-configurations), sets up through 
`$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through 
[Table Option](/docs/next/flink_tuning#table-options).
-- **Writing Data** : Flink supports different modes for writing, such as [CDC 
Ingestion](/docs/hoodie_streaming_ingestion#cdc-ingestion), [Bulk 
Insert](/docs/hoodie_streaming_ingestion#bulk-insert), [Index 
Bootstrap](/docs/hoodie_streaming_ingestion#index-bootstrap), [Changelog 
Mode](/docs/hoodie_streaming_ingestion#changelog-mode) and [Append 
Mode](/docs/hoodie_streaming_ingestion#append-mode).
-- **Querying Data** : Flink supports different modes for reading, such as 
[Streaming Query](/docs/querying_data#streaming-query) and [Incremental 
Query](/docs/querying_data#incremental-query).
-- **Tuning** : For write/read tasks, this guide gives some tuning suggestions, 
such as [Memory Optimization](/docs/next/flink_tuning#memory-optimization) and 
[Write Rate Limit](/docs/next/flink_tuning#write-rate-limit).
-- **Optimization**: Offline compaction is supported [Offline 
Compaction](/docs/compaction#flink-offline-compaction).
-- **Query Engines**: Besides Flink, many other engines are integrated: [Hive 
Query](/docs/syncing_metastore#flink-setup), [Presto 
Query](/docs/querying_data#prestodb).
-- **Catalog**: A Hudi specific catalog is supported: [Hudi 
Catalog](/docs/querying_data/#hudi-catalog).
-
-## Quick Start
-
-### Setup
+## Setup
 https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sqlclient/)
 because it's a good
 quick start tool for SQL users.
 
- Step.1 download Flink jar
-
-Hudi works with both Flink 1.13, Flink 1.14, Flink 1.15, Flink 1.16 and Flink 
1.17. You can follow the
-instructions [here](https://flink.apache.org/downloads) for setting up Flink. 
Then choose the desired Hudi-Flink bundle
-jar to work with different Flink and Scala versions:
-
-- `hudi-flink1.13-bundle`
-- `hudi-flink1.14-bundle`
-- `hudi-flink1.15-bundle`
-- `hudi-flink1.16-bundle`
-- `hudi-flink1.17-bundle`
+### Download Flink
 
- Step.2 start Flink cluster
-Start a standalone Flink cluster within hadoop environment.
-Before you start up the cluster, we suggest to config the cluster as follows:
+Hudi works with Flink 1.13, Flink 1.14, Flink 1.15, Flink 1.16 and Flink 1.17. 
You can follow the
+instructions [here](https://flink.apache.org/downloads) for setting up Flink.
 
-- in `$FLINK_HOME/conf/flink-conf.yaml`, add config option 
`taskmanager.numberOfTaskSlots: 4`
-- in `$FLINK_HOME/conf/flink-conf.yaml`, [add other global configurations 
according to the characteristics of your 
task](/docs/next/flink_tuning#global-configurations)
-- in `$FLINK_HOME/conf/workers`, add item `localhost` as 4 lines so that there 
are 4 workers on the local cluster
-
-Now starts the cluster:
+### Start Flink cluster
+Start a standalone Flink cluster within hadoop environment. In case we are 
trying on local setup, then we could download hadoop binaries and set 
HADOOP_HOME.
 
 ```bash
 # HADOOP_HOME is your hadoop root directory after unpack the binary package.
@@ -63,21 +38,34 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
 # Start the Flink standalone cluster
 ./bin/start-cluster.sh
 ```
- Step.3 start Flink SQL client
+### Start Flink SQL client
 
 Hudi supports packaged bundle jar for Flink, which should be loaded in the 
Flink SQL Client when 

[jira] [Created] (HUDI-6965) Flink Quickstart Restructuring

2023-10-20 Thread Danny Chen (Jira)
Danny Chen created HUDI-6965:


 Summary: Flink Quickstart Restructuring
 Key: HUDI-6965
 URL: https://issues.apache.org/jira/browse/HUDI-6965
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [Docs] Flink Quickstart Restructuring [hudi]

2023-10-20 Thread via GitHub


danny0405 merged PR #9897:
URL: https://github.com/apache/hudi/pull/9897


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p

2023-10-20 Thread via GitHub


danny0405 commented on issue #8614:
URL: https://github.com/apache/hudi/issues/8614#issuecomment-1773622757

   @ad1happy2go Would you mind reproducing with the given cmd from @pushpavanthar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1367610034


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -821,6 +820,8 @@ public static class PropertyBuilder {
 private String metadataPartitions;
 private String inflightMetadataPartitions;
 private String secondaryIndexesMetadata;
+private Boolean multipleBaseFileFormatsEnabled;
+private String baseFileFormats;

Review Comment:
   Removed; only keeping the boolean config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-20 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1367606971


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:
##
@@ -866,11 +866,11 @@ public void validateInsertSchema() throws HoodieInsertException {
   }
 
   public HoodieFileFormat getBaseFileFormat() {
-return metaClient.getTableConfig().getBaseFileFormat();
-  }
-
-  public HoodieFileFormat getLogFileFormat() {
-return metaClient.getTableConfig().getLogFileFormat();
+HoodieTableConfig tableConfig = metaClient.getTableConfig();
+if (tableConfig.contains(HoodieTableConfig.BASE_FILE_FORMAT)) {
+  return metaClient.getTableConfig().getBaseFileFormat();
+}
+return config.getBaseFileFormat();

Review Comment:
   Checking the table config first for backward compatibility.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1773523545

   
   ## CI report:
   
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * c580c03a82bb25a639e0659b779cdf10acfaf33e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20425)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1773483151

   
   ## CI report:
   
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420)
 
   * c580c03a82bb25a639e0659b779cdf10acfaf33e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6961) Deletes with custom delete field not working with DefaultHoodieRecordPayload

2023-10-20 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6961:

Story Points: 5  (was: 3)

> Deletes with custom delete field not working with DefaultHoodieRecordPayload
> 
>
> Key: HUDI-6961
> URL: https://issues.apache.org/jira/browse/HUDI-6961
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When configuring custom delete key and delete marker with 
> DefaultHoodieRecordPayload, writing fails with the deletes in the batch:
> {code:java}
> Error for key:HoodieKey { recordKey=0 partitionPath=} is 
> java.util.NoSuchElementException: No value present in Option
>   at org.apache.hudi.common.util.Option.get(Option.java:89)
>   at 
> org.apache.hudi.common.model.HoodieAvroRecord.prependMetaFields(HoodieAvroRecord.java:132)
>   at 
> org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:144)
>   at 
> org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:180)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42)
>   at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:69)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>   at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773365413

   
   ## CI report:
   
   * 6e9970649492dd16f26a02d5842956d4212ec31d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20423)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6952) Skip reading the uncommitted log files for log reader

2023-10-20 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6952.

Resolution: Fixed

Fixed via master branch: c9c7c1019f5ca903ad3782f79370663f5ae26cb9

> Skip reading the uncommitted log files for log reader
> -
>
> Key: HUDI-6952
> URL: https://issues.apache.org/jira/browse/HUDI-6952
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6952] Skip reading the uncommitted log files for log reader (#9879)

2023-10-20 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c9c7c1019f5 [HUDI-6952] Skip reading the uncommitted log files for log 
reader (#9879)
c9c7c1019f5 is described below

commit c9c7c1019f5ca903ad3782f79370663f5ae26cb9
Author: Danny Chan 
AuthorDate: Sat Oct 21 04:21:14 2023 +0800

[HUDI-6952] Skip reading the uncommitted log files for log reader (#9879)

This is to avoid potential exceptions when the reader is processing an 
uncommitted log file while the
cleaning or rollback service removes the log file.
---
 .../hudi/client/TestCompactionAdminClient.java | 14 ++--
 .../table/log/AbstractHoodieLogRecordReader.java   | 24 ++-
 .../table/timeline/CompletionTimeQueryView.java| 19 +++--
 .../table/timeline/HoodieDefaultTimeline.java  | 57 +++
 .../table/view/AbstractTableFileSystemView.java| 84 +-
 .../table/view/TestHoodieTableFileSystemView.java  | 23 +++---
 6 files changed, 129 insertions(+), 92 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
index 4177297a6ba..20677bf7c85 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
@@ -20,7 +20,6 @@ package org.apache.hudi.client;
 
 import org.apache.hudi.client.CompactionAdminClient.ValidationOpResult;
 import org.apache.hudi.common.model.CompactionOperation;
-import org.apache.hudi.common.model.HoodieFileGroup;
 import org.apache.hudi.common.model.HoodieLogFile;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
@@ -208,7 +207,6 @@ public class TestCompactionAdminClient extends 
HoodieClientTestBase {
 final HoodieTableFileSystemView newFsView =
 new HoodieTableFileSystemView(metaClient, 
metaClient.getCommitsAndCompactionTimeline());
 Set commitsWithDataFile = CollectionUtils.createSet("000", "004");
-Set commitsWithLogAfterCompactionRequest = 
CollectionUtils.createSet("000", "002");
 // Expect each file-slice whose base-commit is same as compaction commit 
to contain no new Log files
 
newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0],
 compactionInstant, true)
 .filter(fs -> fs.getBaseInstantTime().compareTo(compactionInstant) <= 
0)
@@ -218,16 +216,12 @@ public class TestCompactionAdminClient extends 
HoodieClientTestBase {
   } else {
 assertFalse(fs.getBaseFile().isPresent(), "No Data file should be 
present");
   }
-  if 
(commitsWithLogAfterCompactionRequest.contains(fs.getBaseInstantTime())) {
-assertEquals(4, fs.getLogFiles().count(), "Has Log Files");
-  } else {
-assertEquals(2, fs.getLogFiles().count(), "Has Log Files");
-  }
+  assertEquals(2, fs.getLogFiles().count(), "Has Log Files");
 });
 
 // Ensure same number of log-files before and after renaming per fileId
 Map fileIdToCountsAfterRenaming =
-
newFsView.getAllFileGroups(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0]).flatMap(HoodieFileGroup::getAllFileSlices)
+
newFsView.getLatestMergedFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0],
 ingestionInstant)
 .filter(fs -> fs.getBaseInstantTime().equals(ingestionInstant))
 .map(fs -> Pair.of(fs.getFileId(), fs.getLogFiles().count()))
 .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
@@ -246,7 +240,7 @@ public class TestCompactionAdminClient extends 
HoodieClientTestBase {
 ensureValidCompactionPlan(compactionInstant);
 
 // Check suggested rename operations
-metaClient = 
HoodieTableMetaClient.builder().setConf(metaClient.getHadoopConf()).setBasePath(basePath).setLoadActiveTimelineOnLoad(true).build();
+metaClient.reloadActiveTimeline();
 
 // Log files belonging to file-slices created because of compaction 
request must be renamed
 
@@ -277,7 +271,7 @@ public class TestCompactionAdminClient extends 
HoodieClientTestBase {
 
 // Ensure same number of log-files before and after renaming per fileId
 Map fileIdToCountsAfterRenaming =
-
newFsView.getAllFileGroups(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0]).flatMap(HoodieFileGroup::getAllFileSlices)
+
newFsView.getLatestMergedFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0],
 ingestionInstant)
 .filter(fs -> fs.getBaseInstantTime().equals(ingestionInstant))
 .filter(fs -> fs.getFileId().equals(op.getF

Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-20 Thread via GitHub


danny0405 merged PR #9879:
URL: https://github.com/apache/hudi/pull/9879


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on PR #9879:
URL: https://github.com/apache/hudi/pull/9879#issuecomment-1773345831

   Test has passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=20416&view=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773317660

   
   ## CI report:
   
   * c9d09be26e354d8fb4efdc51e1e1e9a6da7d2031 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20413)
 
   * 6e9970649492dd16f26a02d5842956d4212ec31d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20423)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Row writer optimization for bulk insert [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9852:
URL: https://github.com/apache/hudi/pull/9852#issuecomment-1773308489

   
   ## CI report:
   
   * c9d09be26e354d8fb4efdc51e1e1e9a6da7d2031 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20413)
 
   * 6e9970649492dd16f26a02d5842956d4212ec31d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy [hudi]

2023-10-20 Thread via GitHub


ksmou closed pull request #9126: [HUDI-6482] Supports new compaction strategy 
DayBasedAndBoundedIOCompactionStrategy
URL: https://github.com/apache/hudi/pull/9126


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6565] Spark offline compaction add failed retry mechanism [hudi]

2023-10-20 Thread via GitHub


ksmou closed pull request #9229: [HUDI-6565] Spark offline compaction add 
failed retry mechanism
URL: https://github.com/apache/hudi/pull/9229


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Compaction error [hudi]

2023-10-20 Thread via GitHub


fearlsgroove commented on issue #9885:
URL: https://github.com/apache/hudi/issues/9885#issuecomment-1773063344

   Yep, I can read the data successfully. The schema shows the only decimal value as a (12,2) value, which is what I expect. Is there a way to read whatever records in the logs due for compaction might be causing this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] too many s3 list when hoodie.metadata.enable=true [hudi]

2023-10-20 Thread via GitHub


njalan commented on issue #9751:
URL: https://github.com/apache/hudi/issues/9751#issuecomment-1773018752

   @ad1happy2go May I know if there are any updates from you? If we can't reduce the object listing, can we cache this metadata on the driver?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Remove /next reference from already published doc versions [hudi]

2023-10-20 Thread via GitHub


bhasudha merged PR #9893:
URL: https://github.com/apache/hudi/pull/9893


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6964) Fix HoodieStreamer writing json value issue

2023-10-20 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-6964:
---
Summary: Fix HoodieStreamer writing json value issue  (was: Fix writing 
json value issue)

> Fix HoodieStreamer writing json value issue
> ---
>
> Key: HUDI-6964
> URL: https://issues.apache.org/jira/browse/HUDI-6964
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
> Fix For: 0.15.0
>
>
> when value is a json string, HoodieDeltaStreamer will write it as a map 
> string, which is inconsistent with the original data



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6964) Fix writing json value issue

2023-10-20 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-6964:
---
Summary: Fix writing json value issue  (was: Fix writing json format issue)

> Fix writing json value issue
> 
>
> Key: HUDI-6964
> URL: https://issues.apache.org/jira/browse/HUDI-6964
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
> Fix For: 0.15.0
>
>
> when value is a json string, HoodieDeltaStreamer will write it as a map 
> string, which is inconsistent with the original data



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6964) Fix writing json format issue

2023-10-20 Thread Xianghu Wang (Jira)
Xianghu Wang created HUDI-6964:
--

 Summary: Fix writing json format issue
 Key: HUDI-6964
 URL: https://issues.apache.org/jira/browse/HUDI-6964
 Project: Apache Hudi
  Issue Type: Bug
  Components: deltastreamer
Reporter: Xianghu Wang
Assignee: Xianghu Wang
 Fix For: 0.15.0


when value is a json string, HoodieDeltaStreamer will write it as a map string, 
which is inconsistent with the original data



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [Docs] Flink Quickstart Restructuring [hudi]

2023-10-20 Thread via GitHub


ad1happy2go opened a new pull request, #9897:
URL: https://github.com/apache/hudi/pull/9897

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772748273

   
   ## CI report:
   
   * 72489cb31221dfb8152f683cd9140b401134c9ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.blo… [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9886:
URL: https://github.com/apache/hudi/pull/9886#issuecomment-1772665871

   
   ## CI report:
   
   * 9456e57d5d87ce8e2db946c03b2075e2d45b6826 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20419)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] KryoException while using spark-hudi [hudi]

2023-10-20 Thread via GitHub


ad1happy2go commented on issue #9891:
URL: https://github.com/apache/hudi/issues/9891#issuecomment-1772606850

   @akshayakp97 Did you try 0.12.3?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/p

2023-10-20 Thread via GitHub


pushpavanthar commented on issue #8614:
URL: https://github.com/apache/hudi/issues/8614#issuecomment-1772585088

   @danny0405 we are facing the same issue with Hudi version 0.13.1 and Spark versions 3.2.1 and 3.3.2. Below is the command we use to run it; the same command used to work fine with Hudi 0.11.1. 
   `spark-submit --master yarn --packages 
org.apache.spark:spark-avro_2.12:3.2.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.13.1,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.1,org.apache.hudi:hudi-aws-bundle:0.13.1
 --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --conf 
spark.executor.cores=5 --conf spark.driver.memory=3200m --conf 
spark.driver.memoryOverhead=800m --conf spark.executor.memoryOverhead=1400m 
--conf spark.executor.memory=14600m --conf spark.dynamicAllocation.enabled=true 
--conf spark.dynamicAllocation.initialExecutors=1 --conf 
spark.dynamicAllocation.minExecutors=1 --conf 
spark.dynamicAllocation.maxExecutors=21 --conf spark.scheduler.mode=FAIR --conf 
spark.task.maxFailures=5 --conf spark.rdd.compress=true --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.shuffle.service.enabled=true --conf 
spark.sql.hive.convertMetastoreParquet=false --conf 
spark.yarn.max.executor.failures=5 --conf spark.driver.userClassPathFirst=true
  --conf spark.executor.userClassPathFirst=true --conf 
spark.sql.catalogImplementation=hive --deploy-mode client 
s3://bucket_name/custom_jar-2.0.jar --hoodie-conf 
hoodie.parquet.compression.codec=snappy --hoodie-conf 
hoodie.deltastreamer.source.hoodieincr.num_instants=100 --table-type 
COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.HoodieIncrSource 
--hoodie-conf 
hoodie.deltastreamer.source.hoodieincr.path=s3://bucket_name/ml_attributes/features
 --hoodie-conf hoodie.metrics.on=true --hoodie-conf 
hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY --hoodie-conf 
hoodie.metrics.pushgateway.host=pushgateway.in --hoodie-conf 
hoodie.metrics.pushgateway.port=443 --hoodie-conf 
hoodie.metrics.pushgateway.delete.on.shutdown=false --hoodie-conf 
hoodie.metrics.pushgateway.job.name=hudi_transformed_features_accounts_hudi 
--hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false 
--hoodie-conf hoodie.metadata.enable=true --hoodie-conf 
hoodie.metrics.reporter.metricsname.prefix=hudi --target-base-path s3://bucket_name_transformed/features_accounts 
--target-table features_accounts --enable-sync --hoodie-conf 
hoodie.datasource.hive_sync.database=hudi_transformed --hoodie-conf 
hoodie.datasource.hive_sync.table=features_accounts --sync-tool-classes 
org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool --hoodie-conf 
hoodie.datasource.write.recordkey.field=id,pos --hoodie-conf 
hoodie.datasource.write.precombine.field=id --hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
 --hoodie-conf hoodie.datasource.write.partitionpath.field=created_at_dt 
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=created_at_dt 
--hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
 --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING 
--hoodie-conf 
"hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd'T'HH:mm:ss.SS'Z',
 -MM-dd' 'HH:mm:ss.SS,-MM-dd' 'HH:mm:ss,-MM-dd'T'HH:mm:ss'Z'" 
--hoodie-conf 
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd 
--source-ordering-field id --hoodie-conf secret.key.name=some-secret 
--hoodie-conf transformer.decrypt.cols=features_json --hoodie-conf 
transformer.uncompress.cols=false --hoodie-conf 
transformer.jsonToStruct.column=features_json --hoodie-conf 
transformer.normalize.column=features_json.accounts --hoodie-conf 
transformer.copy.fields=created_at,created_at_dt --transformer-class 
com.custom.transform.DecryptTransformer,com.custom.transform.JsonToStructTypeTransformer,com.custom.transform.NormalizeArrayTransformer,com.custom.transform.FlatteningTransformer,com.custom.transform.CopyFieldTransformer`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772576148

   
   ## CI report:
   
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
 
   * 72489cb31221dfb8152f683cd9140b401134c9ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772497032

   
   ## CI report:
   
   * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
 
   * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
 
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
 
   * 72489cb31221dfb8152f683cd9140b401134c9ee Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20422)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772496955

   
   ## CI report:
   
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]

2023-10-20 Thread via GitHub


KnightChess commented on issue #9615:
URL: https://github.com/apache/hudi/issues/9615#issuecomment-1772439298

   > 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: another attempt succeeded
   > 23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: Stage finished
   @boneanxs yes, these `try to kill` logs do appear. There are actually two 
issues: the iterator used during compaction is uninterruptible, so the task 
cannot be killed immediately, and fixing that alone does not completely solve 
the problem.
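   
   A minimal sketch (not Hudi's actual API) of how the compaction record 
iterator could honor the kill flag between records, assuming Spark's 
`TaskContext#killTaskIfInterrupted` (available since Spark 2.2); 
`KillAwareIterator` is an illustrative name:
   
   ```java
   import java.util.Iterator;
   
   import org.apache.spark.TaskContext;
   
   // Illustrative wrapper: checks the Spark task-kill flag periodically so a
   // speculative "another attempt succeeded" kill stops the merge instead of
   // letting it silently finish writing the file.
   public class KillAwareIterator<T> implements Iterator<T> {
     private final Iterator<T> delegate;
     private long seen = 0;
   
     public KillAwareIterator(Iterator<T> delegate) {
       this.delegate = delegate;
     }
   
     @Override
     public boolean hasNext() {
       TaskContext ctx = TaskContext.get();
       if (ctx != null && seen++ % 1000 == 0) {
         ctx.killTaskIfInterrupted(); // throws TaskKilledException if a kill is pending
       }
       return delegate.hasNext();
     }
   
     @Override
     public T next() {
       return delegate.next();
     }
   }
   ```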


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772421485

   
   ## CI report:
   
   * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
 
   * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
 
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
 
   * 72489cb31221dfb8152f683cd9140b401134c9ee UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772408526

   
   ## CI report:
   
   * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
 
   * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
 
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6952] Skip reading the uncommitted log files for log reader [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9879:
URL: https://github.com/apache/hudi/pull/9879#issuecomment-1772408300

   
   ## CI report:
   
   * 94301a15c9f75355c0ebf5bab3baf6226820ac42 UNKNOWN
   * ebc4df431089161ff933f125829bc7975bb04509 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20416)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772344024

   
   ## CI report:
   
   * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
 
   * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
 
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20421)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9896:
URL: https://github.com/apache/hudi/pull/9896#issuecomment-1772331421

   
   ## CI report:
   
   * f4fa15f3e401363be00607940dd5f8b6a4af3215 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20415)
 
   * 309382d4ac6cc4dfd22d614417717a2f287b6bdb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20418)
 
   * 7a44dd88d5a5c72d3b54fb66412b82113e803577 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9895:
URL: https://github.com/apache/hudi/pull/9895#issuecomment-1772317834

   
   ## CI report:
   
   * 6eb38c5b78e4f53a7d21c15f5bb63cd528bc5132 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20414)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772317758

   
   ## CI report:
   
   * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417)
 
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


beyond1920 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366628134


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java:
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.utils;
+
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
+import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper;
+import org.apache.hudi.sink.bulk.BulkInsertWriteFunction;
+import org.apache.hudi.sink.bulk.RowDataKeyGen;
+import org.apache.hudi.sink.event.WriteMetadataEvent;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.api.common.ExecutionConfig;
+import org.apache.flink.api.common.functions.MapFunction;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.memory.MemoryManager;
+import 
org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.runtime.operators.testutils.MockEnvironment;
+import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder;
+import org.apache.flink.streaming.api.graph.StreamConfig;
+import 
org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway;
+import org.apache.flink.streaming.runtime.tasks.StreamTask;
+import org.apache.flink.streaming.util.MockStreamTaskBuilder;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.NoSuchElementException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance 
for testing.
+ *
+ * @param  Input type
+ */
+public class BulkInsertFunctionWrapper implements TestFunctionWrapper {
+  private final Configuration conf;
+  private final RowType rowType;
+  private final RowType rowTypeWithFileId;
+
+  private final MockStreamingRuntimeContext runtimeContext;
+  private final MockOperatorEventGateway gateway;
+  private final MockOperatorCoordinatorContext coordinatorContext;
+  private final StreamWriteOperatorCoordinator coordinator;
+  private final MockStateInitializationContext stateInitializationContext;
+
+  private final boolean asyncClustering;
+  private ClusteringFunctionWrapper clusteringFunctionWrapper;
+
+  /**
+   * Bulk insert write function.
+   */
+  private BulkInsertWriteFunction writeFunction;
+  private MapFunction mapFunction;
+  private Map bucketIdToFileId;
+
+  public BulkInsertFunctionWrapper(String tablePath, Configuration conf) 
throws Exception {
+IOManager ioManager = new IOManagerAsync();
+MockEnvironment environment = new MockEnvironmentBuilder()
+.setTaskName("mockTask")
+.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE)
+.setIOManager(ioManager)
+.build();
+this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, 
environment);
+this.gateway = new MockOperatorEventGateway();
+this.conf = conf;
+this.rowType = (RowType) 
AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType();
+this.rowTypeWithFileId = 
BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType);
+// one function
+this.coordinatorContext = new MockOperatorCoordinatorContext(new 
OperatorID(), 1);
+this.coordinator = new StreamWriteOperatorCoordinator(conf, 
this.coordinatorContext);
+this.stateInitializationContext = new MockStateInitializationContext();
+this.asyncClustering = OptionsResolver.needsAsyncClustering(conf);
+StreamConfig streamConfig

Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


beyond1920 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366622645


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
 return props.getInteger(WRITES_FILEID_ENCODING, 
HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
-  public boolean isNonBlockingConcurrencyControl() {
-return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-&& isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+// Skip to resolve conflict for non bulk_insert operation if using 
non-blocking concurrency control
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && 
isSimpleBucketIndex()) {
+return operationType == WriteOperationType.UNKNOWN || operationType == 
WriteOperationType.BULK_INSERT;
+  } else {
+return true;

Review Comment:
   `UPSERT` would return `false`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772257235

   
   ## CI report:
   
   * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406)
 
   * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417)
 
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20420)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772245210

   
   ## CI report:
   
   * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406)
 
   * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417)
 
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   * f8cee7222d7e2b9aa50d173d8d5d24b07c062d8e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-6946) Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata

2023-10-20 Thread xi chaomin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1063#comment-1063
 ] 

xi chaomin edited comment on HUDI-6946 at 10/20/23 7:38 AM:


This is caused by the format of the value in record key and col stats.

!WX20231019-094414.png!


was (Author: xichaomin):
This is caused by the format of the value in record key and col stats are 
different.

!WX20231019-094414.png!

> Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata
> --
>
> Key: HUDI-6946
> URL: https://issues.apache.org/jira/browse/HUDI-6946
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, writer-core
>Affects Versions: 0.13.1, 0.12.3, 0.14.0
>Reporter: Aditya Goenka
>Assignee: xi chaomin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.4, 0.14.1, 0.13.2
>
> Attachments: WX20231019-094414.png
>
>
> Github Issue - 
> [https://github.com/apache/hudi/issues/9870]
>  
> Code to Reproduce - 
> ```
> import uuid
> # assumes an existing SparkSession named `spark` (e.g. a pyspark shell)
> from pyspark.sql.functions import col, concat, lit, max, min, substring, desc
> COW_TABLE_NAME="table_duplicates"
> PARTITION_FIELD = "year,month"
> PRECOMBINE_FIELD = "timestamp"
> COW_TABLE_LOCATION="file:///tmp/issue_9870_" + str(uuid.uuid4())
> hudi_options_opt = {
> "hoodie.table.name": COW_TABLE_NAME,
> "hoodie.table.type": "COPY_ON_WRITE",
> "hoodie.index.type": "BLOOM",
> "hoodie.datasource.write.recordkey.field": "id",
> "hoodie.datasource.write.partitionpath.field": PARTITION_FIELD,
> "hoodie.datasource.write.precombine.field": PRECOMBINE_FIELD,
> "hoodie.datasource.write.hive_style_partitioning": "true",
> "hoodie.metadata.enable": "true",
> "hoodie.bloom.index.use.metadata": "true",
> "hoodie.metadata.index.column.stats.enable": "true",
> "hoodie.parquet.small.file.limit": -1
> }
> COW_TABLE_LOCATION="file:///tmp/issue_9870_" + str(uuid.uuid4())
> inputDF = spark.createDataFrame(
> [
> ('1', "1", '1',2020,1),
> ('2', "1", '1',2020,1),
> ('3', "1", '1',2020,1)
> ],
> ["id", "value", "timestamp","year","month"]
> )
> (inputDF.write.format("org.apache.hudi")
> .option("hoodie.datasource.write.operation", "upsert")
> .options(**hudi_options_opt)
> .mode("append")
> .save(COW_TABLE_LOCATION))
> upsertDF = spark.createDataFrame(
> [
> ('3', "2", '1',2020,1)
> ],
> ["id", "value", "timestamp","year","month"]
> )
> (upsertDF.write.format("org.apache.hudi")
> .option("hoodie.datasource.write.operation", "upsert")
> .options(**hudi_options_opt)
> .mode("append")
> .save(COW_TABLE_LOCATION))
> spark.read.format('org.apache.hudi').load(COW_TABLE_LOCATION).groupBy("year","month","_hoodie_record_key").count().orderBy(desc("count")).show(100,
>  False)
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366594355


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/TransactionUtils.java:
##
@@ -67,8 +68,9 @@ public static Option 
resolveWriteConflictIfAny(
   Option lastCompletedTxnOwnerInstant,
   boolean reloadActiveTimeline,
   Set pendingInstants) throws HoodieWriteConflictException {
-// Skip to resolve conflict if using non-blocking concurrency control
-if 
(config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl() && 
!config.isNonBlockingConcurrencyControl()) {
+// Skip to resolve conflict for non bulk_insert operation if using 
non-blocking concurrency control
+WriteOperationType operationType = thisCommitMetadata.orElse(new 
HoodieCommitMetadata()).getOperationType();
+if (config.needResolveWriteConflict(operationType)) {

Review Comment:
   Just passaround an `Option` should be fine? There is no need to new a 
`HoodieCommitMetadata()`
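   
   A minimal sketch of the suggested change, assuming Hudi's `Option` exposes 
`map`/`orElse` like `java.util.Optional` (`OperationTypeResolver` is just an 
illustrative holder, not an existing class):
   
   ```java
   import org.apache.hudi.common.model.HoodieCommitMetadata;
   import org.apache.hudi.common.model.WriteOperationType;
   import org.apache.hudi.common.util.Option;
   
   final class OperationTypeResolver {
     private OperationTypeResolver() {
     }
   
     // Read the operation type straight off the Option so no throwaway
     // HoodieCommitMetadata instance is created when the metadata is absent.
     static WriteOperationType resolve(Option<HoodieCommitMetadata> commitMetadata) {
       return commitMetadata
           .map(HoodieCommitMetadata::getOperationType)
           .orElse(WriteOperationType.UNKNOWN);
     }
   }
   ```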



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366592051


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java:
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.utils;
+
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
+import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper;
+import org.apache.hudi.sink.bulk.BulkInsertWriteFunction;
+import org.apache.hudi.sink.bulk.RowDataKeyGen;
+import org.apache.hudi.sink.event.WriteMetadataEvent;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.api.common.ExecutionConfig;
+import org.apache.flink.api.common.functions.MapFunction;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.memory.MemoryManager;
+import 
org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.runtime.operators.testutils.MockEnvironment;
+import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder;
+import org.apache.flink.streaming.api.graph.StreamConfig;
+import 
org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway;
+import org.apache.flink.streaming.runtime.tasks.StreamTask;
+import org.apache.flink.streaming.util.MockStreamTaskBuilder;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.NoSuchElementException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance 
for testing.
+ *
+ * @param  Input type
+ */
+public class BulkInsertFunctionWrapper implements TestFunctionWrapper {
+  private final Configuration conf;
+  private final RowType rowType;
+  private final RowType rowTypeWithFileId;
+
+  private final MockStreamingRuntimeContext runtimeContext;
+  private final MockOperatorEventGateway gateway;
+  private final MockOperatorCoordinatorContext coordinatorContext;
+  private final StreamWriteOperatorCoordinator coordinator;
+  private final MockStateInitializationContext stateInitializationContext;
+
+  private final boolean asyncClustering;
+  private ClusteringFunctionWrapper clusteringFunctionWrapper;
+
+  /**
+   * Bulk insert write function.
+   */
+  private BulkInsertWriteFunction writeFunction;
+  private MapFunction mapFunction;
+  private Map bucketIdToFileId;
+
+  public BulkInsertFunctionWrapper(String tablePath, Configuration conf) 
throws Exception {
+IOManager ioManager = new IOManagerAsync();
+MockEnvironment environment = new MockEnvironmentBuilder()
+.setTaskName("mockTask")
+.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE)
+.setIOManager(ioManager)
+.build();
+this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, 
environment);
+this.gateway = new MockOperatorEventGateway();
+this.conf = conf;
+this.rowType = (RowType) 
AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType();
+this.rowTypeWithFileId = 
BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType);
+// one function
+this.coordinatorContext = new MockOperatorCoordinatorContext(new 
OperatorID(), 1);
+this.coordinator = new StreamWriteOperatorCoordinator(conf, 
this.coordinatorContext);
+this.stateInitializationContext = new MockStateInitializationContext();
+this.asyncClustering = OptionsResolver.needsAsyncClustering(conf);
+StreamConfig streamConfig

Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


danny0405 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366593166


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
 return props.getInteger(WRITES_FILEID_ENCODING, 
HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
-  public boolean isNonBlockingConcurrencyControl() {
-return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-&& isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+// Skip to resolve conflict for non bulk_insert operation if using 
non-blocking concurrency control
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && 
isSimpleBucketIndex()) {
+return operationType == WriteOperationType.UNKNOWN || operationType == 
WriteOperationType.BULK_INSERT;
+  } else {
+return true;

Review Comment:
   I'm confused, for UPSERT it returns true, and that means we need to resolve 
conflicts right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated [hudi]

2023-10-20 Thread via GitHub


boneanxs commented on issue #9615:
URL: https://github.com/apache/hudi/issues/9615#issuecomment-1772233812

   @KnightChess Have you seen any logs like this on the executor side when this 
issue happens?
   
   ```java
   23/03/23 02:28:45 INFO HoodieMergeHandle: MaxMemoryPerPartitionMerge => 
1073741824
   23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: another attempt succeeded
   23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: Stage finished
   23/03/23 02:28:47 INFO HoodieMergeHandle: Number of entries in 
MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of 
entries in BitCaskDiskMap => 0, Size of file spilled to disk => 0
   23/03/23 02:28:47 INFO HoodieMergeHandle: partitionPath:grass_region=test, 
fileId to be merged:d3ee8406-4011-44a4-8913-8be0349a6686-0
   ```
   
   From our side, we actually see Hudi ignore this kill signal and continue 
writing (`Executor is trying to kill task`). So there are actually two issues:
   1. The task should fail immediately once it receives the kill signal.
   2. How to handle duplicate files if the reconcile-commits step finishes while 
the task is still writing.
   
   I think if 1) is handled correctly, we can add an extra clean step (to clean 
up files that were already written) on the task side. The extra clean step 
should solve most cases, but may not solve all of them.
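   
   A rough sketch of that extra clean step (everything here is a placeholder, 
not a Hudi API; it assumes a Hadoop `FileSystem` handle and Spark's 
`TaskContext#killTaskIfInterrupted`):
   
   ```java
   import java.io.IOException;
   
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.spark.TaskContext;
   
   final class KilledTaskCleaner {
     private KilledTaskCleaner() {
     }
   
     // Run the write, surface a pending kill before the file is treated as
     // complete, and on failure best-effort delete the partial file so a later
     // reconcile step does not pick up a broken parquet file.
     static void writeWithCleanup(FileSystem fs, Path targetFile, Runnable doWrite) throws IOException {
       try {
         doWrite.run();
         TaskContext ctx = TaskContext.get();
         if (ctx != null) {
           ctx.killTaskIfInterrupted();
         }
       } catch (RuntimeException e) {
         if (fs.exists(targetFile)) {
           fs.delete(targetFile, false);
         }
         throw e;
       }
     }
   }
   ```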


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9892:
URL: https://github.com/apache/hudi/pull/9892#issuecomment-1772233593

   
   ## CI report:
   
   * d2fd9c42f3994b5b5ba49e9c160f91add1f4aa96 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20406)
 
   * 721f4ef35c449190b9729a72f03c21cc81a3e0d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20417)
 
   * 8a3d6d6da7b0df80240d8621c4de9469a7efb1cf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.blo… [hudi]

2023-10-20 Thread via GitHub


hudi-bot commented on PR #9886:
URL: https://github.com/apache/hudi/pull/9886#issuecomment-1772233448

   
   ## CI report:
   
   * c3e2763136ca7108adc32162d093abb58b143363 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20394)
 
   * 9456e57d5d87ce8e2db946c03b2075e2d45b6826 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20419)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


beyond1920 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366571723


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/BulkInsertFunctionWrapper.java:
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.utils;
+
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.sink.StreamWriteOperatorCoordinator;
+import org.apache.hudi.sink.bucket.BucketBulkInsertWriterHelper;
+import org.apache.hudi.sink.bulk.BulkInsertWriteFunction;
+import org.apache.hudi.sink.bulk.RowDataKeyGen;
+import org.apache.hudi.sink.event.WriteMetadataEvent;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.api.common.ExecutionConfig;
+import org.apache.flink.api.common.functions.MapFunction;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.runtime.io.disk.iomanager.IOManager;
+import org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync;
+import org.apache.flink.runtime.jobgraph.OperatorID;
+import org.apache.flink.runtime.memory.MemoryManager;
+import 
org.apache.flink.runtime.operators.coordination.MockOperatorCoordinatorContext;
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+import org.apache.flink.runtime.operators.testutils.MockEnvironment;
+import org.apache.flink.runtime.operators.testutils.MockEnvironmentBuilder;
+import org.apache.flink.streaming.api.graph.StreamConfig;
+import 
org.apache.flink.streaming.api.operators.collect.utils.MockOperatorEventGateway;
+import org.apache.flink.streaming.runtime.tasks.StreamTask;
+import org.apache.flink.streaming.util.MockStreamTaskBuilder;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.NoSuchElementException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * A wrapper class to manipulate the {@link BulkInsertWriteFunction} instance 
for testing.
+ *
+ * @param  Input type
+ */
+public class BulkInsertFunctionWrapper implements TestFunctionWrapper {
+  private final Configuration conf;
+  private final RowType rowType;
+  private final RowType rowTypeWithFileId;
+
+  private final MockStreamingRuntimeContext runtimeContext;
+  private final MockOperatorEventGateway gateway;
+  private final MockOperatorCoordinatorContext coordinatorContext;
+  private final StreamWriteOperatorCoordinator coordinator;
+  private final MockStateInitializationContext stateInitializationContext;
+
+  private final boolean asyncClustering;
+  private ClusteringFunctionWrapper clusteringFunctionWrapper;
+
+  /**
+   * Bulk insert write function.
+   */
+  private BulkInsertWriteFunction writeFunction;
+  private MapFunction mapFunction;
+  private Map bucketIdToFileId;
+
+  public BulkInsertFunctionWrapper(String tablePath, Configuration conf) 
throws Exception {
+IOManager ioManager = new IOManagerAsync();
+MockEnvironment environment = new MockEnvironmentBuilder()
+.setTaskName("mockTask")
+.setManagedMemorySize(4 * MemoryManager.DEFAULT_PAGE_SIZE)
+.setIOManager(ioManager)
+.build();
+this.runtimeContext = new MockStreamingRuntimeContext(false, 1, 0, 
environment);
+this.gateway = new MockOperatorEventGateway();
+this.conf = conf;
+this.rowType = (RowType) 
AvroSchemaConverter.convertToDataType(StreamerUtil.getSourceSchema(conf)).getLogicalType();
+this.rowTypeWithFileId = 
BucketBulkInsertWriterHelper.rowTypeWithFileId(rowType);
+// one function
+this.coordinatorContext = new MockOperatorCoordinatorContext(new 
OperatorID(), 1);
+this.coordinator = new StreamWriteOperatorCoordinator(conf, 
this.coordinatorContext);
+this.stateInitializationContext = new MockStateInitializationContext();
+this.asyncClustering = OptionsResolver.needsAsyncClustering(conf);
+StreamConfig streamConfig

Re: [PR] [HUDI-6962] Correct the behavior of bulk insert for NB-CC [hudi]

2023-10-20 Thread via GitHub


beyond1920 commented on code in PR #9896:
URL: https://github.com/apache/hudi/pull/9896#discussion_r1366565797


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
 return props.getInteger(WRITES_FILEID_ENCODING, 
HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
-  public boolean isNonBlockingConcurrencyControl() {
-return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-&& isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+// Skip to resolve conflict for non bulk_insert operation if using 
non-blocking concurrency control
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && 
isSimpleBucketIndex()) {
+return operationType == WriteOperationType.UNKNOWN || operationType == 
WriteOperationType.BULK_INSERT;
+  } else {
+return true;

Review Comment:
   > So for `UPSERT` we also needs to resolve conflicts?
   
   No, `UPSERT` does not need conflict resolution.
   
   > Can we detect the conflict in such scenario?
   
   The first case: t1.commit would fail to commit because of the conflict.
   The second case: both t1 and t2 could commit successfully.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -2616,10 +2617,17 @@ public Integer getWritesFileIdEncoding() {
 return props.getInteger(WRITES_FILEID_ENCODING, 
HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID);
   }
 
-  public boolean isNonBlockingConcurrencyControl() {
-return getTableType().equals(HoodieTableType.MERGE_ON_READ)
-&& getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()
-&& isSimpleBucketIndex();
+  public boolean needResolveWriteConflict(WriteOperationType operationType) {
+// Skip to resolve conflict for non bulk_insert operation if using 
non-blocking concurrency control
+if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  if (getTableType().equals(HoodieTableType.MERGE_ON_READ) && 
isSimpleBucketIndex()) {
+return operationType == WriteOperationType.UNKNOWN || operationType == 
WriteOperationType.BULK_INSERT;
+  } else {
+return true;

Review Comment:
   > So for `UPSERT` we also needs to resolve conflicts?
   
   No, `UPSERT` does not need conflict resolution.
   
   > Can we detect the conflict in such scenario?
   
   The first case: t1.commit would fail to commit because of the conflict.
   The second case: both t1 and t2 could commit successfully.
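   
   For illustration, the intended behaviour for a MERGE_ON_READ table with a 
simple bucket index under optimistic concurrency could be summarized like this 
(method and enum names follow the diff above; `writeConfig` is an assumed 
pre-built config):
   
   ```java
   import org.apache.hudi.common.model.WriteOperationType;
   import org.apache.hudi.config.HoodieWriteConfig;
   
   final class NbccConflictCheckExample {
     private NbccConflictCheckExample() {
     }
   
     // writeConfig is assumed to target a MOR table with a simple bucket index
     // and optimistic concurrency enabled, i.e. the NB-CC branch in the diff.
     static void illustrate(HoodieWriteConfig writeConfig) {
       assert writeConfig.needResolveWriteConflict(WriteOperationType.BULK_INSERT); // still resolved
       assert writeConfig.needResolveWriteConflict(WriteOperationType.UNKNOWN);     // treated conservatively
       assert !writeConfig.needResolveWriteConflict(WriteOperationType.UPSERT);     // skipped under NB-CC
     }
   }
   ```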



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6878] Fix Overwrite error when ingest multiple tables [hudi]

2023-10-20 Thread via GitHub


stream2000 commented on PR #9749:
URL: https://github.com/apache/hudi/pull/9749#issuecomment-1772205039

   > This new strategy looks better to me. I see that some multiwriter tests 
are failing in azure.
   
   The CI is green now~ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org