Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758945698

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 85bf27abe36ef2a6500ed323e64d6598649c95c2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20298)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758932211

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20295)
 
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758889101

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294)
 
   * 85bf27abe36ef2a6500ed323e64d6598649c95c2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20298)
 
   
   



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1356029231


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMultiFileFormatRelation.scala:
##
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.HoodieBaseRelation.projectReader
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.HoodieMultiFileFormatRelation.{createPartitionedFile, inferFileFormat}
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieLogFile}
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Expression
+import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile}
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * Base split for all Hoodie multi-file format relations.
+ */
+case class HoodieMultiFileFormatSplit(baseFile: Option[PartitionedFile],
+  logFiles: List[HoodieLogFile]) extends HoodieFileSplit
+
+/**
+ * Base relation to handle table with multiple base file formats.
+ */
+abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext,
+ override val metaClient: HoodieTableMetaClient,

Review Comment:
   What is the reason we need a new relation abstraction here? The base file 
format can always be inferred from the file extension, right?
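
A minimal sketch of the extension-based inference the comment refers to. The `BaseFormat` enum and `FormatInference` class are hypothetical stand-ins for illustration, not Hudi's actual `HoodieFileFormat` API, and the extension list is an assumption.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical stand-in for a base-file-format enum; not Hudi's real API. */
enum BaseFormat {
  PARQUET(".parquet"), ORC(".orc"), HFILE(".hfile");

  private final String extension;

  BaseFormat(String extension) {
    this.extension = extension;
  }

  String extension() {
    return extension;
  }
}

final class FormatInference {
  private FormatInference() {}

  /** Returns the format whose extension the path ends with, or empty if none matches. */
  static Optional<BaseFormat> inferFromPath(String path) {
    String lower = path.toLowerCase(Locale.ROOT);
    for (BaseFormat format : BaseFormat.values()) {
      if (lower.endsWith(format.extension())) {
        return Optional.of(format);
      }
    }
    return Optional.empty();
  }
}
```

With a helper like this, a reader could pick the per-file reader at split time without a table-level switch, which seems to be the point of the question.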






Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1356027186


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -516,6 +515,99 @@ abstract class HoodieBaseRelation(val sqlContext: 
SQLContext,
*/
   def updatePrunedDataSchema(prunedSchema: StructType): Relation
 
+  protected def createBaseFileReaders(tableSchema: HoodieTableSchema,
+  requiredSchema: HoodieTableSchema,
+  requestedColumns: Array[String],
+  requiredFilters: Seq[Filter],
+  optionalFilters: Seq[Filter] = Seq.empty,
+  baseFileFormat: HoodieFileFormat = 
tableConfig.getBaseFileFormat): HoodieMergeOnReadBaseFileReaders = {
+val (partitionSchema, dataSchema, requiredDataSchema) =
+  tryPrunePartitionColumns(tableSchema, requiredSchema)
+
+val fullSchemaReader = createBaseFileReader(
+  spark = sqlContext.sparkSession,
+  partitionSchema = partitionSchema,
+  dataSchema = dataSchema,
+  requiredDataSchema = dataSchema,
+  // This file-reader is used to read base file records, subsequently 
merging them with the records
+  // stored in delta-log files. As such, we have to read _all_ records 
from the base file, while avoiding
+  // applying any filtering _before_ we complete combining them w/ 
delta-log records (to make sure that
+  // we combine them correctly);
+  // As such only required filters could be pushed-down to such reader
+  filters = requiredFilters,
+  options = optParams,
+  // NOTE: We have to fork the Hadoop Config here as Spark will be 
modifying it
+  //   to configure Parquet reader appropriately
+  hadoopConf = embedInternalSchema(new Configuration(conf), 
internalSchemaOpt),
+  baseFileFormat = baseFileFormat
+)
+
+val requiredSchemaReader = createBaseFileReader(
+  spark = sqlContext.sparkSession,
+  partitionSchema = partitionSchema,
+  dataSchema = dataSchema,
+  requiredDataSchema = requiredDataSchema,
+  // This file-reader is used to read base file records, subsequently 
merging them with the records
+  // stored in delta-log files. As such, we have to read _all_ records 
from the base file, while avoiding
+  // applying any filtering _before_ we complete combining them w/ 
delta-log records (to make sure that
+  // we combine them correctly);
+  // As such only required filters could be pushed-down to such reader
+  filters = requiredFilters,
+  options = optParams,
+  // NOTE: We have to fork the Hadoop Config here as Spark will be 
modifying it
+  //   to configure Parquet reader appropriately
+  hadoopConf = embedInternalSchema(new Configuration(conf), 
requiredDataSchema.internalSchema),
+  baseFileFormat = baseFileFormat
+)
+
+// Check whether fields required for merging were also requested to be 
fetched
+// by the query:
+//- In case they were, there's no optimization we could apply here (we 
will have
+//to fetch such fields)
+//- In case they were not, we will provide 2 separate file-readers
+//a) One which would be applied to file-groups w/ delta-logs 
(merging)
+//b) One which would be applied to file-groups w/ no delta-logs or
+//   in case query-mode is skipping merging
+val mandatoryColumns = 
mandatoryFields.map(HoodieAvroUtils.getRootLevelFieldName)
+if (mandatoryColumns.forall(requestedColumns.contains)) {
+  HoodieMergeOnReadBaseFileReaders(
+fullSchemaReader = fullSchemaReader,
+requiredSchemaReader = requiredSchemaReader,
+requiredSchemaReaderSkipMerging = requiredSchemaReader
+  )
+} else {
+  val prunedRequiredSchema = {
+val unusedMandatoryColumnNames = 
mandatoryColumns.filterNot(requestedColumns.contains)
+val prunedStructSchema =
+  StructType(requiredDataSchema.structTypeSchema.fields
+.filterNot(f => unusedMandatoryColumnNames.contains(f.name)))
+
+HoodieTableSchema(prunedStructSchema, 
convertToAvroSchema(prunedStructSchema, tableName).toString)
+  }
+
+  val requiredSchemaReaderSkipMerging = createBaseFileReader(
+spark = sqlContext.sparkSession,
+partitionSchema = partitionSchema,
+dataSchema = dataSchema,
+requiredDataSchema = prunedRequiredSchema,
+// This file-reader is only used in cases when no merging is 
performed, therefore it's safe to push
+// down these filters to the base file readers
+filters = requiredFilters ++ optionalFilters,
+options = optParams,
+// NOTE: We have to fork the Hadoop Config here as Spark will be 
modifying it
+//   to configure Parquet reader 

Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1356025835


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -255,40 +255,47 @@ object DefaultSource {
 Option.empty
   }
 
-  (tableType, queryType, isBootstrappedTable) match {
-case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) |
- (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) |
- (MERGE_ON_READ, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) =>
+  val isMultipleBaseFileFormatsEnabled = metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled
+  (tableType, queryType, isBootstrappedTable, isMultipleBaseFileFormatsEnabled) match {
+case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |
+ (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false, true) =>
+  new HoodieMultiFileFormatCOWRelation(sqlContext, metaClient, parameters, userSchema, globPaths)
+case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) |

Review Comment:
   Can we handle the `isMultipleBaseFileFormatsEnabled` case separately?






Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1356023744


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -821,6 +820,8 @@ public static class PropertyBuilder {
 private String metadataPartitions;
 private String inflightMetadataPartitions;
 private String secondaryIndexesMetadata;
+private Boolean multipleBaseFileFormatsEnabled;
+private String baseFileFormats;

Review Comment:
   Can you explain at a high level why we need this config `baseFileFormats`? 
And why must it be a table config here? We could actually merge these two variables 
into one: for example, an empty string of base formats could represent `disabled`.
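
A rough sketch of the single-variable idea suggested above, where an empty value means the feature is disabled. The class name, the comma-separated encoding, and the upper-casing are assumptions for illustration only; they are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder for one config value that replaces a boolean flag plus a list. */
final class BaseFileFormatsConfig {
  private final List<String> formats;

  private BaseFileFormatsConfig(List<String> formats) {
    this.formats = Collections.unmodifiableList(formats);
  }

  /** Parses a comma-separated value; null, empty, or blank input yields "disabled". */
  static BaseFileFormatsConfig parse(String raw) {
    List<String> parsed = new ArrayList<>();
    if (raw != null) {
      for (String token : raw.split(",")) {
        String trimmed = token.trim();
        if (!trimmed.isEmpty()) {
          parsed.add(trimmed.toUpperCase(Locale.ROOT));
        }
      }
    }
    return new BaseFileFormatsConfig(parsed);
  }

  /** The flag is derived from the list instead of being stored separately. */
  boolean isMultipleBaseFileFormatsEnabled() {
    return !formats.isEmpty();
  }

  List<String> formats() {
    return formats;
  }
}
```

Deriving the flag from the list avoids the inconsistent state where the boolean is set but no formats are listed.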






Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1356020748


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:
##
@@ -866,11 +866,11 @@ public void validateInsertSchema() throws HoodieInsertException {
   }
 
   public HoodieFileFormat getBaseFileFormat() {
-return metaClient.getTableConfig().getBaseFileFormat();
-  }
-
-  public HoodieFileFormat getLogFileFormat() {
-return metaClient.getTableConfig().getLogFileFormat();
+HoodieTableConfig tableConfig = metaClient.getTableConfig();
+if (tableConfig.contains(HoodieTableConfig.BASE_FILE_FORMAT)) {
+  return metaClient.getTableConfig().getBaseFileFormat();
+}
+return config.getBaseFileFormat();

Review Comment:
   Should the table config have higher priority, or vice versa?
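
For reference, the precedence in the diff above (table config first, write config as fallback) can be sketched with plain maps. The key name, the default parameter, and the `BaseFileFormatResolver` class are hypothetical; the real code goes through `HoodieTableConfig` and the write config.

```java
import java.util.Map;

/** Hypothetical resolver illustrating "table config wins, write config is the fallback". */
final class BaseFileFormatResolver {
  // Illustrative key only; not necessarily the key Hudi uses.
  static final String KEY = "base.file.format";

  private BaseFileFormatResolver() {}

  static String resolve(Map<String, String> tableConfig,
                        Map<String, String> writeConfig,
                        String defaultFormat) {
    String fromTable = tableConfig.get(KEY);
    if (fromTable != null) {
      // The persisted table property takes priority when it is present.
      return fromTable;
    }
    // Otherwise fall back to the writer-supplied value, then to the default.
    return writeConfig.getOrDefault(KEY, defaultFormat);
  }
}
```

Swapping the two lookups would give the opposite priority, which is the alternative the question raises.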






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758861709

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294)
 
   * 85bf27abe36ef2a6500ed323e64d6598649c95c2 UNKNOWN
   
   



[jira] [Closed] (HUDI-6873) Clustering MOR applies base files after log files

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6873.
-
Fix Version/s: 1.0.0
   0.14.1
   Resolution: Fixed

> Clustering MOR applies base files after log files
> -
>
> Key: HUDI-6873
> URL: https://issues.apache.org/jira/browse/HUDI-6873
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, spark
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
>
> If the payload is OverwriteWithLatestAvroPayload, this matters: if the 
> base file record and the update have the same precombine value, the record in the 
> base file is used instead of the records from later writes.
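
The ordering problem described above can be illustrated with a simplified "latest wins" merge. This is a toy model under stated assumptions (a `long` ordering value, the incoming record wins ties), not Hudi's actual payload or merger code; `LatestWinsMerge` and `Rec` are hypothetical names.

```java
/** Toy model of a "latest wins" merge where the incoming record wins ties. */
final class LatestWinsMerge {
  private LatestWinsMerge() {}

  /** Minimal record: key, ordering (precombine) value, and a payload string. */
  static final class Rec {
    final String key;
    final long ordering;
    final String value;

    Rec(String key, long ordering, String value) {
      this.key = key;
      this.ordering = ordering;
      this.value = value;
    }
  }

  /** Keeps `existing` only if its ordering value is strictly greater than `incoming`'s. */
  static Rec merge(Rec existing, Rec incoming) {
    return existing.ordering > incoming.ordering ? existing : incoming;
  }

  public static void main(String[] args) {
    Rec base = new Rec("k1", 10L, "from-base-file");
    Rec log = new Rec("k1", 10L, "from-log-file");
    // Base applied first, log second: the later write is kept on a tie.
    System.out.println(merge(base, log).value);
    // Log applied first, base second: the base record is kept on a tie.
    System.out.println(merge(log, base).value);
  }
}
```

Because ties go to whichever record is applied last, applying base file records after the log records makes the older base record win, which matches the behavior reported in the issue.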



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6873] fix clustering mor (#9774)

2023-10-11 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 3c439a3a69f [HUDI-6873] fix clustering mor (#9774)
3c439a3a69f is described below

commit 3c439a3a69fb88dee551d94c8266e48b7c0d1e8f
Author: Jon Vexler 
AuthorDate: Wed Oct 11 23:04:55 2023 -0400

[HUDI-6873] fix clustering mor (#9774)

Currently during clustering of noncompacted mor filegroups with row writer 
disabled
(currently the default for clustering), the records in the base file are 
applied to the
log scanner after the log files have been scanned. If they have the same 
precombine,
the base file records will be chosen over the log file records. This commit 
mimics the
implementation in Iterators.scala to make the behavior consistent.

-

Co-authored-by: Jonathan Vexler <=>
---
 .../hudi/common/table/log/CachingIterator.java | 41 +++
 .../common/table/log/HoodieFileSliceReader.java| 75 +--
 .../hudi/common/table/log/LogFileIterator.java | 57 +++
 .../run/strategy/JavaExecutionStrategy.java|  4 +-
 .../MultipleSparkJobExecutionStrategy.java |  4 +-
 .../hudi/sink/clustering/ClusteringOperator.java   |  3 +-
 .../TestHoodieSparkMergeOnReadTableClustering.java |  2 +-
 .../apache/hudi/functional/TestMORDataSource.scala | 85 +-
 8 files changed, 243 insertions(+), 28 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java
new file mode 100644
index 000..d022b92ae22
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log;
+
+import java.util.Iterator;
+
+public abstract class CachingIterator<T> implements Iterator<T> {
+
+  protected T nextRecord;
+
+  protected abstract boolean doHasNext();
+
+  @Override
+  public final boolean hasNext() {
+return nextRecord != null || doHasNext();
+  }
+
+  @Override
+  public final T next() {
+T record = nextRecord;
+nextRecord = null;
+return record;
+  }
+
+}
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java
index fc3ef4b8d92..1aa2f21fcb2 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java
@@ -19,47 +19,80 @@
 
 package org.apache.hudi.common.table.log;
 
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodiePayloadProps;
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieClusteringException;
 import org.apache.hudi.io.storage.HoodieFileReader;
 
 import org.apache.avro.Schema;
 
 import java.io.IOException;
 import java.util.Iterator;
+import java.util.Map;
 import java.util.Properties;
 
-/**
- * Reads records from base file and merges any updates from log files and 
provides iterable over all records in the file slice.
- */
-public class HoodieFileSliceReader implements Iterator> {
+public class HoodieFileSliceReader extends LogFileIterator {
+  private Option> baseFileIterator;
+  private HoodieMergedLogRecordScanner scanner;
+  private Schema schema;
+  private Properties props;
 
-  private final Iterator> recordsIterator;
+  private TypedProperties payloadProps = new TypedProperties();
+  private Option> simpleKeyGenFieldsOpt;
+  Map records;
+  HoodieRecordMerger merger;
 
-  

Re: [PR] [HUDI-6873] fix clustering mor [hudi]

2023-10-11 Thread via GitHub


codope merged PR #9774:
URL: https://github.com/apache/hudi/pull/9774





Re: [PR] [HUDI-6873] fix clustering mor [hudi]

2023-10-11 Thread via GitHub


codope commented on PR #9774:
URL: https://github.com/apache/hudi/pull/9774#issuecomment-1758839664

   Landing this PR. Test failure is unrelated. The integ test failure should be 
fixed by #9843 





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1355986341


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieKeyBasedFileGroupRecordBuffer.java:
##
@@ -0,0 +1,250 @@
+/*
+ *
+ *  * Licensed to the Apache Software Foundation (ASF) under one or more
+ *  * contributor license agreements.  See the NOTICE file distributed with
+ *  * this work for additional information regarding copyright ownership.
+ *  * The ASF licenses this file to You under the Apache License, Version 2.0
+ *  * (the "License"); you may not use this file except in compliance with
+ *  * the License.  You may obtain a copy of the License at
+ *  *
+ *  *http://www.apache.org/licenses/LICENSE-2.0
+ *  *
+ *  * Unless required by applicable law or agreed to in writing, software
+ *  * distributed under the License is distributed on an "AS IS" BASIS,
+ *  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ *  * See the License for the specific language governing permissions and
+ *  * limitations under the License.
+ *
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.KeySpec;
+import org.apache.hudi.common.engine.HoodieReaderContext;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+
+import org.apache.avro.Schema;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.Map;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+public class HoodieKeyBasedFileGroupRecordBuffer implements 
HoodieFileGroupRecordBuffer {

Review Comment:
   Can we add some docs to these new classes?






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1355983980


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -102,6 +102,18 @@ public HoodieFileGroupReader(HoodieReaderContext 
readerContext,
 this.start = 0;
 this.length = Long.MAX_VALUE;
 this.baseFileIterator = new EmptyIterator<>();
+this.shouldUseRecordPosition = false;

Review Comment:
   Oh, I see. There is no need to keep 2 constructors here; we can instantiate 
the record buffer from its factory class with the given flag 
`shouldUseRecordPosition`.
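
A minimal sketch of the single-constructor shape suggested above, where a factory picks the buffer implementation from the flag. All type names here (`RecordBuffer`, `KeyBasedRecordBuffer`, `PositionBasedRecordBuffer`, `RecordBufferFactory`, `FileGroupReaderSketch`) are hypothetical; in particular, the existence of a position-based buffer is an assumption, since the thread only shows a key-based one.

```java
/** Hypothetical buffer abstraction; the real interface in the PR has more methods. */
interface RecordBuffer {
  String kind();
}

final class KeyBasedRecordBuffer implements RecordBuffer {
  @Override
  public String kind() {
    return "key-based";
  }
}

/** Assumed counterpart that would merge by record position instead of record key. */
final class PositionBasedRecordBuffer implements RecordBuffer {
  @Override
  public String kind() {
    return "position-based";
  }
}

final class RecordBufferFactory {
  private RecordBufferFactory() {}

  /** Chooses the implementation from the flag so callers need only one constructor. */
  static RecordBuffer create(boolean shouldUseRecordPosition) {
    return shouldUseRecordPosition ? new PositionBasedRecordBuffer() : new KeyBasedRecordBuffer();
  }
}

final class FileGroupReaderSketch {
  private final RecordBuffer recordBuffer;

  FileGroupReaderSketch(boolean shouldUseRecordPosition) {
    this.recordBuffer = RecordBufferFactory.create(shouldUseRecordPosition);
  }

  RecordBuffer recordBuffer() {
    return recordBuffer;
  }
}
```

Keeping the choice inside the factory means the reader has a single constructor and the flag is consulted in exactly one place.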






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1355981308


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -102,6 +102,18 @@ public HoodieFileGroupReader(HoodieReaderContext 
readerContext,
 this.start = 0;
 this.length = Long.MAX_VALUE;
 this.baseFileIterator = new EmptyIterator<>();
+this.shouldUseRecordPosition = false;

Review Comment:
   `shouldUseRecordPosition` is always false now; do we still need to keep it?






Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758826067

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * bfea4593820ae5257f93f822986ef58168f69dde Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289)
 
   * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20295)
 
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1355980764


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java:
##
@@ -81,11 +80,16 @@ private HoodieMergedLogRecordReader(HoodieReaderContext 
readerContext,
   Option partitionName,
   InternalSchema internalSchema,
   Option keyFieldOverride,
-  boolean enableOptimizedLogBlocksScan, 
HoodieRecordMerger recordMerger) {
+  boolean enableOptimizedLogBlocksScan,
+  HoodieRecordMerger recordMerger,
+  HoodieFileGroupRecordBuffer 
recordBuffer) {
 super(readerContext, fs, basePath, logFilePaths, readerSchema, 
latestInstantTime, readBlocksLazily, reverseReader, bufferSize,
-instantRange, withOperationField, forceFullScan, partitionName, 
internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan, recordMerger);
+instantRange, withOperationField, forceFullScan, partitionName, 
internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan,
+recordMerger, recordBuffer);
 this.records = new HashMap<>();
 this.scannedPrefixes = new HashSet<>();
+this.recordBuffer = recordBuffer;

Review Comment:
   Do we need to assign the record buffer 2 times?






Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758820588

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * bfea4593820ae5257f93f822986ef58168f69dde Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289)
 
   * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea UNKNOWN
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758820147

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758813978

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
 
   
   



[jira] [Assigned] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6793:
-

Assignee: Jonathan Vexler

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6790:
-

Assignee: Jonathan Vexler

> Support incremental read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-6897) Improve SimpleConcurrentFileWritesConflictResolutionStrategy for NB-CC

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6897:
-

Assignee: Sagar Sumit

> Improve SimpleConcurrentFileWritesConflictResolutionStrategy for NB-CC
> --
>
> Key: HUDI-6897
> URL: https://issues.apache.org/jira/browse/HUDI-6897
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> There is no need to throw a concurrent modification exception for the simple 
> strategy under NB-CC, because the compactor will eventually resolve the 
> conflicts instead.
> See the test case:
> {code:java}
> TestHoodieClientMultiWriter#testMultiWriterWithAsyncTableServicesWithConflict{code}





[jira] [Assigned] (HUDI-2461) Support lock free multi-writer for metadata table

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-2461:
-

Assignee: Sagar Sumit  (was: Vinoth Chandar)

> Support lock free multi-writer for metadata table
> -
>
> Key: HUDI-2461
> URL: https://issues.apache.org/jira/browse/HUDI-2461
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, multi-writer, writer-core
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>
> Even with the synchronous patch, we instantiate the metadata table in 
> single-writer mode only. 
> But we need to support async compaction and cleaning, so we need to think 
> about supporting multi-writer down the line. 
> Details:
> All writes to the metadata table happen within the data table lock, including 
> compaction and cleaning of the metadata table, since we run them inline. But 
> as we scale the metadata table infra with more indexes, we need to support 
> async compaction and cleaning, and therefore multi-writer support. 
> One possibility:
>  - Special transaction management for the metadata table. 
> Data table commits: all writes to the metadata table will be guarded by the 
> data table lock (regular writes, clustering, compaction, everything). Regular 
> writes will do the usual conflict resolution, whereas compaction and 
> clustering may not. 
> Metadata table commits: in general there won't be any conflict resolution for 
> the metadata table as a whole, but we will ensure that every commit happens 
> while holding a lock. Our presumption is that all conflict resolution has 
> already happened within the data table before a commit is made to the 
> metadata table, so no conflict resolution is needed there specifically. 
> Scheduling of compaction and cleaning will happen along with regular upserts, 
> and we will support async compaction and cleaning. When these async 
> operations are ready to commit to the metadata table, they will acquire the 
> lock, make the commit and release the lock. Only one writer will be in 
> progress during a metadata commit. 
>  
>  
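
The commit path described in this ticket (acquire the lock, make the commit, release the lock, with no conflict resolution on the metadata table itself) could be sketched roughly as follows. This is only an illustration in plain Java under stated assumptions: `MetadataCommitSketch`, its in-memory `timeline` list and the in-process `ReentrantLock` are hypothetical stand-ins, not Hudi classes and not Hudi's actual lock provider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the metadata table; not a Hudi class.
class MetadataCommitSketch {
    // In-memory list standing in for the metadata table timeline.
    private final List<String> timeline = new ArrayList<>();
    // One lock serializes every commit to the metadata table.
    private final ReentrantLock lock = new ReentrantLock();

    // Acquire the lock, make the commit, release the lock.
    // No conflict resolution happens here; it is assumed to have been
    // done on the data table before this point.
    void commit(String instant) {
        lock.lock();
        try {
            timeline.add(instant);
        } finally {
            lock.unlock();
        }
    }

    int commitCount() {
        lock.lock();
        try {
            return timeline.size();
        } finally {
            lock.unlock();
        }
    }

    // Two async table services (compaction and cleaning) commit concurrently.
    static int runDemo() {
        MetadataCommitSketch table = new MetadataCommitSketch();
        Thread compactor = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                table.commit("compaction-" + i);
            }
        });
        Thread cleaner = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                table.commit("clean-" + i);
            }
        });
        compactor.start();
        cleaner.start();
        try {
            compactor.join();
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return table.commitCount();
    }

    public static void main(String[] args) {
        System.out.println("commits recorded: " + runDemo());
    }
}
```

Because only one writer holds the lock at a time, no commit from either thread is lost even though neither writer checks for conflicts. A real deployment would presumably need a distributed lock rather than an in-process one; that part is outside this sketch.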





[jira] [Assigned] (HUDI-6480) Flink lockless multi-writer

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-6480:
-

Assignee: Danny Chen

> Flink lockless multi-writer
> ---
>
> Key: HUDI-6480
> URL: https://issues.apache.org/jira/browse/HUDI-6480
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>






[jira] [Updated] (HUDI-6480) Flink lockless multi-writer

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6480:
--
Status: In Progress  (was: Open)

> Flink lockless multi-writer
> ---
>
> Key: HUDI-6480
> URL: https://issues.apache.org/jira/browse/HUDI-6480
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>






[jira] [Updated] (HUDI-6480) Flink lockless multi-writer

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6480:
--
Fix Version/s: 1.0.0
   (was: 0.14.1)

> Flink lockless multi-writer
> ---
>
> Key: HUDI-6480
> URL: https://issues.apache.org/jira/browse/HUDI-6480
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-6781) Add deltacommit timestamp to log file name

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6781.
-
Resolution: Done

> Add deltacommit timestamp to log file name
> --
>
> Key: HUDI-6781
> URL: https://issues.apache.org/jira/browse/HUDI-6781
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-6641) Remove the log append and always uses the current instant time in file name

2023-10-11 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6641.
-
Resolution: Done

> Remove the log append and always uses the current instant time in file name
> ---
>
> Key: HUDI-6641
> URL: https://issues.apache.org/jira/browse/HUDI-6641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>






Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


stream2000 commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758786265

   @danny0405 Hi Danny,  I have addressed all comments, PTAL~ 





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


stream2000 commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758785514

   @hudi-bot run azure





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758771641

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294)
 
   
   



Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on issue #9172:
URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758765832

   Here is the fix I found: https://github.com/apache/hudi/pull/8082





Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on issue #9172:
URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758764305

   cc @CTTY from the AWS team, do you have any thoughts that could help here?





Re: [I] [SUPPORT] Unstable Execution Time and Many RequestHandler WARN Logs [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on issue #8100:
URL: https://github.com/apache/hudi/issues/8100#issuecomment-1758753847

   Did you try the latest release? The fs view should perform better there.





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758735292

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 57442675289e3db3252449725380a72977f7e7fa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290)
 
   * c8b824bf84288173cdff2b95e4115869154417fe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20293)
 
   * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294)
 
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758729502

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 57442675289e3db3252449725380a72977f7e7fa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290)
 
   * c8b824bf84288173cdff2b95e4115869154417fe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20293)
 
   * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 UNKNOWN
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758696229

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 57442675289e3db3252449725380a72977f7e7fa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290)
 
   * c8b824bf84288173cdff2b95e4115869154417fe UNKNOWN
   
   



Re: [I] [SUPPORT] Hudi MERGE INTO on Glue fails when using functions such as (filter, zip_with) on array of structs [hudi]

2023-10-11 Thread via GitHub


rita-ihnatsyeva commented on issue #9838:
URL: https://github.com/apache/hudi/issues/9838#issuecomment-1758607355

   @ad1happy2go I tried your code in the prod environment and it works fine, so 
I guess something is wrong with my input data, though for now I can't tell 
what. It doesn't seem like a reproducible bug.





Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]

2023-10-11 Thread via GitHub


bkosuru commented on issue #9172:
URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758605903

   Same exception with Hudi 0.14.0  and Spark 3.3.2. (GCP serverless 1.1)
   @danny0405 you said the issue should be resolved with Hudi 0.14.0. Do you 
know why it is still broken?
   
   Works fine with Hudi 0.14.0 and Spark 3.4.0 though. We have to upgrade to 
GCP serverless 2.1
   





Re: [I] [SUPPORT] Error when running a pipeline after an interrupt [hudi]

2023-10-11 Thread via GitHub


ingridymartinss commented on issue #9518:
URL: https://github.com/apache/hudi/issues/9518#issuecomment-1758174848

   We haven't figured it out yet. :(





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1355445308


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##
@@ -51,4 +51,11 @@ public class HoodieReaderConfig {
   .sinceVersion("0.13.0")
   .withDocumentation("New optimized scan for log blocks that handles all 
multi-writer use-cases while appending to log files. "
   + "It also differentiates original blocks written by ingestion 
writers and compacted blocks written log compaction.");
+
+  public static final ConfigProperty FILE_GROUP_READER_ENABLED = 
ConfigProperty
+  .key("hoodie.file.group.reader.enabled")
+  .defaultValue(true)
+  .markAdvanced()
+  .sinceVersion("1.0.0")

Review Comment:
   I set it to true just for testing purposes; I will make it false. Sorry for 
the confusion.
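
For illustration, the intended behaviour after this change (the reader stays off unless explicitly enabled) can be sketched with plain `java.util.Properties`. The key comes from the diff above; the `FileGroupReaderFlagSketch` class and the `useFileGroupReader` helper are hypothetical and deliberately do not use Hudi's `ConfigProperty` API.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Hudi.
class FileGroupReaderFlagSketch {
    // Key taken from the diff under review.
    static final String FILE_GROUP_READER_ENABLED = "hoodie.file.group.reader.enabled";
    // Default assumed to be false, as promised in the review comment.
    static final boolean DEFAULT_ENABLED = false;

    // Returns the configured value, falling back to the default when the key is absent.
    static boolean useFileGroupReader(Properties props) {
        String raw = props.getProperty(FILE_GROUP_READER_ENABLED, String.valueOf(DEFAULT_ENABLED));
        return Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("unset: " + useFileGroupReader(props));
        props.setProperty(FILE_GROUP_READER_ENABLED, "true");
        System.out.println("explicitly enabled: " + useFileGroupReader(props));
    }
}
```

With a false default, existing jobs keep the old read path, and tests that exercise the new reader have to opt in explicitly.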






[jira] [Commented] (HUDI-6928) Support position based merging in HoodieFileGroupReader

2023-10-11 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774132#comment-17774132
 ] 

Lin Liu commented on HUDI-6928:
---

The PR is ready, but I still need to address some critical comments on refactoring.

> Support position based merging in HoodieFileGroupReader
> ---
>
> Key: HUDI-6928
> URL: https://issues.apache.org/jira/browse/HUDI-6928
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6928) Support position based merging in HoodieFileGroupReader

2023-10-11 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6928:
--
Status: Patch Available  (was: In Progress)

> Support position based merging in HoodieFileGroupReader
> ---
>
> Key: HUDI-6928
> URL: https://issues.apache.org/jira/browse/HUDI-6928
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6928) Support position based merging in HoodieFileGroupReader

2023-10-11 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6928:
--
Status: In Progress  (was: Open)

> Support position based merging in HoodieFileGroupReader
> ---
>
> Key: HUDI-6928
> URL: https://issues.apache.org/jira/browse/HUDI-6928
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query

2023-10-11 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6786:
--
Status: Patch Available  (was: In Progress)

> Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR 
> Snapshot Query
> --
>
> Key: HUDI-6786
> URL: https://issues.apache.org/jira/browse/HUDI-6786
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Goal: When `NewHoodieParquetFileFormat` is enabled with 
> `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR 
> Snapshot query should use HoodieFileGroupReader.  All relevant tests on basic 
> MOR snapshot query should pass (except for the caveats in the current 
> HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in 
> this EPIC).
> The query logic is implemented in 
> `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the 
> following code for MOR snapshot query:
> {code:java}
> else {
>   if (logFiles.nonEmpty) {
>     val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
>     buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
>       requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
>   } else {
>     throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
>     //baseFileReader(baseFile)
>   }
> {code}
> `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, 
> with a new config `hoodie.read.use.new.file.group.reader`, by passing in the 
> correct base and log file list.





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757979131

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
 
   
   



Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9843:
URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757960568

   
   ## CI report:
   
   * 6b231840828f6b70965f4015de976500918f5703 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20292)
 
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757957453

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
 
   
   



Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan closed issue #9827: MacOs M1  Exception in thread "main" 
java.io.IOException: Could not load schema provider class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider
URL: https://github.com/apache/hudi/issues/9827





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757888560

   Thank you for helping me. The problem has been resolved, so I will close 
this issue.





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757878099

   > m1 chip does not have good compatibility, can you try arm64 chip instead?
   
   Thank you; further suggestions are welcome. Note that the M1 chip is itself based on the arm64 architecture.





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


stream2000 commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757863389

   @hudi-bot run azure





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757864136

   I analyzed the cause of this issue based on the source code; I am sharing it 
here in the hope that it helps others.
   
   On the master branch, creating the docker containers copies docker/demo and 
docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar into the 
/var/hoodie/ws/docker/.. directory in the adhoc-2 container, so spark-submit 
uses /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar. 
That jar was probably compiled by me from the branch 13.0 code. The failure is 
therefore most likely caused by a version mismatch: I never compiled the master 
branch successfully on my machine, and my local hoodie-utilities.jar had been 
built earlier from branch 13.0.
   https://github.com/apache/hudi/assets/19281198/12a3e983-344d-44bb-967e-18b1d270660c
   
   
   





Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-11 Thread via GitHub


ad1happy2go commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757851940

   @noahtaite Yes, converting to integer type before saving will work. 





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757850550

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
 
   
   



Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757792367

   > @hanrongMan I was able to run the complete docker demo with 0.14.0 on M1 
and didn't face any issues. What build command are you using? As the default 
profiles have changed, can you try with this -
   > 
   > mvn -U clean package -Pintegration-tests -DskipTests -Dscala-2.11 -Dspark2.4
   
   Thank you. I switched to branch 14.0, compiled using your command, and it 
ran successfully.





Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-11 Thread via GitHub


noahtaite commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757792128

   Hello @danny0405, we are ingesting from ~300 tables in ~2k customer databases 
which are not fully constrained - we can expect to see null values in many 
fields. In this case I believe it is a missing linking ID.
   
   @ad1happy2go happy to hear you have reproduced the issue; looking forward to 
hearing about a workaround and a timeline for the fix.
   
   Two things to note:
   1 - I also reproduced this issue with ByteType, which Hudi seems to handle 
exactly the same as ShortType.
   2 - Our current workaround (temp + hacky) is to convert all incoming 
ShortType + ByteType to IntegerType before saving to Hudi. This is working in 
our dev environment.
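
The workaround in point 2 can be sketched without Spark as follows. This is a minimal, hypothetical model, assuming a schema represented as a column-name to type-name map; `NarrowIntWidening`, `widen`, `widenSchema`, and `widenValue` are illustrative names and not Hudi or Spark APIs. In a real Spark job the equivalent step would be casting each flagged column to IntegerType before the write.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, Spark-free model of the workaround: widen narrow integer columns before writing. */
final class NarrowIntWidening {

  /** Returns the type name to write: "byte" and "short" become "int", everything else is unchanged. */
  static String widen(String typeName) {
    switch (typeName) {
      case "byte":
      case "short":
        return "int";
      default:
        return typeName;
    }
  }

  /** Applies widen() to every column of a schema (column name -> type name), preserving column order. */
  static Map<String, String> widenSchema(Map<String, String> schema) {
    Map<String, String> widened = new LinkedHashMap<>();
    for (Map.Entry<String, String> column : schema.entrySet()) {
      widened.put(column.getKey(), widen(column.getValue()));
    }
    return widened;
  }

  /** Widens a single nullable value; null stays null, so unconstrained source columns remain safe. */
  static Integer widenValue(Number value) {
    return value == null ? null : value.intValue();
  }
}
```

The point of the sketch is that widening happens on the incoming data before it reaches the writer, and that null values pass through untouched.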





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757789513

   > @hanrongMan I was able to run the complete docker demo with 0.14.0 on M1 
and didn't face any issues. What build command are you using? As the default 
profiles have changed, can you try with this -
   > 
   > mvn -U clean package -Pintegration-tests -DskipTests -Dscala-2.11 
-Dspark2.4
   
   Thank you. I switched to branch 14.0, compiled using your command, and it ran 
successfully.





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan commented on issue #9827:
URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757781453

   > m1 chip does not have good compatibility, can you try arm64 chip instead?
   
   apple M1 chip 





Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]

2023-10-11 Thread via GitHub


hanrongMan closed issue #9827: MacOs M1  Exception in thread "main" 
java.io.IOException: Could not load schema provider class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider
URL: https://github.com/apache/hudi/issues/9827





Re: [I] [SUPPORT] Glue 4.0 Hudi 0.12.1 PreCommit validator i.e SqlQueryEqualityPreCommitValidator is not working [hudi]

2023-10-11 Thread via GitHub


ad1happy2go commented on issue #9183:
URL: https://github.com/apache/hudi/issues/9183#issuecomment-1757756659

   @abhisheksahani91 Do you have any more issues/doubts around this?





Re: [I] [SUPPORT] Doubt about handling old data arrival in hudi [hudi]

2023-10-11 Thread via GitHub


codope closed issue #8576: [SUPPORT] Doubt about handling old data arrival in 
hudi
URL: https://github.com/apache/hudi/issues/8576





Re: [I] [SUPPORT] Docker Demo Issue With Current master(0.14.0-SNAPSHOT) [hudi]

2023-10-11 Thread via GitHub


codope closed issue #8447: [SUPPORT] Docker Demo Issue With Current 
master(0.14.0-SNAPSHOT)
URL: https://github.com/apache/hudi/issues/8447





Re: [I] [SUPPORT] Docker Demo Issue With Current master(0.14.0-SNAPSHOT) [hudi]

2023-10-11 Thread via GitHub


ad1happy2go commented on issue #8447:
URL: https://github.com/apache/hudi/issues/8447#issuecomment-1757712285

   @agrawalreetika I confirmed, docker demo is working fine with latest release 
0.14.0 also. Closing this issue. Please reopen in case of any concerns. Thanks.





[I] [SUPPORT] Slow performance when inserting to a lot of partitions and metadata enabled [hudi]

2023-10-11 Thread via GitHub


VitoMakarevich opened a new issue, #9848:
URL: https://github.com/apache/hudi/issues/9848

   **Describe the problem you faced**
   
   Recently we enabled metadata for one big table. This table includes ~700k 
partitions in the test environment, and a usual insert affects ~600 of them. 
The problem is even more pronounced when the timeline server is enabled.
   The issue shows up as a slow `getting small files` stage - e.g. 
for 600 insert-affected partitions (600 tasks in this stage) it takes ~2-3 
minutes, and for 12k tasks (partitions affected) - ~40 minutes.
   Important details:
   1. Metadata is enabled
   2. Timeline server is enabled
   3. Metadata table `hfile` file size - about 40MB
   4. Number of log files: "hoodie.metadata.compact.max.delta.commits" = "1"
   5. Only file listing is enabled in metadata
   I dug into this a lot and enabled detailed logs, and this is what I think 
may be the issue:
   In the logs "Metadata read for %s keys took [baseFileRead, logMerge] %s ms" - 
half of the values are single-digit numbers and half are > 1000, e.g. Metadata 
read for 1 keys took [baseFileRead, logMerge] [0, 12456] ms.
   There are also logs like
   `Updating metadata metrics (basefile_read.totalDuration=12155)`
   `Updating metadata metrics (lookup_files.totalDuration=12156)`
   So my suspicion is that, given such a large metadata `file` (40MB), the 
hfile lookup is suboptimal. Are you aware of any issue similar to this (in 
0.12.1 and 0.12.2)? As I see in the code, it seeks up to a partition path; could 
it be that readers are somehow not being reused well, so these 40MB files 
are seeked again and again thousands of times?
   As a remediation, we turned off the embedded timeline server, so as I 
understand it the same work will be done on executors, but it will be less 
problematic since parallelism is much bigger.
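
The reader-reuse hypothesis above can be illustrated with a small sketch. This is not Hudi's metadata reader code; `ReaderCacheSketch`, `readerFor`, and `openCount` are hypothetical names, and the "reader" is a placeholder object. It only shows the difference between opening a reader per lookup and reusing one reader per file, under the assumption that opening is the expensive step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache that opens at most one reader per file path and reuses it for later lookups. */
final class ReaderCacheSketch<R> {
  private final Map<String, R> readers = new ConcurrentHashMap<>();
  private final Function<String, R> opener;
  private final AtomicInteger opens = new AtomicInteger();

  ReaderCacheSketch(Function<String, R> opener) {
    this.opener = opener;
  }

  /** Returns the cached reader for the path, opening it only on the first request. */
  R readerFor(String filePath) {
    return readers.computeIfAbsent(filePath, path -> {
      opens.incrementAndGet(); // counts how often the expensive open actually happens
      return opener.apply(path);
    });
  }

  /** Number of times the opener was invoked. */
  int openCount() {
    return opens.get();
  }
}
```

With such a cache, thousands of partition lookups against the same base file trigger a single open; without it, each lookup would pay the open and seek cost again, which would be consistent with the multi-second `basefile_read` durations reported above if the hypothesis holds.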
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   Generating a large set of partitions (up to a 30-40 MB hfile) and 
running an insert into thousands of partitions will probably reproduce it.
   
   **Expected behavior**
   
   `getting small files` should not take so long. Without metadata and the 
embedded server it takes ~40 sec to check 12k partitions, while with metadata 
it takes 40+ minutes.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-0.12.2
   
   * Spark version : 3.3.0-3.3.1
   
   * Hive version :
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   I can try to make a reproduction if you don't know about anything like this.
   
   
   





Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9843:
URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757693957

   
   ## CI report:
   
   * 4a77d04deabcc24baa73f35a64509f86fd84d03c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20283)
 
   * 6b231840828f6b70965f4015de976500918f5703 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20292)
 
   
   



Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9843:
URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757597868

   
   ## CI report:
   
   * 4a77d04deabcc24baa73f35a64509f86fd84d03c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20283)
 
   * 6b231840828f6b70965f4015de976500918f5703 UNKNOWN
   
   



Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757581642

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * bfea4593820ae5257f93f822986ef58168f69dde Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289)
 
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757510755

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 57442675289e3db3252449725380a72977f7e7fa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290)
 
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757505901

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286)
 
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
 
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757486587

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286)
 
   * ee3bbd6595f8a69ecaf53d9ac2b445533958832c UNKNOWN
   
   



Re: [PR] [HUDI-6924] Fix hoodie table config not work in table properties [hudi]

2023-10-11 Thread via GitHub


boneanxs commented on code in PR #9836:
URL: https://github.com/apache/hudi/pull/9836#discussion_r1354717038


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala:
##
@@ -199,7 +184,7 @@ object HoodieOptionConfig {
 
   // extract primaryKey, preCombineField, type options
   def extractSqlOptions(options: Map[String, String]): Map[String, String] = {
-val sqlOptions = mapTableConfigsToSqlOptions(options)
+val sqlOptions = mapHoodieOptionsToSqlOptions(options)

Review Comment:
   `mapHoodieConfigsToSqlOptions` should be more accurate?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala:
##
@@ -169,22 +170,6 @@ object HoodieOptionConfig {
   .toMap
   }
 
-  /**
-   * Get the table type from the table options.
-   * @param options
-   * @return
-   */
-  def getTableType(options: Map[String, String]): String = {

Review Comment:
   No need to delete this






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757389269

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287)
 
   * 57442675289e3db3252449725380a72977f7e7fa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290)
 
   
   



Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757367948

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287)
 
   * 57442675289e3db3252449725380a72977f7e7fa UNKNOWN
   
   



Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757366765

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * 2543e0a84a337b02e4af14a3f8b2c4dafbd6d558 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20266)
 
   * bfea4593820ae5257f93f822986ef58168f69dde Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289)
 
   
   



Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757346235

   
   ## CI report:
   
   * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN
   * 2543e0a84a337b02e4af14a3f8b2c4dafbd6d558 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20266)
 
   * bfea4593820ae5257f93f822986ef58168f69dde UNKNOWN
   
   



Re: [I] [SUPPORT]PrestoDB failed to query data from mor table [hudi]

2023-10-11 Thread via GitHub


ChandrasekharPo-Kore commented on issue #8078:
URL: https://github.com/apache/hudi/issues/8078#issuecomment-1757345076

   Any progress on this one?
   The query works on _ro but fails with this error on _rt.
   This is forcing us to use a COW table.





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757344904

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286)
 
   
   



Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


boneanxs commented on code in PR #9617:
URL: https://github.com/apache/hudi/pull/9617#discussion_r1354592241


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimeGeneratorBase.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.config.HoodieTimeGeneratorConfig;
+import org.apache.hudi.common.config.LockConfiguration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.lock.LockProvider;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.exception.HoodieLockException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.util.concurrent.TimeUnit;
+
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS;
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_NUM_RETRIES;
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_CLIENT_NUM_RETRIES_PROP_KEY;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY;
+
+/**
+ * Base time generator facility that maintains lock-related utilities.
+ */
+public abstract class TimeGeneratorBase implements TimeGenerator, Serializable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(TimeGeneratorBase.class);
+
+  /**
+   * The lock provider.
+   */
+  private volatile LockProvider lockProvider;
+  /**
+   * The maximum times to retry in case there are failures.
+   */
+  private final int maxRetries;
+  /**
+   * The maximum time to wait for each time generation to resolve the clock skew issue on distributed hosts.
+   */
+  private final long maxWaitTimeInMs;
+  /**
+   * The maximum time to block for acquiring a lock.
+   */
+  private final int lockAcquireWaitTimeInMs;
+
+  protected final HoodieTimeGeneratorConfig config;
+  private final LockConfiguration lockConfiguration;
+
+  /**
+   * The hadoop configuration.
+   */
+  private final SerializableConfiguration hadoopConf;
+
+  public TimeGeneratorBase(HoodieTimeGeneratorConfig config, SerializableConfiguration hadoopConf) {
+    this.config = config;
+    this.lockConfiguration = config.getLockConfiguration();
+    this.hadoopConf = hadoopConf;
+
+    maxRetries = lockConfiguration.getConfig().getInteger(LOCK_ACQUIRE_CLIENT_NUM_RETRIES_PROP_KEY,
+        Integer.parseInt(DEFAULT_LOCK_ACQUIRE_NUM_RETRIES));
+    lockAcquireWaitTimeInMs = lockConfiguration.getConfig().getInteger(LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY,
+        Integer.parseInt(DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS));
+    maxWaitTimeInMs = lockConfiguration.getConfig().getLong(LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY,
+        Long.parseLong(DEFAULT_LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS));
+  }
+
+  protected LockProvider getLockProvider() {
+    // Perform lazy initialization of lock provider only if needed
+    if (lockProvider == null) {
+      synchronized (this) {
+        if (lockProvider == null) {
+          String lockProviderClass = lockConfiguration.getConfig().getString("hoodie.write.lock.provider");
+          LOG.info("LockProvider for TimeGenerator: " + lockProviderClass);
+          lockProvider = (LockProvider) ReflectionUtils.loadClass(lockProviderClass,
+              lockConfiguration, hadoopConf.get());
+        }
+      }
+    }
+    return lockProvider;
+  }
+
+  public void lock() {

Review Comment:
   Use `RetryHelper` to simplify code here.
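
As a rough illustration of what the review comment suggests, a lock-acquisition loop can be reduced to a call into a generic retry utility. The sketch below is not Hudi's `RetryHelper` API; `RetrySketch` and `retry` are hypothetical names, and the parameters are only loosely modeled on the `maxRetries` and `maxWaitTimeInMs` fields in the diff above.

```java
import java.util.concurrent.Callable;

/** Hypothetical generic retry utility; an illustration only, not Hudi's RetryHelper. */
final class RetrySketch {

  /**
   * Runs the action, retrying up to maxRetries additional times after a failure
   * and sleeping waitMs between attempts. Rethrows the last failure if all attempts fail.
   */
  static <T> T retry(Callable<T> action, int maxRetries, long waitMs) throws Exception {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxRetries) {
          Thread.sleep(waitMs); // back off before the next attempt
        }
      }
    }
    throw last;
  }
}
```

A caller would wrap the lock attempt in a lambda and pass the configured retry count and wait time, so the retry bookkeeping lives in one place instead of being repeated in each `lock()` implementation.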






Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]

2023-10-11 Thread via GitHub


boneanxs commented on PR #9617:
URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757292134

   > Looks good overall, is there any way we can abstract the failure retries 
as a common utility? And add a UT for TimeGenerator.
   done





Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9118:
URL: https://github.com/apache/hudi/pull/9118#discussion_r1354583732


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java:
##
@@ -488,11 +505,24 @@ private void flushRemaining(boolean endInput) {
 this.writeStatuses.addAll(writeStatus);
 // blocks flushing until the coordinator starts a new instant
 this.confirming = true;
+
+writeMetrics.endFlushing();
+writeMetrics.resetAfterCommit();
+  }
+
+  private void registerMetrics() {
+MetricGroup metrics = getRuntimeContext().getMetricGroup();
+writeMetrics = new FlinkStreamWriteMetrics(metrics);
+writeMetrics.registerMetrics();
   }
 
   protected List writeBucket(String instant, DataBucket bucket, 
List records) {
 bucket.preWrite(records);
-return writeFunction.apply(records, instant);
+writeMetrics.startHandleClose();
+List statuses = writeFunction.apply(records, instant);

Review Comment:
   Maybe we just rename `checkpointFlush` -> `dataFlush` and `singleFileFlush` 
-> `fileFlush` ?






Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9118:
URL: https://github.com/apache/hudi/pull/9118#discussion_r1354581902


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java:
##
@@ -155,5 +168,15 @@ private void flushData(boolean endInput) {
 this.writeStatuses.addAll(writeStatus);
 // blocks flushing until the coordinator starts a new instant
 this.confirming = true;
+
+writeMetrics.endCheckpointFlushing();
+LOG.info("Flushing costs: {} ms", writeMetrics.getCheckpointFlushCosts());
+writeMetrics.resetAfterCommit();

Review Comment:
   We'd better avoid logging on every data flush.






Re: [I] [SUPPORT] Unstable Execution Time and Many RequestHandler WARN Logs [hudi]

2023-10-11 Thread via GitHub


lovemylover042 commented on issue #8100:
URL: https://github.com/apache/hudi/issues/8100#issuecomment-1757266624

   @danny0405 I found that the delta commit became very slow because it falls 
back to the secondary filesystem view when it gets a bad response from the 
remote timeline server. I think the bad response was caused by compaction 
running at the same time, so the timeline server was behind the client. Can I 
force a sync of the local view if the timeline server is behind the client? 
   -
   org.apache.hudi.timeline.service.RequestHandler line 501:
   // TODO: set refreshCheck to true when the timeline server has been behind several times or for some seconds
   if (refreshCheck) {
     long beginFinalCheck = System.currentTimeMillis();
     if (isLocalViewBehind(context)) {
       String errMsg =
           "Last known instant from client was "
               + context.queryParam(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS,
               HoodieTimeline.INVALID_INSTANT_TS)
               + " but server has the following timeline "
               + viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM))
               .getTimeline().getInstants().collect(Collectors.toList());
       throw new BadRequestResponse(errMsg);
     }
     long endFinalCheck = System.currentTimeMillis();
     finalCheckTimeTaken = endFinalCheck - beginFinalCheck;
   }
   -
   Environment Description: 
   Hudi version : 0.10.1
   Spark version : 3.0.1
   Hadoop version : 3.1.1
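
The TODO in the snippet above (trigger the check once the server has been behind "several times or some seconds") could be tracked roughly as follows. This is only a sketch: `BehindTracker`, its thresholds, and how its result would feed `refreshCheck` are all assumptions and do not exist in Hudi.

```java
// Hypothetical tracker; neither the class nor its thresholds exist in Hudi.
class BehindTracker {
  private final int maxConsecutiveBehind;
  private final long maxBehindMillis;
  private int consecutiveBehind = 0;
  private long firstBehindAt = -1L;

  BehindTracker(int maxConsecutiveBehind, long maxBehindMillis) {
    this.maxConsecutiveBehind = maxConsecutiveBehind;
    this.maxBehindMillis = maxBehindMillis;
  }

  // Records one observation; returns true when the server has been behind
  // either too many times in a row or for too long since the streak began.
  boolean recordAndCheck(boolean behind, long nowMillis) {
    if (!behind) {
      // Any up-to-date response resets the streak.
      consecutiveBehind = 0;
      firstBehindAt = -1L;
      return false;
    }
    if (consecutiveBehind == 0) {
      firstBehindAt = nowMillis;
    }
    consecutiveBehind++;
    return consecutiveBehind >= maxConsecutiveBehind
        || nowMillis - firstBehindAt >= maxBehindMillis;
  }
}
```

A caller would pass the outcome of its own "is the view behind" check together with `System.currentTimeMillis()`, and treat a `true` result as the signal to force a refresh.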
   
   





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


danny0405 commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1354565693


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##
@@ -51,4 +51,11 @@ public class HoodieReaderConfig {
   .sinceVersion("0.13.0")
   .withDocumentation("New optimized scan for log blocks that handles all 
multi-writer use-cases while appending to log files. "
   + "It also differentiates original blocks written by ingestion 
writers and compacted blocks written log compaction.");
+
+  public static final ConfigProperty FILE_GROUP_READER_ENABLED = 
ConfigProperty
+  .key("hoodie.file.group.reader.enabled")
+  .defaultValue(true)
+  .markAdvanced()
+  .sinceVersion("1.0.0")

Review Comment:
   Does this option exist because we are not confident that the new reader 
covers all the read use cases, or because we are not sure it is robust enough? 
The default value is true, which means users would encounter these problems 
anyway. Can we address them and just remove this option?






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757244622

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287)
 
   
   



[jira] [Closed] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan

2023-10-11 Thread Kong Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kong Wei closed HUDI-6714.
--
Resolution: Won't Do

A parameter already exists to enable this feature:
hoodie.compact.schedule.inline
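
As a rough illustration of the parameter mentioned above, the sketch below builds writer options that schedule compaction inline while leaving execution to a separate job. `hoodie.compact.inline` is not mentioned in this thread; it is included on the assumption that inline execution should be switched off explicitly. The class and method names are made up for the example, and the exact behavior should be checked against the Hudi configuration docs for the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class ScheduleOnlyCompactionOptions {
  // Options intended to schedule compaction plans inline without executing them.
  static Map<String, String> build() {
    Map<String, String> opts = new LinkedHashMap<>();
    // Assumption: turn off inline execution so the writer does not run compaction itself.
    opts.put("hoodie.compact.inline", "false");
    // The parameter named in the resolution comment: schedule the plan after each write.
    opts.put("hoodie.compact.schedule.inline", "true");
    return opts;
  }

  public static void main(String[] args) {
    build().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```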

> HoodieStreamer support only schedule the compaction plan but not execute the 
> plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
> mode can be *async.*
> In the async compaction mode, the hoodie-streamer will schedule one 
> compaction plan after each write operation and execute compaction plan if 
> need. But the execution of compaction will share the spark job resource, 
> which may cause the write delay.
> In our cases, we want to execute the compaction offline to save the spark 
> resource of streamer and reduce the write latency. And we found that 
> scheduling the compaction plan offline will fail while streamer is writing 
> (means we have to stop the streamer in order to schedule the plan offline). 
> So we want the streamer only to schedule the compaction plan but not to 
> execute it.
> But currently the streamer seems not support such case. If we set the 
> `--disable-compaction` to false, the streamer will not schedule the 
> compaction plan anymore.
> So I want to add a param named --{_}enable-schedule-compaction{_} in the 
> streamer,
> and we can set --{_}disable-compaction{_}=false and 
> {_}enable-schedule-compaction{_}=true to enable only schedule the compaction 
> in streamer.
> the cases like below:
> ||--disable-compaction||--enable-schedule-compaction||schedule plan||execute 
> plan||
> |true|true or false|true|true|
> |false|true|true|false|
> |false|false|false|false|
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] Enable metadata table, Spark write mor table duplicate data [hudi]

2023-10-11 Thread via GitHub


ad1happy2go commented on issue #9714:
URL: https://github.com/apache/hudi/issues/9714#issuecomment-1757232992

   ```  
 inserts.write.format("org.apache.hudi")
   .option("hoodie.datasource.write.recordkey.field", "uuid")
   .option("hoodie.datasource.write.partitionpath.field", "city")
   .option("hoodie.parquet.small.file.limit", 128)
   .option("hoodie.datasource.hive_sync.partition_fields", "city")
   .option("hoodie.upsert.shuffle.parallelism", 200)
   .option("hoodie.datasource.write.operation", "upsert")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
   .option("hoodie.embed.timeline.server", true)
   .option("hoodie.datasource.write.streaming.ignore.failed.batch", 
false)
   .option("hoodie.cleaner.commits.retained", "15")
   .option("hoodie.datasource.hive_sync.table_properties", 
"spark.sql.partitionProvider=catalog")
   .option("hoodie.keep.min.commits", 25)
   .option("hoodie.keep.max.commits", 30)
   .option("hoodie.clean.async", false)
   .option("hoodie.table.name", tableName)
   .option("hoodie.datasource.write.payload.class", 
"org.apache.hudi.common.model.DefaultHoodieRecordPayload")
   .option("hoodie.payload.event.time.field", "ts")
   .option("hoodie.payload.ordering.field", "ts")
   .option("hoodie.write.markers.type", "DIRECT")
   .option("hoodie.metadata.enable", "true")
   .option("hoodie.metadata.index.bloom.filter.enable", "true")
   .option("hoodie.metadata.index.column.stats.enable", "true")
   .option("hoodie.metadata.index.column.stats.column.list", "uuid")
   .option("hoodie.bloom.index.use.metadata", "true")
   .mode(Append)
   .save(basePath);
   ```





[jira] [Updated] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan

2023-10-11 Thread Kong Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kong Wei updated HUDI-6714:
---
Description: 
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
mode can be *async.*

In the async compaction mode, the hoodie-streamer will schedule one compaction 
plan after each write operation and execute compaction plan if need. But the 
execution of compaction will share the spark job resource, which may cause the 
write delay.

In our cases, we want to execute the compaction offline to save the spark 
resource of streamer and reduce the write latency. And we found that scheduling 
the compaction plan offline will fail while streamer is writing (means we have 
to stop the streamer in order to schedule the plan offline). So we want the 
streamer only to schedule the compaction plan but not to execute it.

But currently the streamer seems not support such case. If we set the 
`--disable-compaction` to false, the streamer will not schedule the compaction 
plan anymore.

So I want to add a param named --{_}enable-schedule-compaction{_} in the 
streamer,

and we can set --{_}disable-compaction{_}=false and 
{_}enable-schedule-compaction{_}=true to enable only schedule the compaction in 
streamer.

the cases like below:
||--disable-compaction||--enable-schedule-compaction||schedule plan||execute 
plan||
|true|true or false|true|true|
|false|true|true|false|
|false|false|false|false|

 

  was:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
mode can be *async.*

In the async compaction mode, the hoodie-streamer will schedule one compaction 
plan after each write operation and execute compaction plan if need. But the 
execution of compaction will share the spark job resource, which may cause the 
write delay.

In our cases, we want to execute the compaction offline to save the spark 
resource for streamer and reduce the write latency. And we found that 
scheduling the compaction plan offline will fail while streamer is writing 
(means we have to stop the streamer in order to schedule the plan offline). So 
we only want the streamer to schedule the compaction but not to execute it.

But currently the streamer seems not support such case. If we set the 
`--disable-compaction` to false, the streamer will not schedule the compaction 
anymore.

So I want to add a param named --{_}enable-schedule-compaction{_} in the 
streamer,

and we can set --{_}disable-compaction{_}=false and 
{_}enable-schedule-compaction{_}=true to enable only schedule the compaction in 
streamer.

the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true
no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false
--enable-schedule-compaction = true|true|false|
|--disable-compaction = false
--enable-schedule-compaction = false|false|false|

 


> HoodieStreamer support only schedule the compaction plan but not execute the 
> plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
> mode can be *async.*
> In the async compaction mode, the hoodie-streamer will schedule one 
> compaction plan after each write operation and execute compaction plan if 
> need. But the execution of compaction will share the spark job resource, 
> which may cause the write delay.
> In our cases, we want to execute the compaction offline to save the spark 
> resource of streamer and reduce the write latency. And we found that 
> scheduling the compaction plan offline will fail while streamer is writing 
> (means we have to stop the streamer in order to schedule the plan offline). 
> So we want the streamer only to schedule the compaction plan but not to 
> execute it.
> But currently the streamer seems not support such case. If we set the 
> `--disable-compaction` to false, the streamer will not schedule the 
> compaction plan anymore.
> So I want to add a param named --{_}enable-schedule-compaction{_} in the 
> streamer,
> and we can set --{_}disable-compaction{_}=false and 
> {_}enable-schedule-compaction{_}=true to enable only schedule the compaction 
> in streamer.
> the cases like below:
> ||--disable-compaction||--enable-schedule-compaction||schedule plan||execute 
> plan||
> |true|true or false|true|true|
> |false|true|true|false|
> |false|false|false|false|
>  





[jira] [Updated] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan

2023-10-11 Thread Kong Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kong Wei updated HUDI-6714:
---
Description: 
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
mode can be *async.*

In the async compaction mode, the hoodie-streamer will schedule one compaction 
plan after each write operation and execute compaction plan if need. But the 
execution of compaction will share the spark job resource, which may cause the 
write delay.

In our cases, we want to execute the compaction offline to save the spark 
resource for streamer and reduce the write latency. And we found that 
scheduling the compaction plan offline will fail while streamer is writing 
(means we have to stop the streamer in order to schedule the plan offline). So 
we only want the streamer to schedule the compaction but not to execute it.

But currently the streamer seems not support such case. If we set the 
`--disable-compaction` to false, the streamer will not schedule the compaction 
anymore.

So I want to add a param named --{_}enable-schedule-compaction{_} in the 
streamer,

and we can set --{_}disable-compaction{_}=false and 
{_}enable-schedule-compaction{_}=true to enable only schedule the compaction in 
streamer.

the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true
no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false
--enable-schedule-compaction = true|true|false|
|--disable-compaction = false
--enable-schedule-compaction = false|false|false|

 

  was:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
mode can *async.*

In the async compaction mode, the hoodie-streamer will schedule one compaction 
plan after each write operation and execute compaction plan if need. But the 
execution of compaction will share the spark job resource, which may cause the 
write delay.

In our cases, we want to execute the compaction offline to save the spark 
resource for streamer and reduce the write latency. And we found that 
scheduling the compaction plan offline will fail while streamer is writing 
(means we have to stop the streamer in order to schedule the plan offline). So 
we only want the streamer to schedule the compaction but not to execute it.

But currently the streamer seems not support such case. If we set the 
`--disable-compaction` to false, the streamer will not schedule the compaction 
anymore.

So I want to add a param named --{_}enable-schedule-compaction{_} in the 
streamer,

and we can set --{_}disable-compaction{_}=false and 
{_}enable-schedule-compaction{_}=true to enable only schedule the compaction in 
streamer.

the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true
no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false
--enable-schedule-compaction = true|true|false|
|--disable-compaction = false
--enable-schedule-compaction = false|false|false|

 


> HoodieStreamer support only schedule the compaction plan but not execute the 
> plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction 
> mode can be *async.*
> In the async compaction mode, the hoodie-streamer will schedule one 
> compaction plan after each write operation and execute compaction plan if 
> need. But the execution of compaction will share the spark job resource, 
> which may cause the write delay.
> In our cases, we want to execute the compaction offline to save the spark 
> resource for streamer and reduce the write latency. And we found that 
> scheduling the compaction plan offline will fail while streamer is writing 
> (means we have to stop the streamer in order to schedule the plan offline). 
> So we only want the streamer to schedule the compaction but not to execute it.
> But currently the streamer seems not support such case. If we set the 
> `--disable-compaction` to false, the streamer will not schedule the 
> compaction anymore.
> So I want to add a param named --{_}enable-schedule-compaction{_} in the 
> streamer,
> and we can set --{_}disable-compaction{_}=false and 
> {_}enable-schedule-compaction{_}=true to enable only schedule the compaction 
> in streamer.
> the cases like below:
> ||param case||schedule plan||execute plan||
> |--disable-compaction = true
> no matter --enable-schedule-compaction|true|true|
> |--disable-compaction = false
> --enable-schedule-compaction = true|true|false|
> |--disable-compaction = false
> --enable-schedule-compaction = false|false|false|
>  




[hudi] branch master updated: [HUDI-6925] Do not list all partitions for 'alter table drop partition' (#9837)

2023-10-11 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 21a8a3c3693 [HUDI-6925] Do not list all partitions for 'alter table 
drop partition' (#9837)
21a8a3c3693 is described below

commit 21a8a3c3693d550005b098f833630c6af8106aa7
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Wed Oct 11 03:17:21 2023 -0500

[HUDI-6925] Do not list all partitions for 'alter table drop partition' 
(#9837)
---
 .../scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
index d5f46936be5..9751624e3bf 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
@@ -326,15 +326,7 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
   def getPartitionPathToDrop(
   hoodieCatalogTable: HoodieCatalogTable,
   normalizedSpecs: Seq[Map[String, String]]): 
String = {
-val table = hoodieCatalogTable.table
-val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
-val enableHiveStylePartitioning = 
isHiveStyledPartitioning(allPartitionPaths, table)
-val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-val partitionFields = hoodieCatalogTable.partitionFields
-val partitionsToDrop = normalizedSpecs.map(
-  makePartitionPath(partitionFields, _, enableEncodeUrl, 
enableHiveStylePartitioning)
-).mkString(",")
-partitionsToDrop
+normalizedSpecs.map(makePartitionPath(hoodieCatalogTable, _)).mkString(",")
   }
 
   private def makePartitionPath(partitionFields: Seq[String],



Re: [PR] [HUDI-6925] Do not list all partitions for 'alter table drop partition' [hudi]

2023-10-11 Thread via GitHub


leesf merged PR #9837:
URL: https://github.com/apache/hudi/pull/9837





Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1354336953


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -146,28 +154,52 @@ public void initRecordIterators() {
* @return {@code true} if the next record exists; {@code false} otherwise.
* @throws IOException on reader error.
*/
-  public boolean hasNext() throws IOException {
+  public boolean hasNext() {
+// Merge records from base file and log files.
+int baseFileSequenceNo = 0;
 while (baseFileIterator.hasNext()) {
   T baseRecord = baseFileIterator.next();
-  String recordKey = readerContext.getRecordKey(baseRecord, 
readerState.baseFileAvroSchema);
-  Pair, Map> logRecordInfo = 
logFileRecordMapping.remove(recordKey);
-  Option resultRecord = logRecordInfo != null
-  ? merge(Option.of(baseRecord), Collections.emptyMap(), 
logRecordInfo.getLeft(), logRecordInfo.getRight())
-  : merge(Option.empty(), Collections.emptyMap(), 
Option.of(baseRecord), Collections.emptyMap());
+  Pair, Map> logRecordInfo;
+
+  if (shouldUseRecordPosition) {

Review Comment:
   Good point. I also feel that if-else is not a clean way to add the new 
logic. After adding partial merging, we will have many readers.
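
One way the if-else could be factored out, sketched under assumptions: put the "find the matching log record" step behind a small interface and choose the implementation once. All names below (`LogRecordLookup`, `KeyBasedLookup`, `PositionBasedLookup`, `LookupFactory`) are hypothetical and are not the HoodieFileGroupReader API.

```java
import java.util.Map;

// Hypothetical names throughout; this is not the HoodieFileGroupReader API.
interface LogRecordLookup<T> {
  // Returns and removes the log record matching the base record, or null if there is none.
  T removeMatch(String recordKey, long position);
}

// Matches log records to base records by record key.
class KeyBasedLookup<T> implements LogRecordLookup<T> {
  private final Map<String, T> byKey;

  KeyBasedLookup(Map<String, T> byKey) {
    this.byKey = byKey;
  }

  @Override
  public T removeMatch(String recordKey, long position) {
    return byKey.remove(recordKey);
  }
}

// Matches log records to base records by position in the base file.
class PositionBasedLookup<T> implements LogRecordLookup<T> {
  private final Map<Long, T> byPosition;

  PositionBasedLookup(Map<Long, T> byPosition) {
    this.byPosition = byPosition;
  }

  @Override
  public T removeMatch(String recordKey, long position) {
    return byPosition.remove(position);
  }
}

class LookupFactory {
  // The branch on the flag happens once here instead of inside the read loop.
  static <T> LogRecordLookup<T> create(
      boolean useRecordPosition, Map<String, T> byKey, Map<Long, T> byPosition) {
    return useRecordPosition
        ? new PositionBasedLookup<>(byPosition)
        : new KeyBasedLookup<>(byKey);
  }
}
```

With this shape, a further variant such as partial merging would add another implementation rather than another branch in `hasNext()`.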






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757040548

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 3f5ff2bfb0f3446718546765a9838d595317f748 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20285)
 
   * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287)
 
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757038861

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * a9b387e611bdc9c492a27c6adffe2bf74662be96 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19956)
 
   * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286)
 
   
   



Re: [PR] [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs [hudi]

2023-10-11 Thread via GitHub


wombatu-kun commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r1354321206


##
rfc/rfc-54/rfc-54.md:
##
@@ -0,0 +1,175 @@
+
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record

Review Comment:
   I thought there is a naming convention in the community: the prefix "hudi" 
is for the project and its submodules, while "hoodie" is for classes. Maybe it 
is better not to break this rule and not use HudiTable as a class name?






Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757026871

   
   ## CI report:
   
   * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
   * 3f5ff2bfb0f3446718546765a9838d595317f748 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20285)
 
   * e345e838d5daa8c25475e3b12e149d7e5abc5229 UNKNOWN
   
   



Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]

2023-10-11 Thread via GitHub


hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757025319

   
   ## CI report:
   
   * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
   * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
   * a9b387e611bdc9c492a27c6adffe2bf74662be96 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19956)
 
   * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 UNKNOWN
   
   



Re: [PR] [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs [hudi]

2023-10-11 Thread via GitHub


wombatu-kun commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r1354314469


##
rfc/rfc-54/rfc-54.md:
##
@@ -0,0 +1,183 @@
+
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record
+payload, etc. On one hand, this is the true strength of Hudi which makes it
+suitable for many use cases and offers the users a solution to the tradeoffs
+encountered in data systems. On the other, it has also resulted in the learning
+curve for new users to be steeper. In this RFC, we propose to streamline some of
+these configurations. Additionally, we propose a few table level APIs to create
+or update Hudi table programmatically. Together, they would help in a smoother
+onboarding experience and increase the usability of Hudi. It would also help
+existing users through better configuration maintenance.
+
+## Background
+
+Currently, users can create and update Hudi Table using three different
+ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
+[SQL](https://hudi.apache.org/docs/table_management)
+and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one

Review Comment:
   But there is no DeltaStreamer anymore. It was renamed to just Streamer: 
https://hudi.apache.org/docs/hoodie_streaming_ingestion





