Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758945698 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 85bf27abe36ef2a6500ed323e64d6598649c95c2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20298) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758932211 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20295)
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758889101 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294) * 85bf27abe36ef2a6500ed323e64d6598649c95c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20298)
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1356029231 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMultiFileFormatRelation.scala: ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hudi.HoodieBaseRelation.projectReader +import org.apache.hudi.HoodieConversionUtils.toScalaOption +import org.apache.hudi.HoodieMultiFileFormatRelation.{createPartitionedFile, inferFileFormat} +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieLogFile} +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile} +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types.StructType + +import scala.jdk.CollectionConverters.asScalaIteratorConverter + +/** + * Base split for all Hoodie multi-file format relations. + */ +case class HoodieMultiFileFormatSplit(baseFile: Option[PartitionedFile], + logFiles: List[HoodieLogFile]) extends HoodieFileSplit + +/** + * Base relation to handle table with multiple base file formats. + */ +abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext, + override val metaClient: HoodieTableMetaClient, Review Comment: Why do we need a new relation abstraction here? The base file format can always be inferred from the file extension, right?
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1356027186 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ## @@ -516,6 +515,99 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext, */ def updatePrunedDataSchema(prunedSchema: StructType): Relation + protected def createBaseFileReaders(tableSchema: HoodieTableSchema, + requiredSchema: HoodieTableSchema, + requestedColumns: Array[String], + requiredFilters: Seq[Filter], + optionalFilters: Seq[Filter] = Seq.empty, + baseFileFormat: HoodieFileFormat = tableConfig.getBaseFileFormat): HoodieMergeOnReadBaseFileReaders = { +val (partitionSchema, dataSchema, requiredDataSchema) = + tryPrunePartitionColumns(tableSchema, requiredSchema) + +val fullSchemaReader = createBaseFileReader( + spark = sqlContext.sparkSession, + partitionSchema = partitionSchema, + dataSchema = dataSchema, + requiredDataSchema = dataSchema, + // This file-reader is used to read base file records, subsequently merging them with the records + // stored in delta-log files. 
As such, we have to read _all_ records from the base file, while avoiding + // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that + // we combine them correctly); + // As such only required filters could be pushed-down to such reader + filters = requiredFilters, + options = optParams, + // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it + // to configure Parquet reader appropriately + hadoopConf = embedInternalSchema(new Configuration(conf), internalSchemaOpt), + baseFileFormat = baseFileFormat +) + +val requiredSchemaReader = createBaseFileReader( + spark = sqlContext.sparkSession, + partitionSchema = partitionSchema, + dataSchema = dataSchema, + requiredDataSchema = requiredDataSchema, + // This file-reader is used to read base file records, subsequently merging them with the records + // stored in delta-log files. As such, we have to read _all_ records from the base file, while avoiding + // applying any filtering _before_ we complete combining them w/ delta-log records (to make sure that + // we combine them correctly); + // As such only required filters could be pushed-down to such reader + filters = requiredFilters, + options = optParams, + // NOTE: We have to fork the Hadoop Config here as Spark will be modifying it + // to configure Parquet reader appropriately + hadoopConf = embedInternalSchema(new Configuration(conf), requiredDataSchema.internalSchema), + baseFileFormat = baseFileFormat +) + +// Check whether fields required for merging were also requested to be fetched +// by the query: +//- In case they were, there's no optimization we could apply here (we will have +//to fetch such fields) +//- In case they were not, we will provide 2 separate file-readers +//a) One which would be applied to file-groups w/ delta-logs (merging) +//b) One which would be applied to file-groups w/ no delta-logs or +// in case query-mode is skipping merging +val mandatoryColumns = 
mandatoryFields.map(HoodieAvroUtils.getRootLevelFieldName) +if (mandatoryColumns.forall(requestedColumns.contains)) { + HoodieMergeOnReadBaseFileReaders( +fullSchemaReader = fullSchemaReader, +requiredSchemaReader = requiredSchemaReader, +requiredSchemaReaderSkipMerging = requiredSchemaReader + ) +} else { + val prunedRequiredSchema = { +val unusedMandatoryColumnNames = mandatoryColumns.filterNot(requestedColumns.contains) +val prunedStructSchema = + StructType(requiredDataSchema.structTypeSchema.fields +.filterNot(f => unusedMandatoryColumnNames.contains(f.name))) + +HoodieTableSchema(prunedStructSchema, convertToAvroSchema(prunedStructSchema, tableName).toString) + } + + val requiredSchemaReaderSkipMerging = createBaseFileReader( +spark = sqlContext.sparkSession, +partitionSchema = partitionSchema, +dataSchema = dataSchema, +requiredDataSchema = prunedRequiredSchema, +// This file-reader is only used in cases when no merging is performed, therefore it's safe to push +// down these filters to the base file readers +filters = requiredFilters ++ optionalFilters, +options = optParams, +// NOTE: We have to fork the Hadoop Config here as Spark will be modifying it +// to configure Parquet reader
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1356025835 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ## @@ -255,40 +255,47 @@ object DefaultSource { Option.empty } - (tableType, queryType, isBootstrappedTable) match { -case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) | - (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) | - (MERGE_ON_READ, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false) => + val isMultipleBaseFileFormatsEnabled = metaClient.getTableConfig.isMultipleBaseFileFormatsEnabled + (tableType, queryType, isBootstrappedTable, isMultipleBaseFileFormatsEnabled) match { +case (COPY_ON_WRITE, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) | + (COPY_ON_WRITE, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, false, true) => + new HoodieMultiFileFormatCOWRelation(sqlContext, metaClient, parameters, userSchema, globPaths) +case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false, true) | Review Comment: Can we handle `isMultipleBaseFileFormatsEnabled` separately?
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1356023744 ## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java: ## @@ -821,6 +820,8 @@ public static class PropertyBuilder { private String metadataPartitions; private String inflightMetadataPartitions; private String secondaryIndexesMetadata; +private Boolean multipleBaseFileFormatsEnabled; +private String baseFileFormats; Review Comment: Can you explain at a high level why we need the `baseFileFormats` config, and why it must be a table config here? Actually, we could merge these two variables into one: for example, an empty string of base file formats could represent `disabled`.
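To make the suggestion in the review comment above concrete, here is a minimal sketch of how a single string setting could carry both pieces of information. It is only an illustration under stated assumptions: the class `BaseFileFormatsSetting` and its methods are hypothetical names and are not part of Hudi or of this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not Hudi code: one comma-separated setting replaces
// the pair (multipleBaseFileFormatsEnabled, baseFileFormats).
final class BaseFileFormatsSetting {

    private BaseFileFormatsSetting() {
    }

    // Parses a value such as "parquet,orc" into normalized format names.
    // A null or blank value yields an empty list.
    static List<String> parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> formats = new ArrayList<>();
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                formats.add(trimmed.toUpperCase(Locale.ROOT));
            }
        }
        return formats;
    }

    // An empty list of formats is read as "feature disabled".
    static boolean isMultipleBaseFileFormatsEnabled(String raw) {
        return !parse(raw).isEmpty();
    }
}
```

With this shape, the boolean never needs to be stored: it is derived from whether the format list is empty, so the two values cannot disagree.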
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1356020748 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java: ## @@ -866,11 +866,11 @@ public void validateInsertSchema() throws HoodieInsertException { } public HoodieFileFormat getBaseFileFormat() { -return metaClient.getTableConfig().getBaseFileFormat(); - } - - public HoodieFileFormat getLogFileFormat() { -return metaClient.getTableConfig().getLogFileFormat(); +HoodieTableConfig tableConfig = metaClient.getTableConfig(); +if (tableConfig.contains(HoodieTableConfig.BASE_FILE_FORMAT)) { + return metaClient.getTableConfig().getBaseFileFormat(); +} +return config.getBaseFileFormat(); Review Comment: Should the table config have higher priority, or vice versa?
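The precedence question raised above can be illustrated with a small, self-contained sketch that mirrors the order used in the quoted diff: the table-level value wins when it is set, and the writer-level value is only a fallback. Everything here is an assumption for illustration: `BaseFileFormatResolver`, the placeholder key `example.base.file.format`, and the use of `java.util.Properties` are not Hudi's actual classes or config keys.

```java
import java.util.Properties;

// Hypothetical sketch, not Hudi code: resolves a setting from two layers.
final class BaseFileFormatResolver {

    // Placeholder key chosen for this example only.
    static final String FORMAT_KEY = "example.base.file.format";

    private BaseFileFormatResolver() {
    }

    // Table-level properties take priority; writer-level properties are the
    // fallback; defaultFormat applies when neither layer sets the key.
    static String resolve(Properties tableProps, Properties writerProps, String defaultFormat) {
        if (tableProps.containsKey(FORMAT_KEY)) {
            return tableProps.getProperty(FORMAT_KEY);
        }
        return writerProps.getProperty(FORMAT_KEY, defaultFormat);
    }
}
```

Reversing the priority would only require swapping the two lookups, which is why the choice deserves an explicit decision in review: a persisted table-level value that silently loses to a per-job writer option could produce files in a format the table did not declare.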
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758861709 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294) * 85bf27abe36ef2a6500ed323e64d6598649c95c2 UNKNOWN
[jira] [Closed] (HUDI-6873) Clustering MOR applies base files after log files
[ https://issues.apache.org/jira/browse/HUDI-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-6873.
Fix Version/s: 1.0.0, 0.14.1
Resolution: Fixed

> Clustering MOR applies base files after log files
>
> Key: HUDI-6873
> URL: https://issues.apache.org/jira/browse/HUDI-6873
> Project: Apache Hudi
> Issue Type: Bug
> Components: deltastreamer, spark
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> If the payload is OverwriteWithLatestAvroPayload, this matters: if the base
> file record and the update have the same precombine value, then the record in
> the base file will be used instead of records from later writes.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
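The effect described in the issue can be reproduced with a toy model. The sketch below is not Hudi's implementation: `MergeOrderSketch`, `Rec`, and the merge rule (the incoming record wins unless the existing one has a strictly larger ordering value) are simplifying assumptions chosen to show why the order in which sources are applied decides ties.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hudi code.
final class MergeOrderSketch {

    // A record with a key, an ordering ("precombine") value, and a payload.
    static final class Rec {
        final String key;
        final long orderingValue;
        final String value;

        Rec(String key, long orderingValue, String value) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.value = value;
        }
    }

    private MergeOrderSketch() {
    }

    // Applies sources in the given order. On equal ordering values the record
    // applied later replaces the one applied earlier.
    static Map<String, Rec> mergeInOrder(List<List<Rec>> sources) {
        Map<String, Rec> merged = new LinkedHashMap<>();
        for (List<Rec> source : sources) {
            for (Rec incoming : source) {
                merged.merge(incoming.key, incoming,
                    (existing, next) -> existing.orderingValue > next.orderingValue ? existing : next);
            }
        }
        return merged;
    }
}
```

Under this rule, applying the base file first and the log records second lets the log record win a tie, while applying the base file last lets the older base record win, which is the behavior the issue reports.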
[hudi] branch master updated: [HUDI-6873] fix clustering mor (#9774)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3c439a3a69f [HUDI-6873] fix clustering mor (#9774) 3c439a3a69f is described below commit 3c439a3a69fb88dee551d94c8266e48b7c0d1e8f Author: Jon Vexler AuthorDate: Wed Oct 11 23:04:55 2023 -0400 [HUDI-6873] fix clustering mor (#9774) Currently during clustering of noncompacted mor filegroups with row writer disabled (currently the default for clustering), the records in the base file are applied to the log scanner after the log files have been scanned. If they have the same precombine, the base file records will be chosen over the log file records. This commit mimics the implementation in Iterators.scala to make the behavior consistent. - Co-authored-by: Jonathan Vexler <=> --- .../hudi/common/table/log/CachingIterator.java | 41 +++ .../common/table/log/HoodieFileSliceReader.java| 75 +-- .../hudi/common/table/log/LogFileIterator.java | 57 +++ .../run/strategy/JavaExecutionStrategy.java| 4 +- .../MultipleSparkJobExecutionStrategy.java | 4 +- .../hudi/sink/clustering/ClusteringOperator.java | 3 +- .../TestHoodieSparkMergeOnReadTableClustering.java | 2 +- .../apache/hudi/functional/TestMORDataSource.scala | 85 +- 8 files changed, 243 insertions(+), 28 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java new file mode 100644 index 000..d022b92ae22 --- /dev/null +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/CachingIterator.java @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.log; + +import java.util.Iterator; + +public abstract class CachingIterator implements Iterator { + + protected T nextRecord; + + protected abstract boolean doHasNext(); + + @Override + public final boolean hasNext() { +return nextRecord != null || doHasNext(); + } + + @Override + public final T next() { +T record = nextRecord; +nextRecord = null; +return record; + } + +} diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java index fc3ef4b8d92..1aa2f21fcb2 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java @@ -19,47 +19,80 @@ package org.apache.hudi.common.table.log; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.HoodiePayloadProps; import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; import org.apache.hudi.common.util.Option; import org.apache.hudi.common.util.collection.Pair; +import 
org.apache.hudi.exception.HoodieClusteringException; import org.apache.hudi.io.storage.HoodieFileReader; import org.apache.avro.Schema; import java.io.IOException; import java.util.Iterator; +import java.util.Map; import java.util.Properties; -/** - * Reads records from base file and merges any updates from log files and provides iterable over all records in the file slice. - */ -public class HoodieFileSliceReader implements Iterator> { +public class HoodieFileSliceReader extends LogFileIterator { + private Option> baseFileIterator; + private HoodieMergedLogRecordScanner scanner; + private Schema schema; + private Properties props; - private final Iterator> recordsIterator; + private TypedProperties payloadProps = new TypedProperties(); + private Option> simpleKeyGenFieldsOpt; + Map records; + HoodieRecordMerger merger; -
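The `CachingIterator` added by this commit splits iteration into a look-ahead step (`doHasNext()` stores the next element in `nextRecord`) and a hand-off step (`next()` returns and clears it). The sketch below shows how a subclass might use that contract. It is an illustration under assumptions: the type parameter `<T>` is restored by guesswork because the archive stripped generics, the base class is renamed `CachingIteratorSketch` to make clear it is a copy for demonstration, and `EvenNumbers` is a hypothetical subclass that does not exist in Hudi.

```java
import java.util.Iterator;

// Demonstration copy of the pattern in the diff above, with generics assumed.
abstract class CachingIteratorSketch<T> implements Iterator<T> {

    protected T nextRecord;

    // Subclasses look ahead here and, on success, store the element in nextRecord.
    protected abstract boolean doHasNext();

    @Override
    public final boolean hasNext() {
        // A cached element short-circuits the look-ahead, so repeated calls are safe.
        return nextRecord != null || doHasNext();
    }

    @Override
    public final T next() {
        T record = nextRecord;
        nextRecord = null;
        return record;
    }
}

// Hypothetical subclass: yields only the even numbers of a source iterator.
final class EvenNumbers extends CachingIteratorSketch<Integer> {

    private final Iterator<Integer> source;

    EvenNumbers(Iterator<Integer> source) {
        this.source = source;
    }

    @Override
    protected boolean doHasNext() {
        while (source.hasNext()) {
            Integer candidate = source.next();
            if (candidate % 2 == 0) {
                nextRecord = candidate;
                return true;
            }
        }
        return false;
    }
}
```

One consequence of this design, as written in the diff, is that `next()` returns `null` when it is called without a preceding successful `hasNext()`, so callers are expected to follow the usual `hasNext()`/`next()` pairing.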
Re: [PR] [HUDI-6873] fix clustering mor [hudi]
codope merged PR #9774: URL: https://github.com/apache/hudi/pull/9774
Re: [PR] [HUDI-6873] fix clustering mor [hudi]
codope commented on PR #9774: URL: https://github.com/apache/hudi/pull/9774#issuecomment-1758839664 Landing this PR. Test failure is unrelated. The integ test failure should be fixed by #9843
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1355986341 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieKeyBasedFileGroupRecordBuffer.java: ## @@ -0,0 +1,250 @@ +/* + * + * * Licensed to the Apache Software Foundation (ASF) under one or more + * * contributor license agreements. See the NOTICE file distributed with + * * this work for additional information regarding copyright ownership. + * * The ASF licenses this file to You under the Apache License, Version 2.0 + * * (the "License"); you may not use this file except in compliance with + * * the License. You may obtain a copy of the License at + * * + * *http://www.apache.org/licenses/LICENSE-2.0 + * * + * * Unless required by applicable law or agreed to in writing, software + * * distributed under the License is distributed on an "AS IS" BASIS, + * * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * * See the License for the specific language governing permissions and + * * limitations under the License. 
+ * + */ + +package org.apache.hudi.common.table.read; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.model.DeleteRecord; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.log.KeySpec; +import org.apache.hudi.common.engine.HoodieReaderContext; +import org.apache.hudi.common.table.log.block.HoodieDataBlock; +import org.apache.hudi.common.table.log.block.HoodieDeleteBlock; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.avro.Schema; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; + +public class HoodieKeyBasedFileGroupRecordBuffer implements HoodieFileGroupRecordBuffer { Review Comment: Can we add some docs to these new classes?
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1355983980 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java: ## @@ -102,6 +102,18 @@ public HoodieFileGroupReader(HoodieReaderContext readerContext, this.start = 0; this.length = Long.MAX_VALUE; this.baseFileIterator = new EmptyIterator<>(); +this.shouldUseRecordPosition = false; Review Comment: Oh, I see. There is no need to keep 2 constructors here: we can instantiate the record buffer from its factory class with the given flag `shouldUseRecordPosition`.
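The suggestion above, a single constructor that asks a factory for the right record buffer, could look roughly like the following. This is a sketch under assumptions only: every type here (`RecordBufferSketch`, `KeyBasedBufferSketch`, `PositionBasedBufferSketch`, `RecordBufferFactorySketch`, `FileGroupReaderSketch`) is a hypothetical stand-in, and none of them reflects the real signatures in the PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins, not Hudi code.
interface RecordBufferSketch {
    void put(String recordKey, long position, String value);

    int size();
}

// Buffers records by record key.
final class KeyBasedBufferSketch implements RecordBufferSketch {
    private final Map<String, String> byKey = new HashMap<>();

    @Override
    public void put(String recordKey, long position, String value) {
        byKey.put(recordKey, value);
    }

    @Override
    public int size() {
        return byKey.size();
    }
}

// Buffers records by their position in the base file.
final class PositionBasedBufferSketch implements RecordBufferSketch {
    private final Map<Long, String> byPosition = new HashMap<>();

    @Override
    public void put(String recordKey, long position, String value) {
        byPosition.put(position, value);
    }

    @Override
    public int size() {
        return byPosition.size();
    }
}

// The factory owns the choice of implementation.
final class RecordBufferFactorySketch {
    private RecordBufferFactorySketch() {
    }

    static RecordBufferSketch create(boolean shouldUseRecordPosition) {
        return shouldUseRecordPosition ? new PositionBasedBufferSketch() : new KeyBasedBufferSketch();
    }
}

// A single constructor takes the flag and delegates to the factory.
final class FileGroupReaderSketch {
    private final RecordBufferSketch recordBuffer;

    FileGroupReaderSketch(boolean shouldUseRecordPosition) {
        this.recordBuffer = RecordBufferFactorySketch.create(shouldUseRecordPosition);
    }

    RecordBufferSketch recordBuffer() {
        return recordBuffer;
    }
}
```

Keeping the branch inside the factory means the reader needs no second constructor, and adding another buffer type later would touch only the factory.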
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1355981308 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java: ## @@ -102,6 +102,18 @@ public HoodieFileGroupReader(HoodieReaderContext readerContext, this.start = 0; this.length = Long.MAX_VALUE; this.baseFileIterator = new EmptyIterator<>(); +this.shouldUseRecordPosition = false; Review Comment: `shouldUseRecordPosition` is always false now. Do we still need to keep it?
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758826067 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * bfea4593820ae5257f93f822986ef58168f69dde Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289) * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20295)
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1355980764 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordReader.java: ## @@ -81,11 +80,16 @@ private HoodieMergedLogRecordReader(HoodieReaderContext readerContext, Option partitionName, InternalSchema internalSchema, Option keyFieldOverride, - boolean enableOptimizedLogBlocksScan, HoodieRecordMerger recordMerger) { + boolean enableOptimizedLogBlocksScan, + HoodieRecordMerger recordMerger, + HoodieFileGroupRecordBuffer recordBuffer) { super(readerContext, fs, basePath, logFilePaths, readerSchema, latestInstantTime, readBlocksLazily, reverseReader, bufferSize, -instantRange, withOperationField, forceFullScan, partitionName, internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan, recordMerger); +instantRange, withOperationField, forceFullScan, partitionName, internalSchema, keyFieldOverride, enableOptimizedLogBlocksScan, +recordMerger, recordBuffer); this.records = new HashMap<>(); this.scannedPrefixes = new HashSet<>(); +this.recordBuffer = recordBuffer; Review Comment: Do we need to assign the record buffer 2 times?
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1758820588 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * bfea4593820ae5257f93f822986ef58168f69dde Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289) * 9262fe65ccfb3d7f74eb0ee35d4f822eeb1a67ea UNKNOWN
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758820147 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758813978 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291)
[jira] [Assigned] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-6793:
Assignee: Jonathan Vexler

> Support time-travel read in engine-agnostic FileGroupReader
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Jonathan Vexler
> Priority: Blocker
> Fix For: 1.0.0
[jira] [Assigned] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-6790:
Assignee: Jonathan Vexler

> Support incremental read in engine-agnostic FileGroupReader
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Jonathan Vexler
> Priority: Blocker
> Fix For: 1.0.0
[jira] [Assigned] (HUDI-6897) Improve SimpleConcurrentFileWritesConflictResolutionStrategy for NB-CC
[ https://issues.apache.org/jira/browse/HUDI-6897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6897: - Assignee: Sagar Sumit > Improve SimpleConcurrentFileWritesConflictResolutionStrategy for NB-CC > -- > > Key: HUDI-6897 > URL: https://issues.apache.org/jira/browse/HUDI-6897 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Danny Chen >Assignee: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > There is no need to throw concurrent modification exception for the simple > strategy under NB-CC, because the compactor would finally resolve the > conflicts instead. > Check test case > {code:java} > TestHoodieClientMultiWriter#testMultiWriterWithAsyncTableServicesWithConflict{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-2461) Support lock free multi-writer for metadata table
[ https://issues.apache.org/jira/browse/HUDI-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-2461: - Assignee: Sagar Sumit (was: Vinoth Chandar) > Support lock free multi-writer for metadata table > - > > Key: HUDI-2461 > URL: https://issues.apache.org/jira/browse/HUDI-2461 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata, multi-writer, writer-core >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Critical > Fix For: 1.0.0 > > > Even with the synchronous patch, we instantiate the metadata table in single-writer mode only. But we need to support async compaction and cleaning, so we need to think about supporting multi-writer down the line. > Details: All writes to the metadata table happen within the data table lock, including compaction and cleaning in the metadata table, since we do them inline. As we scale the metadata table infra with more indexes, we need to support async compaction and cleaning, and therefore multi-writer support. > One possibility: special transaction management for the metadata table. > Data table commits: all writes to the metadata table will be guarded by the data table lock (regular writes, clustering, compaction, everything). Regular writes will do the usual conflict resolution, whereas compaction and clustering may not. > Metadata table commits: there won't be any conflict resolution in general for the whole of the metadata table, but we will ensure every commit happens while holding a lock. Our presumption is that all conflict resolution has already happened within the data table before proceeding to commit in the metadata table, so no conflict resolution is needed specifically there. > Scheduling of compaction and cleaning will happen along with regular upserts, and we will have async compaction and cleaning support. So, when these async operations are looking to commit in the metadata table, they will acquire the lock, make the commit, and release the lock. Only one writer will be in progress during a metadata commit.
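A minimal sketch of the acquire, commit, release sequence described in HUDI-2461 above, assuming a single in-process `ReentrantLock` stands in for the table lock. `MetadataCommitSketch`, `commit`, and `committed` are hypothetical names used only for illustration, not Hudi classes or APIs; a real multi-writer setup would rely on a distributed lock provider instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a metadata table timeline; not a Hudi class.
final class MetadataCommitSketch {
    // Assumption: one in-process lock plays the role of the table lock.
    private final ReentrantLock tableLock = new ReentrantLock();
    private final List<String> committedInstants = new ArrayList<>();

    // Acquire the lock, record the commit, release the lock.
    // No conflict resolution happens here, mirroring the ticket's presumption
    // that it was already done on the data table.
    void commit(String instantTime) {
        tableLock.lock();
        try {
            committedInstants.add(instantTime);
        } finally {
            tableLock.unlock();
        }
    }

    // Returns a snapshot copy so callers never see a list mid-update.
    List<String> committed() {
        tableLock.lock();
        try {
            return new ArrayList<>(committedInstants);
        } finally {
            tableLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MetadataCommitSketch table = new MetadataCommitSketch();
        // Two async table services trying to commit at the same time.
        Thread compaction = new Thread(() -> table.commit("compaction-001"));
        Thread cleaning = new Thread(() -> table.commit("clean-002"));
        compaction.start();
        cleaning.start();
        compaction.join();
        cleaning.join();
        System.out.println(table.committed().size()); // prints 2
    }
}
```

Because every commit goes through the same lock, only one writer is ever inside `commit` at a time, which is the property the ticket asks for; the order in which the two services commit is not fixed.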
[jira] [Assigned] (HUDI-6480) Flink lockless multi-writer
[ https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6480: - Assignee: Danny Chen > Flink lockless multi-writer > --- > > Key: HUDI-6480 > URL: https://issues.apache.org/jira/browse/HUDI-6480 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6480) Flink lockless multi-writer
[ https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6480: -- Status: In Progress (was: Open) > Flink lockless multi-writer > --- > > Key: HUDI-6480 > URL: https://issues.apache.org/jira/browse/HUDI-6480 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6480) Flink lockless multi-writer
[ https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6480: -- Fix Version/s: 1.0.0 (was: 0.14.1) > Flink lockless multi-writer > --- > > Key: HUDI-6480 > URL: https://issues.apache.org/jira/browse/HUDI-6480 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6781) Add deltacommit timestamp to log file name
[ https://issues.apache.org/jira/browse/HUDI-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-6781. - Resolution: Done > Add deltacommit timestamp to log file name > -- > > Key: HUDI-6781 > URL: https://issues.apache.org/jira/browse/HUDI-6781 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6641) Remove the log append and always uses the current instant time in file name
[ https://issues.apache.org/jira/browse/HUDI-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-6641. - Resolution: Done > Remove the log append and always uses the current instant time in file name > --- > > Key: HUDI-6641 > URL: https://issues.apache.org/jira/browse/HUDI-6641 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
stream2000 commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758786265 @danny0405 Hi Danny, I have addressed all comments, PTAL~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
stream2000 commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1758785514 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758771641 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]
danny0405 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758765832 Here is the fix I found: https://github.com/apache/hudi/pull/8082 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]
danny0405 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758764305 cc @CTTY from the AWS team, do you have any thought that can help here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Unstable Execution Time and Many RequestHandler WARN Logs [hudi]
danny0405 commented on issue #8100: URL: https://github.com/apache/hudi/issues/8100#issuecomment-1758753847 Did you try the latest release? The fs view should perform better there.
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758735292 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 57442675289e3db3252449725380a72977f7e7fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290) * c8b824bf84288173cdff2b95e4115869154417fe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20293) * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20294) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758729502 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 57442675289e3db3252449725380a72977f7e7fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290) * c8b824bf84288173cdff2b95e4115869154417fe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20293) * 5415f318a8e991befbdc459b1f4d1f9bfb796c07 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1758696229 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 57442675289e3db3252449725380a72977f7e7fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290) * c8b824bf84288173cdff2b95e4115869154417fe UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi MERGE INTO on Glue fails when using functions such as (filter, zip_with) on array of structs [hudi]
rita-ihnatsyeva commented on issue #9838: URL: https://github.com/apache/hudi/issues/9838#issuecomment-1758607355 @ad1happy2go I tried your code in the prod env and it works fine, so I guess something is wrong with my input data, though for now I can't tell what. It doesn't seem like a reproducible bug.
Re: [I] [SUPPORT] java.lang.ClassCastException with incremental query [hudi]
bkosuru commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1758605903 Same exception with Hudi 0.14.0 and Spark 3.3.2. (GCP serverless 1.1) @danny0405 you said the issue should be resolved with Hudi 0.14.0. Do you know why it is still broken? Works fine with Hudi 0.14.0 and Spark 3.4.0 though. We have to upgrade to GCP serverless 2.1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Error when running a pipeline after an interrupt [hudi]
ingridymartinss commented on issue #9518: URL: https://github.com/apache/hudi/issues/9518#issuecomment-1758174848 We haven't figured it out yet. :( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
linliu-code commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1355445308 ## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java: ## @@ -51,4 +51,11 @@ public class HoodieReaderConfig { .sinceVersion("0.13.0") .withDocumentation("New optimized scan for log blocks that handles all multi-writer use-cases while appending to log files. " + "It also differentiates original blocks written by ingestion writers and compacted blocks written log compaction."); + + public static final ConfigProperty FILE_GROUP_READER_ENABLED = ConfigProperty + .key("hoodie.file.group.reader.enabled") + .defaultValue(true) + .markAdvanced() + .sinceVersion("1.0.0") Review Comment: I set the default to true just for testing purposes; I will change it to false. Sorry for the confusion.
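To illustrate the default discussed in this review comment, here is a minimal sketch of a flag that stays off unless it is set explicitly. It assumes plain `java.util.Properties` rather than Hudi's `ConfigProperty` builder; the key comes from the diff, while `FileGroupReaderFlagSketch` and the two path labels are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper; it does not use or mirror Hudi's ConfigProperty API.
final class FileGroupReaderFlagSketch {
    // Key taken from the diff under review.
    static final String KEY = "hoodie.file.group.reader.enabled";
    // Default the author says the PR will switch to.
    static final boolean DEFAULT_ENABLED = false;

    // Returns the configured value, falling back to the default when the key is unset.
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(
                props.getProperty(KEY, String.valueOf(DEFAULT_ENABLED)));
    }

    // Picks a read path label from the flag; both labels are illustrative only.
    static String chooseReadPath(Properties props) {
        return isEnabled(props) ? "file-group-reader" : "existing-reader";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(chooseReadPath(props)); // prints existing-reader
        props.setProperty(KEY, "true");
        System.out.println(chooseReadPath(props)); // prints file-group-reader
    }
}
```

With a default of false, existing jobs keep their current read path and only opt in by setting the key, which is why a default of true left over from testing would matter.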
[jira] [Commented] (HUDI-6928) Support position based merging in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774132#comment-17774132 ] Lin Liu commented on HUDI-6928: --- The PR is ready, but some critical review comments on refactoring still need to be addressed. > Support position based merging in HoodieFileGroupReader > --- > > Key: HUDI-6928 > URL: https://issues.apache.org/jira/browse/HUDI-6928 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0
[jira] [Updated] (HUDI-6928) Support position based merging in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6928: -- Status: Patch Available (was: In Progress) > Support position based merging in HoodieFileGroupReader > --- > > Key: HUDI-6928 > URL: https://issues.apache.org/jira/browse/HUDI-6928 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6928) Support position based merging in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6928: -- Status: In Progress (was: Open) > Support position based merging in HoodieFileGroupReader > --- > > Key: HUDI-6928 > URL: https://issues.apache.org/jira/browse/HUDI-6928 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6786) Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query
[ https://issues.apache.org/jira/browse/HUDI-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6786: -- Status: Patch Available (was: In Progress) > Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR > Snapshot Query > -- > > Key: HUDI-6786 > URL: https://issues.apache.org/jira/browse/HUDI-6786 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > Goal: When `NewHoodieParquetFileFormat` is enabled with > `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, the MOR > Snapshot query should use HoodieFileGroupReader. All relevant tests on basic > MOR snapshot query should pass (except for the caveats in the current > HoodieFileGroupReader, see other open tickets around HoodieFileGroupReader in > this EPIC). > The query logic is implemented in > `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; see the > following code for MOR snapshot query: > {code:java} > else { > if (logFiles.nonEmpty) { > val baseFile = createPartitionedFile(InternalRow.empty, > hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen) > buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, > filePath.getParent, requiredSchemaWithMandatory, > requiredSchemaWithMandatory, outputSchema, partitionSchema, > partitionValues, broadcastedHadoopConf.value.value) > } else { > throw new IllegalStateException("should not be here since file slice > should not have been broadcasted since it has no log or data files") > //baseFileReader(baseFile) > } {code} > `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, > with a new config `hoodie.read.use.new.file.group.reader`, by passing in the > correct base and log file list. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757979131 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]
hudi-bot commented on PR #9843: URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757960568 ## CI report: * 6b231840828f6b70965f4015de976500918f5703 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20292) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757957453 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan closed issue #9827: MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider URL: https://github.com/apache/hudi/issues/9827 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan commented on issue #9827: URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757888560 Thank you for helping me. The problem has been resolved, so I will close this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan commented on issue #9827: URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757878099 > m1 chip does not have good compatibility, can you try arm64 chip instead? Thank you; further suggestions are welcome. Note that the M1 chip is itself based on the arm64 architecture.
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
stream2000 commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757863389 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan commented on issue #9827: URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757864136 I analyzed the cause of this issue based on the source code; sharing it here in the hope that it helps others. On the master branch, when you create the Docker containers, docker/demo and docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar are copied into the directory /var/hoodie/ws/docker/.. in the adhoc-2 container. So when spark-submit runs, it uses /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar. But that jar may have been compiled by me from the branch 13.0 code. So the error was most likely caused by a version mismatch: I had not compiled the master branch code locally successfully, and my local hoodie-utilities.jar had previously been compiled from branch 13.0. https://github.com/apache/hudi/assets/19281198/12a3e983-344d-44bb-967e-18b1d270660c
Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]
ad1happy2go commented on issue #9845: URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757851940 @noahtaite Yes, converting to integer type before saving will work.
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757850550 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan commented on issue #9827: URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757792367 > @hanrongMan I was able to run the complete docker demo with 0.14.0 on M1 and didn't face any issues. What build command are you using? As the default profiles have changed, can you try with this - > > mvn -U clean package -Pintegration-tests -DskipTests -Dscala-2.11 -Dspark2.4 Thank you. I switched to branch 14.0, compiled using your command, and it ran successfully.
Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]
noahtaite commented on issue #9845: URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757792128 Hello @danny0405, we are ingesting from ~300 tables in ~2k customer databases which are not fully constrained, so we can expect to see null values in many fields. In this case I believe it is a missing linking ID. @ad1happy2go, happy to hear you have reproduced the issue; looking forward to hearing about a workaround and a timeline for the fix. Two things to note: 1 - I also reproduced this issue with ByteType, which Hudi seems to handle exactly the same way as ShortType. 2 - Our current workaround (temporary and hacky) is to convert all incoming ShortType and ByteType columns to IntegerType before saving to Hudi. This is working in our dev environment.
Re: [I] MacOs M1 Exception in thread "main" java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider [hudi]
hanrongMan commented on issue #9827: URL: https://github.com/apache/hudi/issues/9827#issuecomment-1757781453 > m1 chip does not have good compatibility, can you try arm64 chip instead? apple M1 chip -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Glue 4.0 Hudi 0.12.1 PreCommit validator i.e SqlQueryEqualityPreCommitValidator is not working [hudi]
ad1happy2go commented on issue #9183: URL: https://github.com/apache/hudi/issues/9183#issuecomment-1757756659 @abhisheksahani91 Do you have any more issues/doubts around this?
Re: [I] [SUPPORT] Doubt about handling old data arrival in hudi [hudi]
codope closed issue #8576: [SUPPORT] Doubt about handling old data arrival in hudi URL: https://github.com/apache/hudi/issues/8576
Re: [I] [SUPPORT] Docker Demo Issue With Current master(0.14.0-SNAPSHOT) [hudi]
codope closed issue #8447: [SUPPORT] Docker Demo Issue With Current master(0.14.0-SNAPSHOT) URL: https://github.com/apache/hudi/issues/8447
Re: [I] [SUPPORT] Docker Demo Issue With Current master(0.14.0-SNAPSHOT) [hudi]
ad1happy2go commented on issue #8447: URL: https://github.com/apache/hudi/issues/8447#issuecomment-1757712285 @agrawalreetika I confirmed that the docker demo also works fine with the latest release, 0.14.0. Closing this issue. Please reopen in case of any concerns. Thanks.
[I] [SUPPORT] Slow performance when inserting to a lot of partitions and metadata enabled [hudi]
VitoMakarevich opened a new issue, #9848: URL: https://github.com/apache/hudi/issues/9848

**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Recently we enabled metadata for one big table. This table includes ~700k partitions in the test environment, and a usual insert affects ~600 of them. It is also important to say that the problem is even more pressing when the timeline server is enabled.

The issue shows up as a slow `getting small files` stage: for 600 insert-affected partitions (600 tasks in this stage) it takes ~2-3 minutes, and for 12k tasks (partitions affected) it takes ~40 minutes.

Important details:

1. Metadata is enabled
2. Timeline server is enabled
3. Metadata table `hfile` file size: about 40 MB
4. Number of log files: "hoodie.metadata.compact.max.delta.commits" = "1"
5. Only file listing is enabled in metadata

I dug into this a lot and enabled detailed logs, and this is what I think may be the issue. In the logs "Metadata read for %s keys took [baseFileRead, logMerge] %s ms", half of the values are single-digit numbers and half are > 1000, e.g. Metadata read for 1 keys took [baseFileRead, logMerge] [0, 12456] ms. There are also logs like

`Updating metadata metrics (basefile_read.totalDuration=12155)`
`Updating metadata metrics (lookup_files.totalDuration=12156)`

So my suspicion is that, given such a large metadata `file` (40 MB), the hfile lookup is suboptimal. Are you aware of any issue similar to this (in 0.12.1 and 0.12.2)? As I see in the code, it seeks up to a partition path. Could it be that readers are somehow not being reused well, so these 40 MB files are seeked again and again, thousands of times?

As a remediation, we turned off the embedded timeline server. As I understand it, the same work is then done on executors, but it is less problematic there since parallelism is much higher.

**To Reproduce**

Steps to reproduce the behavior: probably generating a large set of partitions (up to 30-40 MB hfile size) and running an insert into thousands of partitions may reproduce it.

**Expected behavior**

`getting small files` should not take so long. Without metadata + embedded server it takes ~40 sec to check 12k partitions, while with metadata it takes 40+ minutes.

**Environment Description**

* Hudi version : 0.12.1-0.12.2
* Spark version : 3.3.0-3.3.1
* Hive version :
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes

**Additional context**

I can try to make a reproduction if you don't know about anything like this.

**Stacktrace**

```Add the stacktrace of the error.```
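The reporter's hypothesis is that readers over the metadata base file are reopened for every lookup instead of being reused. A minimal sketch of the reuse pattern in question, assuming nothing about Hudi's actual reader classes: `ReaderCache`, its `opener` function, and the `openCount` counter are hypothetical names used only to show that many lookups against one file can share a single open.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical illustration only; this is not Hudi's metadata reader code. */
final class ReaderCache<R> {

  private final Map<String, R> readers = new ConcurrentHashMap<>();
  private final AtomicInteger opens = new AtomicInteger();
  private final Function<String, R> opener;

  ReaderCache(Function<String, R> opener) {
    this.opener = opener;
  }

  /** Returns the cached reader for the file, opening it only on first use. */
  R get(String filePath) {
    return readers.computeIfAbsent(filePath, path -> {
      opens.incrementAndGet();
      return opener.apply(path);
    });
  }

  /** Number of times the (expensive) opener was actually invoked. */
  int openCount() {
    return opens.get();
  }

  public static void main(String[] args) {
    // The "reader" here is just a string; a real one would hold file handles and indexes.
    ReaderCache<String> cache = new ReaderCache<>(path -> "reader:" + path);
    for (int task = 0; task < 12_000; task++) {
      cache.get("metadata/files/base.hfile");
    }
    System.out.println("lookups=12000 opens=" + cache.openCount());
  }
}
```

Whether the slow `getting small files` stage is actually caused by missing reuse is not established in the report; the sketch only makes the suspected difference between "open per lookup" and "open once" concrete.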
Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]
hudi-bot commented on PR #9843: URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757693957 ## CI report: * 4a77d04deabcc24baa73f35a64509f86fd84d03c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20283) * 6b231840828f6b70965f4015de976500918f5703 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20292) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6917] Fix docker integ tests [hudi]
hudi-bot commented on PR #9843: URL: https://github.com/apache/hudi/pull/9843#issuecomment-1757597868 ## CI report: * 4a77d04deabcc24baa73f35a64509f86fd84d03c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20283) * 6b231840828f6b70965f4015de976500918f5703 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757581642 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * bfea4593820ae5257f93f822986ef58168f69dde Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757510755 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * 57442675289e3db3252449725380a72977f7e7fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757505901 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286) * ee3bbd6595f8a69ecaf53d9ac2b445533958832c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757486587 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286) * ee3bbd6595f8a69ecaf53d9ac2b445533958832c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6924] Fix hoodie table config not wok in table properties [hudi]
boneanxs commented on code in PR #9836: URL: https://github.com/apache/hudi/pull/9836#discussion_r1354717038

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala: ##

@@ -199,7 +184,7 @@ object HoodieOptionConfig {
   // extract primaryKey, preCombineField, type options
   def extractSqlOptions(options: Map[String, String]): Map[String, String] = {
-    val sqlOptions = mapTableConfigsToSqlOptions(options)
+    val sqlOptions = mapHoodieOptionsToSqlOptions(options)

Review Comment: `mapHoodieConfigsToSqlOptions` should be more accurate?

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala: ##

@@ -169,22 +170,6 @@ object HoodieOptionConfig {
       .toMap
   }

-  /**
-   * Get the table type from the table options.
-   * @param options
-   * @return
-   */
-  def getTableType(options: Map[String, String]): String = {

Review Comment: No need to delete this
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757389269 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287) * 57442675289e3db3252449725380a72977f7e7fa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20290) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757367948 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287) * 57442675289e3db3252449725380a72977f7e7fa UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757366765 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * 2543e0a84a337b02e4af14a3f8b2c4dafbd6d558 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20266) * bfea4593820ae5257f93f822986ef58168f69dde Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20289) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
hudi-bot commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757346235 ## CI report: * 8821b701b0ae25d6bcf17dd95f36be9cc8de084b UNKNOWN * 2543e0a84a337b02e4af14a3f8b2c4dafbd6d558 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20266) * bfea4593820ae5257f93f822986ef58168f69dde UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT]PrestoDB failed to query data from mor table [hudi]
ChandrasekharPo-Kore commented on issue #8078: URL: https://github.com/apache/hudi/issues/8078#issuecomment-1757345076 Any progress on this one? The query works on _ro but fails with this error on _rt. This is forcing us to use a COW table.
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118: URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757344904 ## CI report: * f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN * c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN * 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
boneanxs commented on code in PR #9617: URL: https://github.com/apache/hudi/pull/9617#discussion_r1354592241

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimeGeneratorBase.java: ##

@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.config.HoodieTimeGeneratorConfig;
+import org.apache.hudi.common.config.LockConfiguration;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.lock.LockProvider;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.exception.HoodieLockException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.util.concurrent.TimeUnit;
+
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS;
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_NUM_RETRIES;
+import static org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_CLIENT_NUM_RETRIES_PROP_KEY;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY;
+import static org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY;
+
+/**
+ * Base time generator facility that maintains lock-related utilities.
+ */
+public abstract class TimeGeneratorBase implements TimeGenerator, Serializable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(TimeGeneratorBase.class);
+
+  /**
+   * The lock provider.
+   */
+  private volatile LockProvider lockProvider;
+  /**
+   * The maximum times to retry in case there are failures.
+   */
+  private final int maxRetries;
+  /**
+   * The maximum time to wait for each time generation to resolve the clock skew issue on distributed hosts.
+   */
+  private final long maxWaitTimeInMs;
+  /**
+   * The maximum time to block for acquiring a lock.
+   */
+  private final int lockAcquireWaitTimeInMs;
+
+  protected final HoodieTimeGeneratorConfig config;
+  private final LockConfiguration lockConfiguration;
+
+  /**
+   * The hadoop configuration.
+   */
+  private final SerializableConfiguration hadoopConf;
+
+  public TimeGeneratorBase(HoodieTimeGeneratorConfig config, SerializableConfiguration hadoopConf) {
+    this.config = config;
+    this.lockConfiguration = config.getLockConfiguration();
+    this.hadoopConf = hadoopConf;
+
+    maxRetries = lockConfiguration.getConfig().getInteger(LOCK_ACQUIRE_CLIENT_NUM_RETRIES_PROP_KEY,
+        Integer.parseInt(DEFAULT_LOCK_ACQUIRE_NUM_RETRIES));
+    lockAcquireWaitTimeInMs = lockConfiguration.getConfig().getInteger(LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY,
+        Integer.parseInt(DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS));
+    maxWaitTimeInMs = lockConfiguration.getConfig().getLong(LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY,
+        Long.parseLong(DEFAULT_LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS));
+  }
+
+  protected LockProvider getLockProvider() {
+    // Perform lazy initialization of lock provider only if needed
+    if (lockProvider == null) {
+      synchronized (this) {
+        if (lockProvider == null) {
+          String lockProviderClass = lockConfiguration.getConfig().getString("hoodie.write.lock.provider");
+          LOG.info("LockProvider for TimeGenerator: " + lockProviderClass);
+          lockProvider = (LockProvider) ReflectionUtils.loadClass(lockProviderClass,
+              lockConfiguration, hadoopConf.get());
+        }
+      }
+    }
+    return lockProvider;
+  }
+
+  public void lock() {

Review Comment: Use `RetryHelper` to simplify code here.
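The review comment above suggests replacing a hand-written retry loop in `lock()` with a shared helper. A minimal sketch of the general shape such a helper could take; `SimpleRetry` and its `run` signature are hypothetical and are not the API of Hudi's `RetryHelper`, whose actual methods are not shown in this thread.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry utility for illustration; this is NOT Hudi's RetryHelper. */
final class SimpleRetry {

  private SimpleRetry() {
  }

  /**
   * Runs the action, retrying up to maxRetries times after the first failure
   * and sleeping waitMs between attempts. The last failure is rethrown.
   */
  static <T> T run(Callable<T> action, int maxRetries, long waitMs) throws Exception {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return action.call();
      } catch (InterruptedException e) {
        // Preserve the interrupt status and stop retrying.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
        if (attempt < maxRetries) {
          Thread.sleep(waitMs);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated lock acquisition that fails twice before succeeding.
    String result = run(() -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("lock busy");
      }
      return "locked";
    }, 5, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

With a helper of this shape, the retry count and wait time read from the lock configuration (`maxRetries`, `maxWaitTimeInMs` in the diff) would be passed in once, and the lock-acquisition body would shrink to the single attempt.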
Re: [PR] [HUDI-1623] Solid completion time on timeline [hudi]
boneanxs commented on PR #9617: URL: https://github.com/apache/hudi/pull/9617#issuecomment-1757292134

> Looks good overall, is there any way we can abstract that failure retries as a common utility? And add a UT for TimeGenerator.

Done.
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
danny0405 commented on code in PR #9118: URL: https://github.com/apache/hudi/pull/9118#discussion_r1354583732

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java: ##

@@ -488,11 +505,24 @@ private void flushRemaining(boolean endInput) {
     this.writeStatuses.addAll(writeStatus);
     // blocks flushing until the coordinator starts a new instant
     this.confirming = true;
+
+    writeMetrics.endFlushing();
+    writeMetrics.resetAfterCommit();
+  }
+
+  private void registerMetrics() {
+    MetricGroup metrics = getRuntimeContext().getMetricGroup();
+    writeMetrics = new FlinkStreamWriteMetrics(metrics);
+    writeMetrics.registerMetrics();
   }

   protected List writeBucket(String instant, DataBucket bucket, List records) {
     bucket.preWrite(records);
-    return writeFunction.apply(records, instant);
+    writeMetrics.startHandleClose();
+    List statuses = writeFunction.apply(records, instant);

Review Comment: Maybe we just rename `checkpointFlush` -> `dataFlush` and `singleFileFlush` -> `fileFlush`?
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
danny0405 commented on code in PR #9118: URL: https://github.com/apache/hudi/pull/9118#discussion_r1354581902

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java: ##

@@ -155,5 +168,15 @@ private void flushData(boolean endInput) {
     this.writeStatuses.addAll(writeStatus);
     // blocks flushing until the coordinator starts a new instant
     this.confirming = true;
+
+    writeMetrics.endCheckpointFlushing();
+    LOG.info("Flushing costs: {} ms", writeMetrics.getCheckpointFlushCosts());
+    writeMetrics.resetAfterCommit();

Review Comment: We'd better avoid logging on every data flush.
Re: [I] [SUPPORT] Unstable Execution Time and Many RequestHandler WARN Logs [hudi]
lovemylover042 commented on issue #8100: URL: https://github.com/apache/hudi/issues/8100#issuecomment-1757266624

@danny0405 I found that delta commits became slow because the client falls back to the secondary filesystem view when it gets a bad response from the remote timeline server. I think the bad response was caused by compaction running at the same time, which left the timeline server behind the client. Can I force a sync of the local view when the timeline server is behind the client?

- org.apache.hudi.timeline.service.RequestHandler line 501:

    // TODO: set refreshCheck to be true when timeline server became behind several times or some seconds
    if (refreshCheck) {
      long beginFinalCheck = System.currentTimeMillis();
      if (isLocalViewBehind(context)) {
        String errMsg = "Last known instant from client was "
            + context.queryParam(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, HoodieTimeline.INVALID_INSTANT_TS)
            + " but server has the following timeline "
            + viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM))
                .getTimeline().getInstants().collect(Collectors.toList());
        throw new BadRequestResponse(errMsg);
      }
      long endFinalCheck = System.currentTimeMillis();
      finalCheckTimeTaken = endFinalCheck - beginFinalCheck;
    }

- Environment Description:
  Hudi version : 0.10.1
  Spark version : 3.0.1
  Hadoop version : 3.1.1
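The TODO quoted in the comment above proposes enabling the refresh check only after the timeline server has been behind "several times or some seconds". A minimal sketch of that trigger condition, assuming it would be tracked per table; `BehindTracker` and its thresholds are hypothetical and do not exist in Hudi's `RequestHandler`.

```java
/** Hypothetical helper, not part of Hudi: decides when a forced refresh is due. */
final class BehindTracker {

  private final int maxConsecutiveBehind;
  private final long maxBehindMillis;
  private int consecutiveBehind;
  private long firstBehindAtMillis = -1L;

  BehindTracker(int maxConsecutiveBehind, long maxBehindMillis) {
    this.maxConsecutiveBehind = maxConsecutiveBehind;
    this.maxBehindMillis = maxBehindMillis;
  }

  /**
   * Records one "is the view behind?" observation and returns true when the
   * view has been behind for enough consecutive checks or for long enough.
   * State is reset whenever the view has caught up or a refresh is signalled.
   */
  synchronized boolean recordAndShouldRefresh(boolean behind, long nowMillis) {
    if (!behind) {
      reset();
      return false;
    }
    if (consecutiveBehind == 0) {
      firstBehindAtMillis = nowMillis;
    }
    consecutiveBehind++;
    boolean due = consecutiveBehind >= maxConsecutiveBehind
        || nowMillis - firstBehindAtMillis >= maxBehindMillis;
    if (due) {
      reset();
    }
    return due;
  }

  private void reset() {
    consecutiveBehind = 0;
    firstBehindAtMillis = -1L;
  }

  public static void main(String[] args) {
    BehindTracker tracker = new BehindTracker(3, 5_000L);
    long now = System.currentTimeMillis();
    System.out.println(tracker.recordAndShouldRefresh(true, now));
    System.out.println(tracker.recordAndShouldRefresh(true, now + 100));
    System.out.println(tracker.recordAndShouldRefresh(true, now + 200));
  }
}
```

The caller would pass the result of its own behind-check (for example the `isLocalViewBehind(context)` call in the quoted snippet) together with the current time, and force a sync only when the method returns true. Whether this is the right place to hook it in is an open question in the thread.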
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1354565693

## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java: ##

@@ -51,4 +51,11 @@ public class HoodieReaderConfig {
       .sinceVersion("0.13.0")
       .withDocumentation("New optimized scan for log blocks that handles all multi-writer use-cases while appending to log files. "
           + "It also differentiates original blocks written by ingestion writers and compacted blocks written log compaction.");
+
+  public static final ConfigProperty FILE_GROUP_READER_ENABLED = ConfigProperty
+      .key("hoodie.file.group.reader.enabled")
+      .defaultValue(true)
+      .markAdvanced()
+      .sinceVersion("1.0.0")

Review Comment: Does this option exist because we are not confident enough that the new reader covers all the read use cases, or because we are not sure it is robust enough? I see the default value is true, which means users would encounter these problems anyway. Can we address them and just remove this option?
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819: URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757244622 ## CI report: * a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN * e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Closed] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan
[ https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kong Wei closed HUDI-6714.
--
Resolution: Won't Do

A parameter that enables this feature already exists: hoodie.compact.schedule.inline

> HoodieStreamer support only schedule the compaction plan but not execute the plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Kong Wei
> Assignee: Kong Wei
> Priority: Major
>
> For HoodieStreamer (aka HoodieDeltaStreamer) writing a MOR table, the compaction mode can be *async*.
> In the async compaction mode, the hoodie-streamer schedules one compaction plan after each write operation and executes the compaction plan if needed. But the execution of compaction shares the spark job resources, which may delay writes.
> In our case, we want to execute the compaction offline to save the spark resources of the streamer and reduce the write latency. We found that scheduling the compaction plan offline fails while the streamer is writing (meaning we have to stop the streamer in order to schedule the plan offline).
> So we want the streamer only to schedule the compaction plan but not to execute it.
> But currently the streamer does not seem to support this case. If we set `--disable-compaction` to false, the streamer will not schedule the compaction plan anymore.
> So I want to add a param named `--enable-schedule-compaction` to the streamer, so that we can set `--disable-compaction`=false and `--enable-schedule-compaction`=true to only schedule the compaction in the streamer.
> The cases are as below:
> ||--disable-compaction||--enable-schedule-compaction||schedule plan||execute plan||
> |true|true or false|true|true|
> |false|true|true|false|
> |false|false|false|false|

-- This message was sent by Atlassian Jira (v8.20.10#820010)
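The resolution above points to the existing `hoodie.compact.schedule.inline` setting rather than a new CLI flag. A minimal sketch of writer properties for a "schedule only" setup, assuming that inline execution is kept off through `hoodie.compact.inline`; how HoodieStreamer combines these keys with its own `--disable-compaction` flag is not stated in the ticket and is left as an assumption here. `ScheduleOnlyCompactionProps` is a hypothetical helper name.

```java
import java.util.Properties;

/** Sketch only: builds properties for scheduling compaction plans without executing them inline. */
final class ScheduleOnlyCompactionProps {

  private ScheduleOnlyCompactionProps() {
  }

  static Properties build() {
    Properties props = new Properties();
    // Key quoted in the Jira resolution: schedule a compaction plan after writes.
    props.setProperty("hoodie.compact.schedule.inline", "true");
    // Assumed companion setting: do not execute compaction inline in the writer.
    props.setProperty("hoodie.compact.inline", "false");
    return props;
  }

  public static void main(String[] args) {
    build().forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```

With such a setup, a separate offline job would be responsible for executing the scheduled plans, which is the division of work the reporter was asking for.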
Re: [I] [SUPPORT] Enable metadata table, Spark write mor table duplicate data [hudi]
ad1happy2go commented on issue #9714: URL: https://github.com/apache/hudi/issues/9714#issuecomment-1757232992

```
inserts.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.parquet.small.file.limit", 128)
  .option("hoodie.datasource.hive_sync.partition_fields", "city")
  .option("hoodie.upsert.shuffle.parallelism", 200)
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.embed.timeline.server", true)
  .option("hoodie.datasource.write.streaming.ignore.failed.batch", false)
  .option("hoodie.cleaner.commits.retained", "15")
  .option("hoodie.datasource.hive_sync.table_properties", "spark.sql.partitionProvider=catalog")
  .option("hoodie.keep.min.commits", 25)
  .option("hoodie.keep.max.commits", 30)
  .option("hoodie.clean.async", false)
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
  .option("hoodie.payload.event.time.field", "ts")
  .option("hoodie.payload.ordering.field", "ts")
  .option("hoodie.write.markers.type", "DIRECT")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.index.bloom.filter.enable", "true")
  .option("hoodie.metadata.index.column.stats.enable", "true")
  .option("hoodie.metadata.index.column.stats.column.list", "uuid")
  .option("hoodie.bloom.index.use.metadata", "true")
  .mode(Append)
  .save(basePath);
```
[jira] [Updated] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan
[ https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kong Wei updated HUDI-6714:
---
Description:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction mode can be *async.*
In the async compaction mode, the hoodie-streamer will schedule one compaction plan after each write operation and execute compaction plan if need. But the execution of compaction will share the spark job resource, which may cause the write delay.
In our cases, we want to execute the compaction offline to save the spark resource of streamer and reduce the write latency. And we found that scheduling the compaction plan offline will fail while streamer is writing (means we have to stop the streamer in order to schedule the plan offline). So we want the streamer only to schedule the compaction plan but not to execute it.
But currently the streamer seems not support such case. If we set the `--disable-compaction` to false, the streamer will not schedule the compaction plan anymore.
So I want to add a param named --{_}enable-schedule-compaction{_} in the streamer,
and we can set --{_}disable-compaction{_}=false and {_}enable-schedule-compaction{_}=true to enable only schedule the compaction in streamer.
the cases like below:
||--disable-compaction||--enable-schedule-compaction||schedule plan||execute plan||
|true|true or false|true|true|
|false|true|true|false|
|false|false|false|false|

was:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction mode can be *async.*
In the async compaction mode, the hoodie-streamer will schedule one compaction plan after each write operation and execute compaction plan if need. But the execution of compaction will share the spark job resource, which may cause the write delay.
In our cases, we want to execute the compaction offline to save the spark resource for streamer and reduce the write latency. And we found that scheduling the compaction plan offline will fail while streamer is writing (means we have to stop the streamer in order to schedule the plan offline). So we only want the streamer to schedule the compaction but not to execute it.
But currently the streamer seems not support such case. If we set the `--disable-compaction` to false, the streamer will not schedule the compaction anymore.
So I want to add a param named --{_}enable-schedule-compaction{_} in the streamer,
and we can set --{_}disable-compaction{_}=false and {_}enable-schedule-compaction{_}=true to enable only schedule the compaction in streamer.
the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false --enable-schedule-compaction = true|true|false|
|--disable-compaction = false --enable-schedule-compaction = false|false|false|

> HoodieStreamer support only schedule the compaction plan but not execute the plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Kong Wei
> Assignee: Kong Wei
> Priority: Major
[jira] [Updated] (HUDI-6714) HoodieStreamer support only schedule the compaction plan but not execute the plan
[ https://issues.apache.org/jira/browse/HUDI-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kong Wei updated HUDI-6714:
---
Description:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction mode can be *async.*
In the async compaction mode, the hoodie-streamer will schedule one compaction plan after each write operation and execute compaction plan if need. But the execution of compaction will share the spark job resource, which may cause the write delay.
In our cases, we want to execute the compaction offline to save the spark resource for streamer and reduce the write latency. And we found that scheduling the compaction plan offline will fail while streamer is writing (means we have to stop the streamer in order to schedule the plan offline). So we only want the streamer to schedule the compaction but not to execute it.
But currently the streamer seems not support such case. If we set the `--disable-compaction` to false, the streamer will not schedule the compaction anymore.
So I want to add a param named --{_}enable-schedule-compaction{_} in the streamer,
and we can set --{_}disable-compaction{_}=false and {_}enable-schedule-compaction{_}=true to enable only schedule the compaction in streamer.
the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false --enable-schedule-compaction = true|true|false|
|--disable-compaction = false --enable-schedule-compaction = false|false|false|

was:
For HoodieStreamer(aka HoodieDeltaStreamer) writing MOR table, the compaction mode can *async.*
In the async compaction mode, the hoodie-streamer will schedule one compaction plan after each write operation and execute compaction plan if need. But the execution of compaction will share the spark job resource, which may cause the write delay.
In our cases, we want to execute the compaction offline to save the spark resource for streamer and reduce the write latency. And we found that scheduling the compaction plan offline will fail while streamer is writing (means we have to stop the streamer in order to schedule the plan offline). So we only want the streamer to schedule the compaction but not to execute it.
But currently the streamer seems not support such case. If we set the `--disable-compaction` to false, the streamer will not schedule the compaction anymore.
So I want to add a param named --{_}enable-schedule-compaction{_} in the streamer,
and we can set --{_}disable-compaction{_}=false and {_}enable-schedule-compaction{_}=true to enable only schedule the compaction in streamer.
the cases like below:
||param case||schedule plan||execute plan||
|--disable-compaction = true no matter --enable-schedule-compaction|true|true|
|--disable-compaction = false --enable-schedule-compaction = true|true|false|
|--disable-compaction = false --enable-schedule-compaction = false|false|false|

> HoodieStreamer support only schedule the compaction plan but not execute the plan
> -
>
> Key: HUDI-6714
> URL: https://issues.apache.org/jira/browse/HUDI-6714
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Kong Wei
> Assignee: Kong Wei
> Priority: Major
[hudi] branch master updated: [HUDI-6925] Do not list all partitions for 'alter table drop partition' (#9837)
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 21a8a3c3693 [HUDI-6925] Do not list all partitions for 'alter table drop partition' (#9837)
21a8a3c3693 is described below

commit 21a8a3c3693d550005b098f833630c6af8106aa7
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Wed Oct 11 03:17:21 2023 -0500

    [HUDI-6925] Do not list all partitions for 'alter table drop partition' (#9837)
---
 .../scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
index d5f46936be5..9751624e3bf 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
@@ -326,15 +326,7 @@ object HoodieSqlCommonUtils extends SparkAdapterSupport {
   def getPartitionPathToDrop(
       hoodieCatalogTable: HoodieCatalogTable,
       normalizedSpecs: Seq[Map[String, String]]): String = {
-    val table = hoodieCatalogTable.table
-    val allPartitionPaths = hoodieCatalogTable.getPartitionPaths
-    val enableHiveStylePartitioning = isHiveStyledPartitioning(allPartitionPaths, table)
-    val enableEncodeUrl = isUrlEncodeEnabled(allPartitionPaths, table)
-    val partitionFields = hoodieCatalogTable.partitionFields
-    val partitionsToDrop = normalizedSpecs.map(
-      makePartitionPath(partitionFields, _, enableEncodeUrl, enableHiveStylePartitioning)
-    ).mkString(",")
-    partitionsToDrop
+    normalizedSpecs.map(makePartitionPath(hoodieCatalogTable, _)).mkString(",")
   }

   private def makePartitionPath(partitionFields: Seq[String],
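The idea behind the commit above (derive the partition paths to drop from the specs alone, without first listing every existing partition) can be sketched in a few lines. This is an illustration in Java, not the Scala code from the commit: `PartitionPathSketch` and its methods are hypothetical names, the two boolean flags are assumed to come from table configuration rather than from inspecting existing paths, and `java.net.URLEncoder` is only a stand-in for whatever escaping Hudi applies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Hudi. */
final class PartitionPathSketch {

    /** Builds one partition path from a spec such as {city=sf, dt=2023/10/11}. */
    static String makePartitionPath(List<String> partitionFields, Map<String, String> spec,
                                    boolean urlEncode, boolean hiveStyle) {
        List<String> segments = new ArrayList<>();
        for (String field : partitionFields) {
            String value = spec.get(field);
            if (value == null) {
                throw new IllegalArgumentException("Missing value for partition field: " + field);
            }
            String rendered = urlEncode ? encode(value) : value;
            segments.add(hiveStyle ? field + "=" + rendered : rendered);
        }
        return String.join("/", segments);
    }

    /** Joins the paths for all specs with commas, as the diff does with mkString(","). */
    static String partitionPathsToDrop(List<String> partitionFields, List<Map<String, String>> specs,
                                       boolean urlEncode, boolean hiveStyle) {
        List<String> paths = new ArrayList<>();
        for (Map<String, String> spec : specs) {
            paths.add(makePartitionPath(partitionFields, spec, urlEncode, hiveStyle));
        }
        return String.join(",", paths);
    }

    // Stand-in for Hudi's own path escaping (an assumption of this sketch).
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because nothing here touches storage, the cost of building the list depends only on the number of specs in the `drop partition` statement, not on how many partitions the table has.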
Re: [PR] [HUDI-6925] Do not list all partitions for 'alter table drop partition' [hudi]
leesf merged PR #9837:
URL: https://github.com/apache/hudi/pull/9837
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
linliu-code commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1354336953

## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java: ##
@@ -146,28 +154,52 @@ public void initRecordIterators() {
    * @return {@code true} if the next record exists; {@code false} otherwise.
    * @throws IOException on reader error.
    */
-  public boolean hasNext() throws IOException {
+  public boolean hasNext() {
+    // Merge records from base file and log files.
+    int baseFileSequenceNo = 0;
     while (baseFileIterator.hasNext()) {
       T baseRecord = baseFileIterator.next();
-      String recordKey = readerContext.getRecordKey(baseRecord, readerState.baseFileAvroSchema);
-      Pair, Map> logRecordInfo = logFileRecordMapping.remove(recordKey);
-      Option resultRecord = logRecordInfo != null
-          ? merge(Option.of(baseRecord), Collections.emptyMap(), logRecordInfo.getLeft(), logRecordInfo.getRight())
-          : merge(Option.empty(), Collections.emptyMap(), Option.of(baseRecord), Collections.emptyMap());
+      Pair, Map> logRecordInfo;
+
+      if (shouldUseRecordPosition) {

Review Comment:
Good point. I also feel if-else is not very clean for new logic. After adding partial-merging, we will have many readers.
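One way to read the review comment above is that the key-based and position-based lookups could sit behind a small interface chosen once, instead of branching on `shouldUseRecordPosition` inside the merge loop. The sketch below is only an illustration of that idea under assumed names (`RecordLookups`, `LogRecordLookup`, `KeyBased`, `PositionBased` are all hypothetical and are not Hudi classes); it does not reflect how the PR was eventually structured.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical names throughout; a sketch of replacing the if-else with a strategy. */
final class RecordLookups {

    /** Finds and removes the log record that matches a base record, if any. */
    interface LogRecordLookup<T> {
        Optional<T> removeMatch(T baseRecord, long position);
    }

    /** Matches log records to base records by record key. */
    static final class KeyBased<T> implements LogRecordLookup<T> {
        private final Map<String, T> byKey;
        private final Function<T, String> keyOf;

        KeyBased(Map<String, T> byKey, Function<T, String> keyOf) {
            this.byKey = byKey;
            this.keyOf = keyOf;
        }

        @Override
        public Optional<T> removeMatch(T baseRecord, long position) {
            return Optional.ofNullable(byKey.remove(keyOf.apply(baseRecord)));
        }
    }

    /** Matches log records to base records by position in the base file. */
    static final class PositionBased<T> implements LogRecordLookup<T> {
        private final Map<Long, T> byPosition;

        PositionBased(Map<Long, T> byPosition) {
            this.byPosition = byPosition;
        }

        @Override
        public Optional<T> removeMatch(T baseRecord, long position) {
            return Optional.ofNullable(byPosition.remove(position));
        }
    }

    /** The flag is consulted once here, so the merge loop itself has no branch. */
    static <T> LogRecordLookup<T> choose(boolean useRecordPosition, Map<Long, T> byPosition,
                                         Map<String, T> byKey, Function<T, String> keyOf) {
        if (useRecordPosition) {
            return new PositionBased<>(byPosition);
        }
        return new KeyBased<>(byKey, keyOf);
    }
}
```

A further merging mode, such as the partial merging mentioned in the comment, would then be another implementation of the interface rather than another branch in `hasNext()`.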
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757040548

## CI report:

* a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
* 3f5ff2bfb0f3446718546765a9838d595317f748 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20285)
* e345e838d5daa8c25475e3b12e149d7e5abc5229 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20287)
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757038861

## CI report:

* f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
* c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
* a9b387e611bdc9c492a27c6adffe2bf74662be96 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19956)
* 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20286)
Re: [PR] [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs [hudi]
wombatu-kun commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r1354321206

## rfc/rfc-54/rfc-54.md: ##
@@ -0,0 +1,175 @@
+
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record

Review Comment:
I thought there was a naming convention in the community: the prefix "hudi" is for the project and its submodules, while "hoodie" is for classes. Maybe it would be better not to break this rule and not to use HudiTable as a class name?
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
hudi-bot commented on PR #9819:
URL: https://github.com/apache/hudi/pull/9819#issuecomment-1757026871

## CI report:

* a4985db0ce22fb4b4f2518ed70bd96890024a08b UNKNOWN
* 3f5ff2bfb0f3446718546765a9838d595317f748 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20285)
* e345e838d5daa8c25475e3b12e149d7e5abc5229 UNKNOWN
Re: [PR] [HUDI-2141] Support flink stream write metrics [hudi]
hudi-bot commented on PR #9118:
URL: https://github.com/apache/hudi/pull/9118#issuecomment-1757025319

## CI report:

* f6d7dd97c73898206da91b17144326a7dbbffae8 UNKNOWN
* c62db1fdf94ee2c1f9b9e539f7a4b1bb866beb7e UNKNOWN
* a9b387e611bdc9c492a27c6adffe2bf74662be96 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19956)
* 8d732e29104fbde138b6ab3fe6df8fb63e10ab07 UNKNOWN
Re: [PR] [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs [hudi]
wombatu-kun commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r1354314469

## rfc/rfc-54/rfc-54.md: ##
@@ -0,0 +1,183 @@
+
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record
+payload, etc. On one hand, this is the true strength of Hudi which makes it
+suitable for many use cases and offers the users a solution to the tradeoffs
+encountered in data systems. On the other, it has also resulted in the learning
+curve for new users to be steeper. In this RFC, we propose to streamline some of
+these configurations. Additionally, we propose a few table level APIs to create
+or update Hudi table programmatically. Together, they would help in a smoother
+onboarding experience and increase the usability of Hudi. It would also help
+existing users through better configuration maintenance.
+
+## Background
+
+Currently, users can create and update Hudi Table using three different
+ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
+[SQL](https://hudi.apache.org/docs/table_management)
+and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one

Review Comment:
But there is no DeltaStreamer anymore; it was renamed to just Streamer: https://hudi.apache.org/docs/hoodie_streaming_ingestion