[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1571396063 ## CI report: * fa6a550cbe4c52a22314796d4926a34c04d0e6cc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17541) * 130523be1324218f56ce15ddc6ac3255e7cfcd9a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571390186 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * be4650d94edd3eca1a15a956b6a60e2b8ec5daca Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17542) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] marsupialtail commented on pull request #8768: [HUDI-1407] Basic python reader for Hudi
marsupialtail commented on PR #8768: URL: https://github.com/apache/hudi/pull/8768#issuecomment-1571389826 Any word on when writers are coming? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
danny0405 commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571366756 cc @xiarixiaoyao Do you have chance to take a look at this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8859: set operation to bulk insert on fresh table
danny0405 commented on code in PR #8859: URL: https://github.com/apache/hudi/pull/8859#discussion_r1212585549 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -111,8 +111,13 @@ trait ProvidesHoodieConfig extends Logging { val tableType = hoodieCatalogTable.tableTypeName val tableConfig = hoodieCatalogTable.tableConfig -val combinedOpts: Map[String, String] = combineOptions(hoodieCatalogTable, tableConfig, sparkSession.sqlContext.conf, +var combinedOpts: Map[String, String] = combineOptions(hoodieCatalogTable, tableConfig, sparkSession.sqlContext.conf, defaultOpts = Map.empty, overridingOpts = extraOptions) +if (!combinedOpts.getOrElse(INSERT_DROP_DUPS.key(),INSERT_DROP_DUPS.defaultValue()).toBoolean && Review Comment: ```suggestion if (!combinedOpts.getOrElse(INSERT_DROP_DUPS.key(), INSERT_DROP_DUPS.defaultValue()).toBoolean && ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8858: [HUDI-6301][UBER] Support SqlFileBasedSource for DeltaStreamer
danny0405 commented on code in PR #8858: URL: https://github.com/apache/hudi/pull/8858#discussion_r1212583839 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SqlFileBasedSource.java: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.utilities.sources; + +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +import org.apache.hudi.DataSourceUtils; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.utilities.schema.SchemaProvider; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SparkSession; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.Collections; +import java.util.Scanner; + +/** + * File-based SQL Source that uses SQL queries in a file to read from any table. + * + * SQL file path should be configured using this hoodie config: + * + * hoodie.deltastreamer.source.sql.file = 'hdfs://xxx/source.sql' + * + * File-based SQL Source is used for one time backfill scenarios, this won't update the deltastreamer.checkpoint.key + * to the processed commit, instead it will fetch the latest successful checkpoint key and set that value as + * this backfill commits checkpoint so that it won't interrupt the regular incremental processing. + * + * To fetch and use the latest incremental checkpoint, you need to also set this hoodie_conf for deltastremer jobs: + * + * hoodie.write.meta.key.prefixes = 'deltastreamer.checkpoint.key' + */ +public class SqlFileBasedSource extends RowSource { + + private static final Logger LOG = LoggerFactory.getLogger(SqlFileBasedSource.class); + private final String sourceSqlFile; + private final boolean shouldEmitCheckPoint; + + public SqlFileBasedSource( + TypedProperties props, + JavaSparkContext sparkContext, + SparkSession sparkSession, + SchemaProvider schemaProvider) { +super(props, sparkContext, sparkSession, schemaProvider); +DataSourceUtils.checkRequiredProperties( +props, Collections.singletonList(SqlFileBasedSource.Config.SOURCE_SQL_FILE)); +sourceSqlFile = props.getString(SqlFileBasedSource.Config.SOURCE_SQL_FILE); +shouldEmitCheckPoint = props.getBoolean(Config.EMIT_EPOCH_CHECKPOINT, false); + } + + @Override + protected Pair>, String> fetchNextBatch( + Option lastCkptStr, long sourceLimit) { +Dataset rows = null; +final FileSystem fs = FSUtils.getFs(sourceSqlFile, sparkContext.hadoopConfiguration(), true); +try { + final Scanner scanner = new Scanner(fs.open(new Path(sourceSqlFile))); + scanner.useDelimiter(";"); + while (scanner.hasNext()) { +String sqlStr = scanner.next().trim(); +if (!sqlStr.isEmpty()) { + LOG.info(sqlStr); + // overwrite the same dataset object until the last statement then return. + rows = sparkSession.sql(sqlStr); +} + } + return Pair.of(Option.of(rows), shouldEmitCheckPoint ? String.valueOf(System.currentTimeMillis()) : null); +} catch (IOException ioe) { + throw new HoodieIOException("Error reading source SQL file.", ioe); +} + } + + /** + * Configs supported. + */ + private static class Config { +private static final String SOURCE_SQL_FILE = "hoodie.deltastreamer.source.sql.file"; +private static final String EMIT_EPOCH_CHECKPOINT = "hoodie.deltastreamer.source.sql.checkpoint.emit"; Review Comment: Should we make these keys public? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8858: [HUDI-6301][UBER] Support SqlFileBasedSource for DeltaStreamer
danny0405 commented on code in PR #8858: URL: https://github.com/apache/hudi/pull/8858#discussion_r1212583636 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java: ## @@ -2325,7 +2325,7 @@ public void testCsvDFSSourceNoHeaderWithoutSchemaProviderAndWithTransformer() th public void testCsvDFSSourceNoHeaderWithSchemaProviderAndTransformer() throws Exception { // The CSV files do not have header, the columns are separated by '\t' // File schema provider is used, transformer is applied -// In this case, the source and target schema come from the Avro schema files +// In this case, the source and target schema come from the Avro schema filesTestSqlFileBasedSour testCsvDFSSource(false, '\t', true, Collections.singletonList(TripsWithDistanceTransformer.class.getName())); Review Comment: unnecessary change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8783: [HUDI-5569] Maintain commit timeline even in case of long standing inflights
danny0405 commented on code in PR #8783: URL: https://github.com/apache/hudi/pull/8783#discussion_r1212579733 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -460,19 +465,12 @@ private Stream getCommitInstantsToArchive() throws IOException { return !(firstSavepoint.isPresent() && compareTimestamps(firstSavepoint.get().getTimestamp(), LESSER_THAN_OR_EQUALS, s.getTimestamp())); } }).filter(s -> { -// Ensure commits >= the oldest pending compaction/replace commit is retained -return oldestPendingCompactionAndReplaceInstant +// oldestCommitToRetain is the highest completed commit instant that is less than the oldest inflight instant. +// By filter out any commit >= oldestCommitToRetain, we can ensure there are no gaps in the timeline +// when inflight commits are present. +return oldestCommitToRetain Review Comment: Yeah, it can fixes https://github.com/apache/hudi/pull/7738, although a little obscure to understand. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -460,19 +465,12 @@ private Stream getCommitInstantsToArchive() throws IOException { return !(firstSavepoint.isPresent() && compareTimestamps(firstSavepoint.get().getTimestamp(), LESSER_THAN_OR_EQUALS, s.getTimestamp())); } }).filter(s -> { -// Ensure commits >= the oldest pending compaction/replace commit is retained -return oldestPendingCompactionAndReplaceInstant +// oldestCommitToRetain is the highest completed commit instant that is less than the oldest inflight instant. +// By filter out any commit >= oldestCommitToRetain, we can ensure there are no gaps in the timeline +// when inflight commits are present. +return oldestCommitToRetain Review Comment: Yeah, it can fix https://github.com/apache/hudi/pull/7738, although a little obscure to understand. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
SteNicholas commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1212572127 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java: ## @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static org.apache.parquet.filter2.predicate.FilterApi.notEq; +import static org.apache.parquet.filter2.predicate.FilterApi.or; +import static org.apache.parquet.io.api.Binary.fromConstantByteArray; +import static org.apache.parquet.io.api.Binary.fromString; + +/** + * Tool to predicate the {@link org.apache.flink.table.expressions.ResolvedExpression}s. + */ +public class ExpressionPredicates { Review Comment: @XuQianJin-Stars, why implement `ExpressionVisitor` interface? `ExpressionEvaluators` doesn't also implement this interface. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571341385 ## CI report: * 0781a5259d9432c5264f642e392b27ca5eadf6fd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17547) * 8f4d284eef389ff420b473b9a721212de46302da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17549) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
SteNicholas commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1212572127 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java: ## @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static org.apache.parquet.filter2.predicate.FilterApi.notEq; +import static org.apache.parquet.filter2.predicate.FilterApi.or; +import static org.apache.parquet.io.api.Binary.fromConstantByteArray; +import static org.apache.parquet.io.api.Binary.fromString; + +/** + * Tool to predicate the {@link org.apache.flink.table.expressions.ResolvedExpression}s. + */ +public class ExpressionPredicates { Review Comment: @XuQianJin-Stars, why implement `ExpressionVisitor` interface? `ExpressionEvoluators` doesn't also implement this interface. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1571340159 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 55ece2d5ee0d752f48d7c33cd6b8235b5ffacac5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17540) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
SteNicholas commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1212572127 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java: ## @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static org.apache.parquet.filter2.predicate.FilterApi.notEq; +import static org.apache.parquet.filter2.predicate.FilterApi.or; +import static org.apache.parquet.io.api.Binary.fromConstantByteArray; +import static org.apache.parquet.io.api.Binary.fromString; + +/** + * Tool to predicate the {@link org.apache.flink.table.expressions.ResolvedExpression}s. + */ +public class ExpressionPredicates { Review Comment: Why implement `ExpressionVisitor` interface? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571335287 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17538) * 0781a5259d9432c5264f642e392b27ca5eadf6fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17547) * 8f4d284eef389ff420b473b9a721212de46302da UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
hudi-bot commented on PR #8792: URL: https://github.com/apache/hudi/pull/8792#issuecomment-1571335091 ## CI report: * f07a16ccf74a657ff312c5706d617fdaac23a752 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17546) * 683dc368e714ace1c44d741d642f1fe64b7910b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17548) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1571330406 ## CI report: * fa6a550cbe4c52a22314796d4926a34c04d0e6cc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17541) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
hudi-bot commented on PR #8792: URL: https://github.com/apache/hudi/pull/8792#issuecomment-1571330366 ## CI report: * 14dd620516bf89c288bf437aec74e08ee3da27a3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17454) * f07a16ccf74a657ff312c5706d617fdaac23a752 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17546) * 683dc368e714ace1c44d741d642f1fe64b7910b2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #8855: [SUPPORT][FLINK SQL] Can not insert join result into hudi table
danny0405 commented on issue #8855: URL: https://github.com/apache/hudi/issues/8855#issuecomment-1571318663 The stream writer has no outputs, but it would flush the records to Hudi table anyway. Does your job checkpoint proceeds normally? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6304) [UBER] Rollback enhancements
Danny Chen created HUDI-6304: Summary: [UBER] Rollback enhancements Key: HUDI-6304 URL: https://issues.apache.org/jira/browse/HUDI-6304 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: Danny Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on a diff in pull request #8860: Fix spark sql core flow
danny0405 commented on code in PR #8860: URL: https://github.com/apache/hudi/pull/8860#discussion_r1212546857 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala: ## @@ -46,18 +47,18 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase { "COPY_ON_WRITE|false|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE", "COPY_ON_WRITE|true|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE", "COPY_ON_WRITE|true|true|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE", - "COPY_ON_WRITE|false|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM", - "COPY_ON_WRITE|true|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM", - "COPY_ON_WRITE|true|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM", +// "COPY_ON_WRITE|false|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM", +// "COPY_ON_WRITE|true|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM", Review Comment: can we just remove the comment? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8830: [MINOR] auto generate init client id
danny0405 commented on code in PR #8830: URL: https://github.com/apache/hudi/pull/8830#discussion_r1212541100 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/configuration/TestOptionsInference.java: ## @@ -69,6 +70,12 @@ void testSetupClientId() throws Exception { } } + @Test + void testAutoGenerateClient() { + Configuration conf = getConf(); + OptionsInference.setupClientId(conf); + assertNotNull(conf.getString(FlinkOptions.WRITE_CLIENT_ID), "auto generate client failed!"); + } Review Comment: Can you explain again why we need this fix, when a Flink job was submitted, it is assigned a client id automically, this id is hold by this job with a hearbeat, when another job was submitted, it saw the heartbeat and would create a new client id, is this what you saw in you case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571301873 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17538) * 0781a5259d9432c5264f642e392b27ca5eadf6fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17547) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
hudi-bot commented on PR #8792: URL: https://github.com/apache/hudi/pull/8792#issuecomment-1571301640 ## CI report: * 14dd620516bf89c288bf437aec74e08ee3da27a3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17454) * f07a16ccf74a657ff312c5706d617fdaac23a752 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17546) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8861: [HUDI-6303] Bump flink version to 1.16.2 and 1.17.1
hudi-bot commented on PR #8861: URL: https://github.com/apache/hudi/pull/8861#issuecomment-1571296759 ## CI report: * 611580e4214c5a3cc1922f06fb820b5fc3a7355e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17545) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571296743 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17538) * 0781a5259d9432c5264f642e392b27ca5eadf6fd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
hudi-bot commented on PR #8792: URL: https://github.com/apache/hudi/pull/8792#issuecomment-1571296608 ## CI report: * 14dd620516bf89c288bf437aec74e08ee3da27a3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17454) * f07a16ccf74a657ff312c5706d617fdaac23a752 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8861: [HUDI-6303] Bump flink version to 1.16.2 and 1.17.1
hudi-bot commented on PR #8861: URL: https://github.com/apache/hudi/pull/8861#issuecomment-1571292341 ## CI report: * 611580e4214c5a3cc1922f06fb820b5fc3a7355e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8840: [HUDI-5352] Fix `LocalDate` serialization in colstats
danny0405 commented on code in PR #8840: URL: https://github.com/apache/hudi/pull/8840#discussion_r1212530303 ## hudi-common/src/main/java/org/apache/hudi/common/util/JsonUtils.java: ## @@ -20,41 +20,74 @@ package org.apache.hudi.common.util; import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.util.Lazy; import com.fasterxml.jackson.annotation.JsonAutoDetect; import com.fasterxml.jackson.annotation.PropertyAccessor; import com.fasterxml.jackson.core.JsonProcessingException; import com.fasterxml.jackson.databind.DeserializationFeature; import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.SerializationFeature; +import com.fasterxml.jackson.databind.util.StdDateFormat; +import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule; /** * Utils for JSON serialization and deserialization. */ public class JsonUtils { - private static final ObjectMapper MAPPER = new ObjectMapper(); - - static { -MAPPER.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES); -// We need to exclude custom getters, setters and creators which can use member fields -// to derive new fields, so that they are not included in the serialization -MAPPER.setVisibility(PropertyAccessor.FIELD, JsonAutoDetect.Visibility.ANY); -MAPPER.setVisibility(PropertyAccessor.GETTER, JsonAutoDetect.Visibility.NONE); -MAPPER.setVisibility(PropertyAccessor.IS_GETTER, JsonAutoDetect.Visibility.NONE); -MAPPER.setVisibility(PropertyAccessor.SETTER, JsonAutoDetect.Visibility.NONE); -MAPPER.setVisibility(PropertyAccessor.CREATOR, JsonAutoDetect.Visibility.NONE); - } + private static final Lazy MAPPER = Lazy.lazily(JsonUtils::instantiateObjectMapper); public static ObjectMapper getObjectMapper() { -return MAPPER; +return MAPPER.get(); } public static String toString(Object value) { try { - return MAPPER.writeValueAsString(value); + return MAPPER.get().writeValueAsString(value); } catch (JsonProcessingException e) { throw new HoodieIOException( "Fail to convert the class: " + value.getClass().getName() + " to Json String", e); } } + + private static ObjectMapper instantiateObjectMapper() { +ObjectMapper mapper = new ObjectMapper(); + +registerModules(mapper); + +// We're writing out dates as their string representations instead of (int) timestamps +mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS); +// NOTE: This is necessary to make sure that w/ Jackson >= 2.11 colon is not infixed +// into the timezone value ("+00:00" as opposed to "+" before 2.11) +// While Jackson is able to parse both of these formats, we keep it as false +// to make sure metadata produced by Hudi stays consistent across Jackson versions +configureColonInTimezone(mapper); Review Comment: Yeah, wondering when we need to serialize the column stats to json string? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6293) Make HoodieFlinkCompactor's parallelism of compact_task more reasonable.
[ https://issues.apache.org/jira/browse/HUDI-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6293. Resolution: Fixed Fixed via master branch: 00d50e91abe24aba31daa2fe2806de5414f03c77 > Make HoodieFlinkCompactor's parallelism of compact_task more reasonable. > - > > Key: HUDI-6293 > URL: https://issues.apache.org/jira/browse/HUDI-6293 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: eric >Priority: Major > Labels: pull-request-available > Attachments: image-2023-05-31-16-41-02-798.png > > > !image-2023-05-31-16-41-02-798.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6293) Make HoodieFlinkCompactor's parallelism of compact_task more reasonable.
[ https://issues.apache.org/jira/browse/HUDI-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6293: - Fix Version/s: 0.14.0 > Make HoodieFlinkCompactor's parallelism of compact_task more reasonable. > - > > Key: HUDI-6293 > URL: https://issues.apache.org/jira/browse/HUDI-6293 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: image-2023-05-31-16-41-02-798.png > > > !image-2023-05-31-16-41-02-798.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6293] Make HoodieFlinkCompactor's parallelism of compact_task more reasonable (#8854)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 00d50e91abe [HUDI-6293] Make HoodieFlinkCompactor's parallelism of compact_task more reasonable (#8854) 00d50e91abe is described below commit 00d50e91abe24aba31daa2fe2806de5414f03c77 Author: Dongsj <90449228+eric9...@users.noreply.github.com> AuthorDate: Thu Jun 1 11:38:27 2023 +0800 [HUDI-6293] Make HoodieFlinkCompactor's parallelism of compact_task more reasonable (#8854) Co-authored-by: dongsj --- .../java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java index 63dfd26c4ac..e396897dc7e 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java @@ -274,10 +274,12 @@ public class HoodieFlinkCompactor { List instants = compactionInstantTimes.stream().map(HoodieTimeline::getCompactionRequestedInstant).collect(Collectors.toList()); + int totalOperations = Math.toIntExact(compactionPlans.stream().mapToLong(pair -> pair.getRight().getOperations().size()).sum()); + // get compactionParallelism. int compactionParallelism = conf.getInteger(FlinkOptions.COMPACTION_TASKS) == -1 - ? Math.toIntExact(compactionPlans.stream().mapToLong(pair -> pair.getRight().getOperations().size()).sum()) - : conf.getInteger(FlinkOptions.COMPACTION_TASKS); + ? totalOperations + : Math.min(conf.getInteger(FlinkOptions.COMPACTION_TASKS), totalOperations); LOG.info("Start to compaction for instant " + compactionInstantTimes);
[GitHub] [hudi] danny0405 merged pull request #8854: [HUDI-6293]Make HoodieFlinkCompactor's parallelism of compact_task m…
danny0405 merged PR #8854: URL: https://github.com/apache/hudi/pull/8854 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
danny0405 commented on code in PR #8792: URL: https://github.com/apache/hudi/pull/8792#discussion_r1212527300 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java: ## @@ -93,7 +93,7 @@ public static HoodieWriteConfig createMetadataWriteConfig( .withCleanerParallelism(parallelism) .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS) .withFailedWritesCleaningPolicy(failedWritesCleaningPolicy) -.retainCommits(DEFAULT_METADATA_CLEANER_COMMITS_RETAINED) + .retainCommits(Math.min(writeConfig.getCleanerCommitsRetained(),DEFAULT_METADATA_CLEANER_COMMITS_RETAINED)) Review Comment: ```suggestion .retainCommits(Math.min(writeConfig.getCleanerCommitsRetained(), DEFAULT_METADATA_CLEANER_COMMITS_RETAINED)) ``` ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBase.java: ## @@ -400,7 +400,7 @@ protected HoodieWriteConfig getMetadataWriteConfig(HoodieWriteConfig writeConfig .withCleanerParallelism(parallelism) .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS) .withFailedWritesCleaningPolicy(HoodieFailedWritesCleaningPolicy.LAZY) -.retainCommits(DEFAULT_METADATA_CLEANER_COMMITS_RETAINED) + .retainCommits(Math.min(writeConfig.getCleanerCommitsRetained(),DEFAULT_METADATA_CLEANER_COMMITS_RETAINED)) Review Comment: ```suggestion .retainCommits(Math.min(writeConfig.getCleanerCommitsRetained(), DEFAULT_METADATA_CLEANER_COMMITS_RETAINED)) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #8740: [HUDI-6231] Handle glue comments
danny0405 commented on PR #8740: URL: https://github.com/apache/hudi/pull/8740#issuecomment-1571277471 @parisni Hi, do we have plan to push-forward this feature? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #8857: [SUPPORT] Column comments not syncing to AWS Glue Catalog
danny0405 commented on issue #8857: URL: https://github.com/apache/hudi/issues/8857#issuecomment-1571274339 Guess this is what you needed: https://github.com/apache/hudi/pull/8740/files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
XuQianJin-Stars commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1212523550 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java: ## @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static org.apache.parquet.filter2.predicate.FilterApi.notEq; +import static org.apache.parquet.filter2.predicate.FilterApi.or; +import static org.apache.parquet.io.api.Binary.fromConstantByteArray; +import static org.apache.parquet.io.api.Binary.fromString; + +/** + * Tool to predicate the {@link org.apache.flink.table.expressions.ResolvedExpression}s. + */ +public class ExpressionPredicates { Review Comment: ```implements ExpressionVisitor``` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
XuQianJin-Stars commented on code in PR #8437: URL: https://github.com/apache/hudi/pull/8437#discussion_r1212523550 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java: ## @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static org.apache.parquet.filter2.predicate.FilterApi.notEq; +import static org.apache.parquet.filter2.predicate.FilterApi.or; +import static org.apache.parquet.io.api.Binary.fromConstantByteArray; +import static org.apache.parquet.io.api.Binary.fromString; + +/** + * Tool to predicate the {@link org.apache.flink.table.expressions.ResolvedExpression}s. + */ +public class ExpressionPredicates { Review Comment: implements ExpressionVisitor ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on pull request #8076: [HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table
codope commented on PR #8076: URL: https://github.com/apache/hudi/pull/8076#issuecomment-1571270237 Looks good to me. @yihua @nsivabalan If you can take one pass, that would be great. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6286) Overwrite mode should not delete old data
[ https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6286: - Assignee: Hui An (was: Sagar Sumit) > Overwrite mode should not delete old data > - > > Key: HUDI-6286 > URL: https://issues.apache.org/jira/browse/HUDI-6286 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Reporter: Hui An >Assignee: Hui An >Priority: Major > Fix For: 0.14.0 > > > https://github.com/apache/hudi/pull/8076/files#r1127283648 > For *Overwrite* mode, we should not delete the basePath. Just overwrite the > existing data -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6286) Overwrite mode should not delete old data
[ https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6286: -- Fix Version/s: 0.14.0 > Overwrite mode should not delete old data > - > > Key: HUDI-6286 > URL: https://issues.apache.org/jira/browse/HUDI-6286 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Reporter: Hui An >Priority: Major > Fix For: 0.14.0 > > > https://github.com/apache/hudi/pull/8076/files#r1127283648 > For *Overwrite* mode, we should not delete the basePath. Just overwrite the > existing data -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6286) Overwrite mode should not delete old data
[ https://issues.apache.org/jira/browse/HUDI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-6286: - Assignee: Sagar Sumit > Overwrite mode should not delete old data > - > > Key: HUDI-6286 > URL: https://issues.apache.org/jira/browse/HUDI-6286 > Project: Apache Hudi > Issue Type: Bug > Components: spark, writer-core >Reporter: Hui An >Assignee: Sagar Sumit >Priority: Major > Fix For: 0.14.0 > > > https://github.com/apache/hudi/pull/8076/files#r1127283648 > For *Overwrite* mode, we should not delete the basePath. Just overwrite the > existing data -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6290) Fix Flink MDT compaction strategy
[ https://issues.apache.org/jira/browse/HUDI-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6290. Resolution: Fixed Fixed via master branch: 399b46c494252f16a4efcd9f19ccfa66b488b563 > Fix Flink MDT compaction strategy > - > > Key: HUDI-6290 > URL: https://issues.apache.org/jira/browse/HUDI-6290 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated (6671d1a2460 -> 399b46c4942)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 6671d1a2460 [HUDI-6294] Fix log classname in SqlQuerySingleResultPreCommitValidator (#8853) add 399b46c4942 [HUDI-6290] Fix Flink MDT compaction strategy (#8850) No new revisions were added by this update. Summary of changes: .../metadata/HoodieBackedTableMetadataWriter.java | 51 ++ .../FlinkHoodieBackedTableMetadataWriter.java | 8 .../apache/hudi/table/ITTestHoodieDataSource.java | 8 ++-- 3 files changed, 43 insertions(+), 24 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #8850: [HUDI-6290] Fix Flink MDT compaction strategy
danny0405 merged PR #8850: URL: https://github.com/apache/hudi/pull/8850 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #8792: [HUDI-6256] Fix the data table archiving and MDT cleaning config conf…
flashJd commented on code in PR #8792: URL: https://github.com/apache/hudi/pull/8792#discussion_r1212516124 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java: ## @@ -247,34 +247,16 @@ void testSyncMetadataTable() throws Exception { assertThat("One instant need to sync to metadata table", completedTimeline.countInstants(), is(7)); assertThat(completedTimeline.nthFromLastInstant(1).get().getTimestamp(), is(instant + "001")); assertThat(completedTimeline.nthFromLastInstant(1).get().getAction(), is(HoodieTimeline.COMMIT_ACTION)); -// write another 2 commits -for (int i = 7; i < 8; i++) { + +// write another 27 commits to trigger the first clean +for (int i = 7; i < 34; i++) { instant = mockWriteWithMetadata(); - metadataTableMetaClient.reloadActiveTimeline(); - completedTimeline = metadataTableMetaClient.getActiveTimeline().filterCompletedInstants(); - assertThat("One instant need to sync to metadata table", completedTimeline.countInstants(), is(i + 1)); Review Comment: That sounds reasonable,I've modified it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
SteNicholas commented on PR #8437: URL: https://github.com/apache/hudi/pull/8437#issuecomment-1571256809 @danny0405, I have addressed comments. Please help to review again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571254989 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 720b8aec68410a1e48e57fc540b1885bee6a7423 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17535) * be4650d94edd3eca1a15a956b6a60e2b8ec5daca Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17542) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8830: [MINOR] auto generate init client id
hudi-bot commented on PR #8830: URL: https://github.com/apache/hudi/pull/8830#issuecomment-1571254881 ## CI report: * 065791647cb669b5d18c4c1f4e9dfcf9d4e0c9c1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
SteNicholas commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571254679 @xiarixiaoyao, @yihua, could you help to review this pull request? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6303) Bump flink version to 1.16.2 and 1.17.1
[ https://issues.apache.org/jira/browse/HUDI-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6303: - Labels: pull-request-available (was: ) > Bump flink version to 1.16.2 and 1.17.1 > --- > > Key: HUDI-6303 > URL: https://issues.apache.org/jira/browse/HUDI-6303 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Weijie Guo >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] reswqa opened a new pull request, #8861: [HUDI-6303] Bump flink version to 1.16.2 and 1.17.1
reswqa opened a new pull request, #8861: URL: https://github.com/apache/hudi/pull/8861 ### Change Logs Bump flink version to 1.16.2 and 1.17.1 ### Impact Expected no impact ### Risk level (write none, low medium or high below) None ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
nsivabalan commented on code in PR #8837: URL: https://github.com/apache/hudi/pull/8837#discussion_r1212500637 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -837,10 +840,75 @@ public void update(HoodieCleanMetadata cleanMetadata, String instantTime) { */ @Override public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { -processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, -metadataMetaClient.getActiveTimeline(), restoreMetadata, getRecordsGenerationParams(), instantTime, -metadata.getSyncedInstantTime()), false); -closeInternal(); +dataMetaClient.reloadActiveTimeline(); + +// Since the restore has completed on the dataset, the latest write timeline instant is the one to which the +// restore was performed. This should be always present. +final String restoreToInstantTime = dataMetaClient.getActiveTimeline().getWriteTimeline() +.getReverseOrderedInstants().findFirst().get().getTimestamp(); + +// We cannot restore to before the oldest compaction on MDT as we don't have the basefiles before that time. +Option lastCompaction = metadataMetaClient.getCommitTimeline().filterCompletedInstants().lastInstant(); Review Comment: shouldn't this logic go into restore method in BaseHoodieWriteClient where we trigger the restore for data table. so, this is double/repeated validation right? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java: ## @@ -669,32 +669,51 @@ public void restoreToSavepoint() { * @param savepointTime Savepoint time to rollback to */ public void restoreToSavepoint(String savepointTime) { -boolean initialMetadataTableIfNecessary = config.isMetadataTableEnabled(); -if (initialMetadataTableIfNecessary) { +boolean initializeMetadataTableIfNecessary = config.isMetadataTableEnabled(); +if (initializeMetadataTableIfNecessary) { try { -// Delete metadata table directly when users trigger savepoint rollback if mdt existed and beforeTimelineStarts +// Delete metadata table directly when users trigger savepoint rollback if mdt existed and if the savePointTime is beforeTimelineStarts +// or before the oldest compaction on MDT. +// We cannot restore to before the oldest compaction on MDT as we don't have the basefiles before that time. String metadataTableBasePathStr = HoodieTableMetadata.getMetadataTableBasePath(config.getBasePath()); HoodieTableMetaClient mdtClient = HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(metadataTableBasePathStr).build(); -// Same as HoodieTableMetadataUtil#processRollbackMetadata +Option lastCompaction = mdtClient.getCommitTimeline().filterCompletedInstants().lastInstant(); Review Comment: rename to "latestMdtCompaction" ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -837,10 +840,75 @@ public void update(HoodieCleanMetadata cleanMetadata, String instantTime) { */ @Override public void update(HoodieRestoreMetadata restoreMetadata, String instantTime) { -processAndCommit(instantTime, () -> HoodieTableMetadataUtil.convertMetadataToRecords(engineContext, -metadataMetaClient.getActiveTimeline(), restoreMetadata, getRecordsGenerationParams(), instantTime, -metadata.getSyncedInstantTime()), false); -closeInternal(); +dataMetaClient.reloadActiveTimeline(); + +// Since the restore has completed on the dataset, the latest write timeline instant is the one to which the +// restore was performed. This should be always present. +final String restoreToInstantTime = dataMetaClient.getActiveTimeline().getWriteTimeline() +.getReverseOrderedInstants().findFirst().get().getTimestamp(); + +// We cannot restore to before the oldest compaction on MDT as we don't have the basefiles before that time. +Option lastCompaction = metadataMetaClient.getCommitTimeline().filterCompletedInstants().lastInstant(); +if (lastCompaction.isPresent()) { + if (HoodieTimeline.LESSER_THAN_OR_EQUALS.test(restoreToInstantTime, lastCompaction.get().getTimestamp())) { +String msg = String.format("Cannot restore MDT to %s because it is older than latest compaction at %s", restoreToInstantTime, +lastCompaction.get().getTimestamp()) + ". Please delete MDT and restore again"; +LOG.error(msg); +throw new HoodieMetadataException(msg); + } +} + +// Restore requires the existing pipelines to be shutdown. So we can safely scan the dataset to find the current +// list of files in the filesystem. +List dirInfoList = listAllPartitions(dataMetaClient); +Map dirInfoMap =
[jira] [Created] (HUDI-6303) Bump flink version to 1.16.2 and 1.17.1
Weijie Guo created HUDI-6303: Summary: Bump flink version to 1.16.2 and 1.17.1 Key: HUDI-6303 URL: https://issues.apache.org/jira/browse/HUDI-6303 Project: Apache Hudi Issue Type: Improvement Components: flink Reporter: Weijie Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571247655 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 720b8aec68410a1e48e57fc540b1885bee6a7423 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17535) * be4650d94edd3eca1a15a956b6a60e2b8ec5daca UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571242238 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17538) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] boneanxs commented on a diff in pull request #7304: [HUDI-5278] Support more conf to cluster procedure
boneanxs commented on code in PR #7304: URL: https://github.com/apache/hudi/pull/7304#discussion_r1212496811 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/procedure/TestClusteringProcedure.scala: ## @@ -385,4 +395,261 @@ class TestClusteringProcedure extends HoodieSparkProcedureTestBase { } } } + + test("Test Call run_clustering Procedure with specific instants") { +withTempDir { tmp => + val tableName = generateTableName + val basePath = s"${tmp.getCanonicalPath}/$tableName" + + spark.sql( +s""" + |create table $tableName ( + | c1 int, + | c2 string, + | c3 double + |) using hudi + | options ( + | primaryKey = 'c1', + | type = 'cow', + | hoodie.metadata.enable = 'true', + | hoodie.metadata.index.column.stats.enable = 'true', + | hoodie.enable.data.skipping = 'true', + | hoodie.datasource.write.operation = 'insert' + | ) + | location '$basePath' + """.stripMargin) + + writeRecords(2, 4, 0, basePath, Map("hoodie.avro.schema.validate" -> "false")) + spark.sql(s"call run_clustering(table => '$tableName', op => 'schedule')") + + writeRecords(2, 4, 0, basePath, Map("hoodie.avro.schema.validate" -> "false")) + spark.sql(s"call run_clustering(table => '$tableName', op => 'schedule')") + + val conf = new Configuration + val metaClient = HoodieTableMetaClient.builder.setConf(conf).setBasePath(basePath).build + val instants = metaClient.getActiveTimeline.filterPendingReplaceTimeline().getInstants.iterator().asScala.map(_.getTimestamp).toSeq + assert(2 == instants.size) + + checkExceptionContain( +s"call run_clustering(table => '$tableName', instants => '00, ${instants.head}')" + )("specific 00 instants is not exist") + metaClient.reloadActiveTimeline() + assert(0 == metaClient.getActiveTimeline.getCompletedReplaceTimeline.getInstants.size()) + assert(2 == metaClient.getActiveTimeline.filterPendingReplaceTimeline.getInstants.size()) + + writeRecords(2, 4, 0, basePath, Map("hoodie.avro.schema.validate" -> "false")) + // specific instants will not schedule new cluster plan + spark.sql(s"call run_clustering(table => '$tableName', instants => '${instants.mkString(",")}')") + metaClient.reloadActiveTimeline() + assert(2 == metaClient.getActiveTimeline.getCompletedReplaceTimeline.getInstants.size()) + assert(0 == metaClient.getActiveTimeline.filterPendingReplaceTimeline.getInstants.size()) + + // test with operator schedule + checkExceptionContain( + s"call run_clustering(table => '$tableName', instants => '00', op => 'schedule')" + )("specific instants only can be used in 'execute' op or not specific op") + + // test with operator scheduleAndExecute + checkExceptionContain( +s"call run_clustering(table => '$tableName', instants => '00', op => 'scheduleAndExecute')" + )("specific instants only can be used in 'execute' op or not specific op") + + // test with operator execute + spark.sql(s"call run_clustering(table => '$tableName', op => 'schedule')") + metaClient.reloadActiveTimeline() + val instants2 = metaClient.getActiveTimeline.filterPendingReplaceTimeline().getInstants.iterator().asScala.map(_.getTimestamp).toSeq + spark.sql(s"call run_clustering(table => '$tableName', instants => '${instants2.mkString(",")}', op => 'execute')") + metaClient.reloadActiveTimeline() + assert(3 == metaClient.getActiveTimeline.getCompletedReplaceTimeline.getInstants.size()) + assert(0 == metaClient.getActiveTimeline.filterPendingReplaceTimeline.getInstants.size()) +} + } + + test("Test Call run_clustering Procedure op") { +withTempDir { tmp => + val tableName = generateTableName + val basePath = s"${tmp.getCanonicalPath}/$tableName" + + spark.sql( +s""" + |create table $tableName ( + | c1 int, + | c2 string, + | c3 double + |) using hudi + | options ( + | primaryKey = 'c1', + | type = 'cow', + | hoodie.metadata.enable = 'true', + | hoodie.metadata.index.column.stats.enable = 'true', + | hoodie.enable.data.skipping = 'true', + | hoodie.datasource.write.operation = 'insert' + | ) + | location '$basePath' + """.stripMargin) + + writeRecords(2, 4, 0, basePath, Map("hoodie.avro.schema.validate"-> "false")) + val conf = new Configuration + val metaClient = HoodieTableMetaClient.builder.setConf(conf).setBasePath(basePath).build + assert(0 == metaClient.getActiveTimeline.getCompletedReplaceTimeline.getInstants.size()) +
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1571212001 ## CI report: * 2291d56660efdbace36ec3e52a05a41ab236fff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17501) * fa6a550cbe4c52a22314796d4926a34c04d0e6cc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17541) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1571210983 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 5d3bc78d0eab3a5a1b72ba54a0577d36bee7845b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17511) * 55ece2d5ee0d752f48d7c33cd6b8235b5ffacac5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17540) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8795: [HUDI-6258] support olap engine query mor table in table name without ro/rt suffix
hudi-bot commented on PR #8795: URL: https://github.com/apache/hudi/pull/8795#issuecomment-1571205582 ## CI report: * 2291d56660efdbace36ec3e52a05a41ab236fff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17501) * fa6a550cbe4c52a22314796d4926a34c04d0e6cc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1571205102 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 5d3bc78d0eab3a5a1b72ba54a0577d36bee7845b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17511) * 55ece2d5ee0d752f48d7c33cd6b8235b5ffacac5 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8860: Fix spark sql core flow
hudi-bot commented on PR #8860: URL: https://github.com/apache/hudi/pull/8860#issuecomment-1571201295 ## CI report: * 7dfdce4528315764dd140b5e186f1b6052c9be43 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17539) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571201237 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 720b8aec68410a1e48e57fc540b1885bee6a7423 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17535) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571201140 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * 19630df001d430cdf4ec6413c6f025e48bb84bb3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] KnightChess commented on a diff in pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
KnightChess commented on code in PR #8856: URL: https://github.com/apache/hudi/pull/8856#discussion_r1212457103 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -842,7 +842,8 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); +int totalDeleteFileSize = partitionToDeletedFilesList.stream().mapToInt(p -> p.getValue().size()).sum(); +int parallelism = Math.max(Math.min(totalDeleteFileSize, recordsGenerationParams.getBloomIndexParallelism()), 1); Review Comment: oh, I got it. I mistakenly think that the later processing logic is the same as the update, I will try fix it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8860: Fix spark sql core flow
hudi-bot commented on PR #8860: URL: https://github.com/apache/hudi/pull/8860#issuecomment-1571161421 ## CI report: * 7dfdce4528315764dd140b5e186f1b6052c9be43 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17539) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571161385 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17538) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8830: [MINOR] auto generate init client id
hudi-bot commented on PR #8830: URL: https://github.com/apache/hudi/pull/8830#issuecomment-1571161237 ## CI report: * b17cd27589692f3d8cc37865eaa7c75b12ec7141 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17516) * 065791647cb669b5d18c4c1f4e9dfcf9d4e0c9c1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17537) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8849: [UBER] Rollback enhancements
nsivabalan commented on code in PR #8849: URL: https://github.com/apache/hudi/pull/8849#discussion_r1212454576 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java: ## @@ -240,22 +244,21 @@ protected void finishRollback(HoodieInstant inflightInstant, HoodieRollbackMetad this.txnManager.beginTransaction(Option.of(inflightInstant), Option.empty()); } - // If publish the rollback to the timeline, we first write the rollback metadata - // to metadata table + // If publish the rollback to the timeline, we first write the rollback metadata to metadata table + // Then transition the inflight rollback to completed state. if (!skipTimelinePublish) { writeTableMetadata(rollbackMetadata); - } - - // Then we delete the inflight instant in the data table timeline if enabled - deleteInflightAndRequestedInstant(deleteInstants, table.getActiveTimeline(), resolvedInstant); - - // If publish the rollback to the timeline, we finally transition the inflight rollback - // to complete in the data table timeline - if (!skipTimelinePublish) { table.getActiveTimeline().transitionRollbackInflightToComplete(inflightInstant, TimelineMetadataUtils.serializeRollbackMetadata(rollbackMetadata)); LOG.info("Rollback of Commits " + rollbackMetadata.getCommitsRollback() + " is complete"); } + + // Commit to rollback instant files are deleted after the rollback commit is transitioned from inflight to completed + // If job were to fail after transitioning rollback from inflight to complete and before delete the instant files, + // then subsequent retries of the rollback for this instant will see if there is a completed rollback present for this instant + // and then directly delete the files and abort. Review Comment: can you point me to this part of the code. Where we check for completed rollback instants and clean up the files directly ? From what I know, we only check for pending rollbacks ref: getPendingRollbackInfos and re-use the rollback instant if applicable. Also, do we have a test for this ? ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestCopyOnWriteRollbackActionExecutor.java: ## @@ -328,4 +340,213 @@ private void performRollbackAndValidate(boolean isUsingMarkers, HoodieWriteConfi assertFalse(WriteMarkersFactory.get(cfg.getMarkersType(), table, commitInstant.getTimestamp()).doesMarkerDirExist()); } + + /** + * This method tests rollback of completed ingestion commits and replacecommit inflight files + * when there is another replacecommit with greater timestamp already present in the timeline. + */ + @Test + public void testRollbackWhenReplaceCommitIsPresent() throws Exception { Review Comment: can you move these into separate PRs. will help landing the PRs at diff pace. also, the PR description will be succinct. its fine for this PR. but in future ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/cluster/ClusteringTestBase.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table.action.cluster; + +import org.apache.hudi.client.SparkRDDWriteClient; +import org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator; +import org.apache.hudi.common.config.HoodieStorageConfig; +import org.apache.hudi.common.fs.ConsistencyGuardConfig; +import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion; +import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; +import org.apache.hudi.common.table.view.FileSystemViewStorageType; +import org.apache.hudi.common.testutils.HoodieTestDataGenerator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieCleanConfig; +import org.apache.hudi.config.HoodieClusteringConfig; +import
[GitHub] [hudi] nsivabalan commented on pull request #8849: [UBER] Rollback enhancements
nsivabalan commented on PR #8849: URL: https://github.com/apache/hudi/pull/8849#issuecomment-1571156439 can we please create a jira and link it in title.? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8860: Fix spark sql core flow
hudi-bot commented on PR #8860: URL: https://github.com/apache/hudi/pull/8860#issuecomment-1571153775 ## CI report: * 7dfdce4528315764dd140b5e186f1b6052c9be43 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8859: set operation to bulk insert on fresh table
hudi-bot commented on PR #8859: URL: https://github.com/apache/hudi/pull/8859#issuecomment-1571153750 ## CI report: * 641c66d64597d8d83f6234172c3d3dc2c8bc358e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8830: [MINOR] auto generate init client id
hudi-bot commented on PR #8830: URL: https://github.com/apache/hudi/pull/8830#issuecomment-1571153675 ## CI report: * b17cd27589692f3d8cc37865eaa7c75b12ec7141 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17516) * 065791647cb669b5d18c4c1f4e9dfcf9d4e0c9c1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8826: [DO NOT MERGE][HUDI-6198] Testing Spark 3.4.0 Upgrade on 0.13.1
hudi-bot commented on PR #8826: URL: https://github.com/apache/hudi/pull/8826#issuecomment-1571153635 ## CI report: * e782ae92011fc81144dd9b488e6c2bfd9d5d5906 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17536) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-5257) Spark-Sql duplicates and re-uses record keys under certain configs and use cases
[ https://issues.apache.org/jira/browse/HUDI-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler closed HUDI-5257. - Resolution: Not A Problem I think this was due to caching dataframes incorrectly > Spark-Sql duplicates and re-uses record keys under certain configs and use > cases > > > Key: HUDI-5257 > URL: https://issues.apache.org/jira/browse/HUDI-5257 > Project: Apache Hudi > Issue Type: Bug > Components: bootstrap, spark-sql >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Attachments: bad_data.txt > > > On a new table with primary key _row_key and partitioned by partition_path, > if you do a bulk insert by: > {code:java} > insertDf.createOrReplaceTempView("insert_temp_table") > spark.sql(s"set hoodie.datasource.write.operation=bulk_insert") > spark.sql("set hoodie.sql.bulk.insert.enable=true") > spark.sql("set hoodie.sql.insert.mode=non-strict") > spark.sql(s"insert into $tableName select * from insert_temp_table") {code} > you will get data with: [^bad_data.txt] where multiple records have the same > key even though they have different primary key values, and that there are > multiple files even though there are only 10 records > changing hoodie.datasource.write.operation=bulk_insert to > hoodie.datasource.write.operation=insert causes the data to be inserted > correctly. I do not know if it is using bulk insert with this change. > > However, if you use bulk insert with raw data like > {code:java} > spark.sql(s""" > | insert into $tableName values > | $values > |""".stripMargin > ){code} > where $values is something like > {code:java} > (1, 'a1', 10, 1000, "2021-01-05"), {code} > then hoodie.datasource.write.operation=bulk_insert works as expected -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] jonvex opened a new pull request, #8860: Fix spark sql core flow
jonvex opened a new pull request, #8860: URL: https://github.com/apache/hudi/pull/8860 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8826: [DO NOT MERGE][HUDI-6198] Testing Spark 3.4.0 Upgrade on 0.13.1
hudi-bot commented on PR #8826: URL: https://github.com/apache/hudi/pull/8826#issuecomment-1571147923 ## CI report: * 6b0dfada89e8aa7a083d717cbe38fff36c3ce745 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17526) * e782ae92011fc81144dd9b488e6c2bfd9d5d5906 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan closed pull request #7738: [HUDI-5569] Fixing TableFileSystemView to detect early failed commits
nsivabalan closed pull request #7738: [HUDI-5569] Fixing TableFileSystemView to detect early failed commits URL: https://github.com/apache/hudi/pull/7738 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #7738: [HUDI-5569] Fixing TableFileSystemView to detect early failed commits
nsivabalan commented on PR #7738: URL: https://github.com/apache/hudi/pull/7738#issuecomment-1571131832 Closing this in favor of https://github.com/apache/hudi/pull/8783 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8783: Maintain commit timeline even in case of long standing inflights
nsivabalan commented on code in PR #8783: URL: https://github.com/apache/hudi/pull/8783#discussion_r1212425833 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -460,19 +465,12 @@ private Stream getCommitInstantsToArchive() throws IOException { return !(firstSavepoint.isPresent() && compareTimestamps(firstSavepoint.get().getTimestamp(), LESSER_THAN_OR_EQUALS, s.getTimestamp())); } }).filter(s -> { -// Ensure commits >= the oldest pending compaction/replace commit is retained -return oldestPendingCompactionAndReplaceInstant +// oldestCommitToRetain is the highest completed commit instant that is less than the oldest inflight instant. +// By filter out any commit >= oldestCommitToRetain, we can ensure there are no gaps in the timeline +// when inflight commits are present. +return oldestCommitToRetain Review Comment: this simplifies a lot :) ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java: ## @@ -1416,6 +1418,91 @@ public void testArchivalWithMaxDeltaCommitsGuaranteeForCompaction(boolean enable } } + @Test Review Comment: can we add some java docs on what we are testing here. simple illustration of timeline might also help ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java: ## @@ -1416,6 +1418,91 @@ public void testArchivalWithMaxDeltaCommitsGuaranteeForCompaction(boolean enable } } + @Test + public void testGetCommitInstantsToArchiveDuringInflightCommits() throws Exception { +HoodieWriteConfig cfg = initTestTableAndGetWriteConfig(true, 4, 5, 5); +HoodieTestDataGenerator.createCommitFile(basePath, "100", wrapperFs.getConf()); + +HoodieTestDataGenerator.createReplaceCommitRequestedFile(basePath, "101", wrapperFs.getConf()); +HoodieTestDataGenerator.createReplaceCommitInflightFile(basePath, "101", wrapperFs.getConf()); +HoodieTestDataGenerator.createCommitFile(basePath, "102", wrapperFs.getConf()); +HoodieTestDataGenerator.createCommitFile(basePath, "103", wrapperFs.getConf()); +HoodieTestDataGenerator.createCompactionRequestedFile(basePath, "104", wrapperFs.getConf()); +HoodieTestDataGenerator.createCompactionAuxiliaryMetadata(basePath, +new HoodieInstant(State.REQUESTED, HoodieTimeline.COMPACTION_ACTION, "104"), wrapperFs.getConf()); +HoodieTestDataGenerator.createCommitFile(basePath, "105", wrapperFs.getConf()); +HoodieTestDataGenerator.createCommitFile(basePath, "106", wrapperFs.getConf()); +HoodieTestDataGenerator.createCommitFile(basePath, "107", wrapperFs.getConf()); +HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient); +HoodieTimelineArchiver archiver = new HoodieTimelineArchiver(cfg, table); + +HoodieTimeline timeline = metaClient.getActiveTimeline().getWriteTimeline();//getCommitsAndCompactionTimeline(); +assertEquals(8, timeline.countInstants(), "Loaded 8 commits and the count should match"); +boolean result = archiver.archiveIfRequired(context); +assertTrue(result); +timeline = metaClient.getActiveTimeline().reload().getWriteTimeline();//getCommitsAndCompactionTimeline(); +assertTrue(timeline.containsInstant(new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "100")), +"At least one instant before oldest pending replacecommit need to stay in the timeline"); +assertEquals(8, timeline.countInstants(), +"Since we have a pending replacecommit at 101, we should never archive any commit " ++ "after 101 and also to maintain timeline have at least one completed commit before pending commit"); +assertTrue(timeline.containsInstant(new HoodieInstant(State.INFLIGHT, HoodieTimeline.REPLACE_COMMIT_ACTION, "101")), +"Inflight replacecommit must still be present"); +assertTrue(timeline.containsInstant(new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "102")), +"Instants greater than oldest pending commit must be present"); +assertTrue(timeline.containsInstant(new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "103")), +"Instants greater than oldest pending commit must be present"); +assertTrue(timeline.containsInstant(new HoodieInstant(State.REQUESTED, HoodieTimeline.COMPACTION_ACTION, "104")), +"Instants greater than oldest pending commit must be present"); +assertTrue(timeline.containsInstant(new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "105")), +"Instants greater than oldest pending commit must be present"); +assertTrue(timeline.containsInstant(new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, "106")), +"Instants greater than oldest pending commit must be present"); +
[GitHub] [hudi] jonvex opened a new pull request, #8859: set operation to bulk insert on fresh table
jonvex opened a new pull request, #8859: URL: https://github.com/apache/hudi/pull/8859 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-157898 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * abacb628b9d61b0263f451e34daad0fea65cf496 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17534) * 720b8aec68410a1e48e57fc540b1885bee6a7423 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17535) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571106196 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 4c11d10224ebd368cdb83867ec8dbca1c56993f4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17532) * abacb628b9d61b0263f451e34daad0fea65cf496 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17534) * 720b8aec68410a1e48e57fc540b1885bee6a7423 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8858: [HUDI-6301][UBER] Support SqlFileBasedSource for DeltaStreamer
hudi-bot commented on PR #8858: URL: https://github.com/apache/hudi/pull/8858#issuecomment-1571101921 ## CI report: * ac360c703518f4f7a468c86e93f08cbcbacf22b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571101877 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * 4c11d10224ebd368cdb83867ec8dbca1c56993f4 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17532) * abacb628b9d61b0263f451e34daad0fea65cf496 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17534) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571101813 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * ffc65bc577ff5665a4b16f1f3759586fd0b01d2f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17531) * 19630df001d430cdf4ec6413c6f025e48bb84bb3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17533) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8829: [HUDI-6277][UBER] Clustering enhancements
nsivabalan commented on code in PR #8829: URL: https://github.com/apache/hudi/pull/8829#discussion_r1212410928 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java: ## @@ -57,9 +57,11 @@ protected void validateUsingQuery(String query, String prevTableSnapshot, String String queryWithPrevSnapshot = query.replaceAll(HoodiePreCommitValidatorConfig.VALIDATOR_TABLE_VARIABLE, prevTableSnapshot); String queryWithNewSnapshot = query.replaceAll(HoodiePreCommitValidatorConfig.VALIDATOR_TABLE_VARIABLE, newTableSnapshot); LOG.info("Running query on previous state: " + queryWithPrevSnapshot); -Dataset prevRows = sqlContext.sql(queryWithPrevSnapshot); +Dataset prevRows = sqlContext.sql(queryWithPrevSnapshot).cache(); +LOG.info("Total rows in prevRows " + prevRows.count()); LOG.info("Running query on new state: " + queryWithNewSnapshot); -Dataset newRows = sqlContext.sql(queryWithNewSnapshot); +Dataset newRows = sqlContext.sql(queryWithNewSnapshot).cache(); +LOG.info("Total rows in newRows " + newRows.count()); Review Comment: should we say "total rows in after state" or something. "newRows" seems not a good name ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java: ## @@ -56,9 +56,11 @@ protected void validateUsingQuery(String query, String prevTableSnapshot, String String queryWithPrevSnapshot = query.replaceAll(HoodiePreCommitValidatorConfig.VALIDATOR_TABLE_VARIABLE, prevTableSnapshot); String queryWithNewSnapshot = query.replaceAll(HoodiePreCommitValidatorConfig.VALIDATOR_TABLE_VARIABLE, newTableSnapshot); LOG.info("Running query on previous state: " + queryWithPrevSnapshot); -Dataset prevRows = sqlContext.sql(queryWithPrevSnapshot); +Dataset prevRows = sqlContext.sql(queryWithPrevSnapshot).cache(); +LOG.info("Total rows in prevRows " + prevRows.count()); LOG.info("Running query on new state: " + queryWithNewSnapshot); -Dataset newRows = sqlContext.sql(queryWithNewSnapshot); +Dataset newRows = sqlContext.sql(queryWithNewSnapshot).cache(); +LOG.info("Total rows in newRows " + newRows.count()); Review Comment: same here. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java: ## @@ -90,12 +90,16 @@ public R getResult() { } } - static Function, HoodieInsertValueGenResult> getTransformer(Schema schema, + /** + * Transformer function to help transform a HoodieRecord. This transformer is used by BufferedIterator to offload some + * expensive operations of transformation to the reader thread. + */ + public Function, HoodieInsertValueGenResult> getTransformer(Schema schema, HoodieWriteConfig writeConfig) { return getTransformerInternal(schema, writeConfig); } - private static Function, HoodieInsertValueGenResult> getTransformerInternal(Schema schema, + public static Function, HoodieInsertValueGenResult> getTransformerInternal(Schema schema, Review Comment: does this need to be static then ? ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java: ## @@ -73,7 +79,23 @@ public void validate(String instantTime, HoodieWriteMetadata writeResult, Dat try { validateRecordsBeforeAndAfter(before, after, getPartitionsModified(writeResult)); } finally { - LOG.info(getClass() + " validator took " + timer.endTimer() + " ms"); + long duration = timer.endTimer(); + LOG.info(getClass() + " validator took " + duration + " ms" + ", metrics on? " + getWriteConfig().isMetricsOn()); + publishRunStats(instantTime, duration); +} + } + + /** + * Publish pre-commit validator run stats for a given commit action. + */ + private void publishRunStats(String instantTime, long duration) { +// Record validator duration metrics. +if (getWriteConfig().isMetricsOn()) { + HoodieTableMetaClient metaClient = getHoodieTable().getMetaClient(); + Option currentInstant = metaClient.getActiveTimeline() + .findInstantsAfterOrEquals(instantTime, 1) + .firstInstant(); + metrics.reportMetrics(currentInstant.get().getAction(), getClass().getSimpleName(), duration); Review Comment: not too strong on the suggestion. but we can fetch the write operation type from HoodieCommitMetadata which is within HoodieWriteMetadata. we can avoid loading the timeline if we go that route. ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java: ## @@ -124,8 +125,15 @@ private HoodieData> clusteringHandleUpdate(HoodieData>>
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571074411 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * ef606cf4ab9b726ee9c9af333145b7e8079fe20a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17522) * 4c11d10224ebd368cdb83867ec8dbca1c56993f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17532) * abacb628b9d61b0263f451e34daad0fea65cf496 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8851: [HUDI-6281] Comprehensive schema evolution supports column change with a default value
hudi-bot commented on PR #8851: URL: https://github.com/apache/hudi/pull/8851#issuecomment-1571067264 ## CI report: * 2db6852dd391973eab275dc7ef70c02bfbc5f652 UNKNOWN * 60c1399ac012bc61421f3bb1feb208decbcb6b6a UNKNOWN * ef606cf4ab9b726ee9c9af333145b7e8079fe20a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17522) * 4c11d10224ebd368cdb83867ec8dbca1c56993f4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571067210 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * ffc65bc577ff5665a4b16f1f3759586fd0b01d2f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17531) * 19630df001d430cdf4ec6413c6f025e48bb84bb3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571061554 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * ffc65bc577ff5665a4b16f1f3759586fd0b01d2f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17531) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] forest455 closed issue #8841: [SUPPORT] presto querying hudi COW table can't get latest snapshot
forest455 closed issue #8841: [SUPPORT] presto querying hudi COW table can't get latest snapshot URL: https://github.com/apache/hudi/issues/8841 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571020939 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * bf40585b9b516ff7700d78bd4fc2cc9b89854f3e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17527) * ffc65bc577ff5665a4b16f1f3759586fd0b01d2f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17531) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6302) Fix Record Level Index for update partition path config values
sivabalan narayanan created HUDI-6302: - Summary: Fix Record Level Index for update partition path config values Key: HUDI-6302 URL: https://issues.apache.org/jira/browse/HUDI-6302 Project: Apache Hudi Issue Type: Improvement Components: index, writer-core Reporter: sivabalan narayanan We are adding RLI in 0.14.0. We also need to support update partition path config. If its set to true, we might have to migrate the record to new partition. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1571006226 ## CI report: * 23a574b64681c95c17db47d4c63c86d7e0215ba9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17528) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1571006019 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * bf40585b9b516ff7700d78bd4fc2cc9b89854f3e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17527) * ffc65bc577ff5665a4b16f1f3759586fd0b01d2f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8858: [HUDI-6301][UBER] Support SqlFileBasedSource for DeltaStreamer
hudi-bot commented on PR #8858: URL: https://github.com/apache/hudi/pull/8858#issuecomment-1570948967 ## CI report: * ac360c703518f4f7a468c86e93f08cbcbacf22b4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17530) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6301) Support SqlFileBasedSource for DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6301: - Labels: pull-request-available (was: ) > Support SqlFileBasedSource for DeltaStreamer > > > Key: HUDI-6301 > URL: https://issues.apache.org/jira/browse/HUDI-6301 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Surya Prasanna Yalla >Assignee: Surya Prasanna Yalla >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8858: [HUDI-6301][UBER] Support SqlFileBasedSource for DeltaStreamer
hudi-bot commented on PR #8858: URL: https://github.com/apache/hudi/pull/8858#issuecomment-1570939581 ## CI report: * ac360c703518f4f7a468c86e93f08cbcbacf22b4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6301) Support SqlFileBasedSource for DeltaStreamer
Surya Prasanna Yalla created HUDI-6301: -- Summary: Support SqlFileBasedSource for DeltaStreamer Key: HUDI-6301 URL: https://issues.apache.org/jira/browse/HUDI-6301 Project: Apache Hudi Issue Type: New Feature Reporter: Surya Prasanna Yalla Assignee: Surya Prasanna Yalla -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] suryaprasanna opened a new pull request, #8858: [UBER] Support SqlFileBasedSource for DeltaStreamer
suryaprasanna opened a new pull request, #8858: URL: https://github.com/apache/hudi/pull/8858 ### Change Logs Introducing a new Source type called SqlFileBasedSource which can be used to create complex SQL statements by and caches datasets in-memory and running complex joins on them. ### Impact This is a new feature should not impact existing use cases. ### Risk level (write none, low medium or high below) None. ### Documentation Update Some documentation is added within the classes, but since new configs and source are introduced it requires updates to the ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org