[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028678659

## CI report:

* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
* 540459a7e5a71f01b8e052424ccabad5a25b840e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5698)

Bot commands:
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028676981

## CI report:

* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
* 540459a7e5a71f01b8e052424ccabad5a25b840e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028676981

## CI report:

* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
* 540459a7e5a71f01b8e052424ccabad5a25b840e UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028584032

## CI report:

* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1028335503

## CI report:

* 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
* d78e61c34acac7e23477f196388076cdd822dd69 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5679) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5681)
[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1028676669

## CI report:

* 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
* d78e61c34acac7e23477f196388076cdd822dd69 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5679) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5681)
* e48904554517673727ee5a0cb4055579f39e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028601646

## CI report:

* 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028572870

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
* 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798204981

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

import static org.apache.hudi.TypeUtils.unsafeCast;
import static org.apache.hudi.common.util.ValidationUtils.checkState;

public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {

  private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);

  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
    if (fileSplits.isEmpty()) {
      return new InputSplit[0];
    }

    FileSplit fileSplit = fileSplits.get(0);

    // Pre-process table-config to fetch virtual key info
    Path partitionPath = fileSplit.getPath().getParent();
    HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath);

    Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient);

    // NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
    HoodieInstant latestCommitInstant =
        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();

    InputSplit[] finalSplits = fileSplits.stream()
        .map(split -> {
          // There are 4 types of splits we could have to handle here:
          //    - {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file,
          //      but does NOT have any log files appended (convert it to {@code RealtimeBootstrapBaseFileSplit})
          //    - {@code RealtimeBootstrapBaseFileSplit}: in case base file does have associated bootstrap file
          //      and does have log files appended
          //    - {@code BaseFileWithLogsSplit}: in case base file does NOT have associated bootstrap file
          //      and does have log files appended
          //    - {@code FileSplit}: in case Hive passed down non-Hudi path
          if (split instanceof RealtimeBootstrapBaseFileSplit) {
            return split;
          } else if (split instanceof BootstrapBaseFileSplit) {
            BootstrapBaseFileSplit bootstrapBaseFileSplit = unsafeCast(split);
            return createRealtimeBoostrapBaseFileSplit(
                bootstrapBaseFileSplit,
                metaClient.getBasePath(),
                Collections.emptyList(),
                latestCommitInstant.getTimestamp(),
                false);
          } else if (split instanceof BaseFileWithLogsSplit) {
            BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);
```

Review comment: Yes, it's in sync. However, you brought up a very good point that the instant shouldn't actually be set here. This will be cleaned up in subsequent PRs, where `HoodieRealtimeFileSplit` will be merged with `BaseWithLogFilesSplit`.
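The review above concerns converting four kinds of Hive input splits with `instanceof` checks, ordered from most to least specific. The pattern can be sketched in a self-contained way with hypothetical stand-in classes (these are not Hudi's real split types, which carry paths, log files, and instants):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SplitDispatchSketch {
  // Hypothetical stand-ins for the Hive/Hudi split hierarchy.
  static class FileSplit {}
  static class BootstrapBaseFileSplit extends FileSplit {}
  static class RealtimeBootstrapBaseFileSplit extends BootstrapBaseFileSplit {}
  static class BaseFileWithLogsSplit extends FileSplit {}
  static class RealtimeSplit extends FileSplit {}

  static FileSplit toRealtimeSplit(FileSplit split) {
    // Order matters: RealtimeBootstrapBaseFileSplit extends BootstrapBaseFileSplit,
    // so the more specific check must come first or it would be wrapped twice.
    if (split instanceof RealtimeBootstrapBaseFileSplit) {
      return split; // already realtime-aware: pass through unchanged
    } else if (split instanceof BootstrapBaseFileSplit) {
      return new RealtimeBootstrapBaseFileSplit(); // wrap with realtime info
    } else if (split instanceof BaseFileWithLogsSplit) {
      return new RealtimeSplit(); // attach log files + latest commit instant
    }
    return split; // non-Hudi path handed down by Hive: leave untouched
  }

  public static void main(String[] args) {
    List<FileSplit> splits =
        List.of(new BootstrapBaseFileSplit(), new BaseFileWithLogsSplit(), new FileSplit());
    List<String> kinds = splits.stream()
        .map(s -> toRealtimeSplit(s).getClass().getSimpleName())
        .collect(Collectors.toList());
    System.out.println(kinds); // [RealtimeBootstrapBaseFileSplit, RealtimeSplit, FileSplit]
  }
}
```

The pass-through default in the last branch mirrors the "Hive passed down non-Hudi path" case in the comment thread: unknown split types are returned unmodified rather than rejected.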
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798204981

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

Review comment: It does
[jira] [Closed] (HUDI-3337) ParquetUtils fails extracting Parquet Column Range Metadata
[ https://issues.apache.org/jira/browse/HUDI-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin closed HUDI-3337.
---------------------------------
    Resolution: Fixed

> ParquetUtils fails extracting Parquet Column Range Metadata
> -----------------------------------------------------------
>
>                 Key: HUDI-3337
>                 URL: https://issues.apache.org/jira/browse/HUDI-3337
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
> [~manojpec] discovered the following issue while testing MT flows, with {{TestHoodieBackedMetadata#testTableOperationsWithMetadataIndex}} failing with:
>
> {code:java}
> 17400 [Executor task launch worker for task 240] ERROR org.apache.hudi.metadata.HoodieTableMetadataUtil - Failed to read column stats for /var/folders/t7/kr69rlvx5rdd824m61zjqkjrgn/T/junit2402861080324269156/dataset/2016/03/15/44396fda-48db-4d10-9f47-275c39317115-0_0-101-234_003.parquet
> java.lang.ClassCastException: org.apache.parquet.io.api.Binary$ByteArrayBackedBinary cannot be cast to java.lang.Integer
>   at org.apache.hudi.common.util.ParquetUtils.convertToNativeJavaType(ParquetUtils.java:369)
>   at org.apache.hudi.common.util.ParquetUtils.lambda$null$2(ParquetUtils.java:305)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at org.apache.hudi.common.util.ParquetUtils.readRangeFromParquetMetadata(ParquetUtils.java:313)
>   at org.apache.hudi.metadata.HoodieTableMetadataUtil.getColumnStats(HoodieTableMetadataUtil.java:878)
>   at org.apache.hudi.metadata.HoodieTableMetadataUtil.translateWriteStatToColumnStats(HoodieTableMetadataUtil.java:858)
>   at org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$createColumnStatsFromWriteStats$7e2376a$1(HoodieTableMetadataUtil.java:819)
>   at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:134)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at ...
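The `ClassCastException` above comes from casting a statistics value to the column's expected boxed type without checking its runtime class: a binary-backed value cannot be cast straight to `java.lang.Integer`. The defensive pattern is to check the runtime type before casting. A minimal, self-contained sketch with a hypothetical helper (not Hudi's actual `convertToNativeJavaType`, and using a plain `byte[]` in place of Parquet's `Binary`):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StatsConversionSketch {
  // Convert a raw statistics value to the expected native Java type,
  // guarding the cast instead of assuming the representation matches.
  static Object convertToNativeJavaType(Object statsValue, Class<?> expectedType) {
    if (expectedType.isInstance(statsValue)) {
      return statsValue; // representation already matches: no conversion needed
    }
    if (statsValue instanceof byte[] && expectedType == Integer.class) {
      // Binary-backed int: decode explicitly (big-endian here, for illustration)
      return ByteBuffer.wrap((byte[]) statsValue).getInt();
    }
    if (statsValue instanceof byte[] && expectedType == String.class) {
      return new String((byte[]) statsValue, StandardCharsets.UTF_8);
    }
    // Fail with a descriptive error instead of a bare ClassCastException
    throw new IllegalArgumentException(
        "Unexpected stats value type " + statsValue.getClass().getName()
            + " for expected type " + expectedType.getName());
  }

  public static void main(String[] args) {
    byte[] raw = ByteBuffer.allocate(4).putInt(42).array();
    System.out.println(convertToNativeJavaType(raw, Integer.class)); // 42
    System.out.println(convertToNativeJavaType("abc", String.class)); // abc
  }
}
```

The real fix in Hudi dispatches on the Parquet logical/primitive type of the column; the sketch only illustrates why the unguarded cast fails and what a type-checked conversion looks like.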
[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028584032

## CI report:

* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028556982

## CI report:

* 891d9658daa099eb50560741086aac23924e3600 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
* 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
[GitHub] [hudi] yihua commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
yihua commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798194069

## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.utilities;

import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;
import org.apache.hadoop.fs.Path;
import org.apache.hudi.async.HoodieAsyncService;
import org.apache.hudi.client.common.HoodieSparkEngineContext;
import org.apache.hudi.common.config.HoodieMetadataConfig;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.fs.FSUtils;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.common.model.HoodieBaseFile;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.view.FileSystemViewManager;
import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieValidationException;

import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

/**
 * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem.
 *
 * You can specify the running mode of the validator through `--mode`.
 * There are 2 modes of the {@link HoodieMetadataTableValidator}:
 * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes (default).
 *
 * Example command:
 * ```
 * spark-submit \
 *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
 *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
 *  --master spark://:7077 \
 *  --driver-memory 1g \
 *  --executor-memory 1g \
 *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
 *  --base-path basePath \
 *  --min-validate-interval-seconds 60 \
 *  --mode CONTINUOUS
 * ```
 *
 * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
 *
 * Example command:
 * ```
 * spark-submit \
 *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
 *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
 *  --master spark://:7077 \
 *  --driver-memory 1g \
 *  --executor-memory 1g \
 *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
 *  --base-path basePath \
 *  --min-validate-interval-seconds 60 \
 *  --mode ONCE
 * ```
 */
public class HoodieMetadataTableValidator {

  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);

  // Spark context
  private transient JavaSparkContext jsc;
  // config
  private Config cfg;
  // Properties with source, hoodie client, key generator etc.
  private TypedProperties props;

  private HoodieTableMetaClient metaClient;

  protected transient Option asyncMetadataTableValidateService;

  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
    this.metaClient = metaClient;
  }

  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
    this.jsc = jsc;
    this.cfg = cfg;

    this.props = cfg.propsFilePath == null
        ? UtilHelpers.buildProperties(cfg.configs)
```
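Both validator modes boil down to one comparison pass over the two listings, run either once or on a schedule. The core check can be sketched in a self-contained way, assuming hypothetical listing suppliers in place of the real metadata-table and filesystem APIs (not Hudi's actual code):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

public class ListingValidatorSketch {
  // Compare partition listings from two sources, order-insensitively;
  // throw on any mismatch, analogous to raising HoodieValidationException.
  static void validate(Supplier<List<String>> metadataListing, Supplier<List<String>> fsListing) {
    Set<String> fromMetadata = new TreeSet<>(metadataListing.get());
    Set<String> fromFs = new TreeSet<>(fsListing.get());
    if (!fromMetadata.equals(fromFs)) {
      throw new IllegalStateException(
          "Listing mismatch: metadata=" + fromMetadata + " fs=" + fromFs);
    }
  }

  public static void main(String[] args) {
    // Same partitions in different order: passes
    validate(() -> List.of("2022/01/01", "2022/01/02"),
             () -> List.of("2022/01/02", "2022/01/01"));
    // Filesystem has a partition the metadata table is missing: fails
    try {
      validate(() -> List.of("2022/01/01"),
               () -> List.of("2022/01/01", "2022/01/02"));
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

In CONTINUOUS mode this check would simply be re-run every `--min-validate-interval-seconds`; ONCE mode runs it a single time and exits with the result.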
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028572870

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
* 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028571985

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
* 071c6180b5023f782da229552a6d3f63d1e4a67b UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028486554

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028571985

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
* 071c6180b5023f782da229552a6d3f63d1e4a67b UNKNOWN
[jira] [Commented] (HUDI-3362) Hudi 0.8.0 cannot rollback MoR table
[ https://issues.apache.org/jira/browse/HUDI-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486197#comment-17486197 ]

sivabalan narayanan commented on HUDI-3362:
-------------------------------------------

[~vinish_jail97]: we might need to test the savepoint restore w/ clustering sooner. Maybe there are some gaps.

> Hudi 0.8.0 cannot rollback MoR table
> ------------------------------------
>
>                 Key: HUDI-3362
>                 URL: https://issues.apache.org/jira/browse/HUDI-3362
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Istvan Darvas
>            Priority: Blocker
>         Attachments: hoodie.zip, rollback_20220131215514.txt, rollback_log.txt, rollback_log_v2.txt
>
> Hi Guys,
>
> Environment: AWS EMR 6.4 / Hudi v0.8.0
> Problem: I have a MoR table which is ingested by DeltaStreamer (batch style: every 5 minutes from Kafka), and after a certain time DeltaStreamer stops working with a message like this:
>
> {{diagnostics: User class threw exception: org.apache.hudi.exception.HoodieRollbackException: Found commits after time :20220131215051, please rollback greater commits first}}
>
> It is usually a replace commit; I am pretty sure of this.
> I have these commits in the timeline:
>
> 20220131214354 <- before
> 20220131215051 <- error message
> 20220131215514 <- after
>
> So, as suggested to me, I tried to roll back with the following steps in hudi-cli:
> 1.) connect --path s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep / SUCCESS
> 2.) savepoint create --commit 20220131214354 --sparkMaster local[2] / SUCCESS
> 3.) savepoint rollback --savepoint 20220131214354 --sparkMaster local[2] / FAILED
> 4.) savepoint create --commit 20220131215514 --sparkMaster local[2] / SUCCESS
> 5.) savepoint rollback --savepoint 20220131215514 --sparkMaster local[2] / FAILED
>
> Long story short, when I run into a situation like this I am not able to solve it with the known methods ;) - My use case is a work in progress, but I cannot go to prod with an issue like this.
>
> My question: what would be the right steps / commands to solve an issue like this and be able to restart DeltaStreamer again?
>
> This table does not have dimension data, so I am happy to share the whole table if someone is curious (if that is needed or would be helpful, let's talk in a private mail / Slack about the sharing). ~15 GB ;) It was stopped after a few runs, actually after the 1st clustering.
>
> I use this clustering config in DeltaStreamer:
> hoodie.clustering.inline=true
> hoodie.clustering.inline.enabled=true
> hoodie.clustering.inline.max.commits=36
> hoodie.clustering.plan.strategy.sort.columns=correlation_id
> hoodie.clustering.plan.strategy.daybased.lookback.partitions=7
> hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
> hoodie.clustering.plan.strategy.small.file.limit=134217728
> hoodie.clustering.plan.strategy.max.bytes.per.group=671088640
>
> I hope there is someone who can help me tackle this, because if I am able to solve this manually I would be confident going to prod.
> So thanks in advance,
> Darvi
> Slack Hudi: istvan darvas / U02NTACPHPU

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
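The `HoodieRollbackException` quoted above reflects a timeline invariant: an instant can only be rolled back if no completed instant is newer than it, so rollbacks must proceed newest-first. A toy illustration of that check, using the timestamps from the report (hypothetical code, not Hudi's implementation):

```java
import java.util.List;

public class RollbackOrderSketch {
  // Rolling back instant `target` is only allowed when no completed instant
  // on the timeline is greater (instant timestamps sort lexicographically).
  static void checkRollbackAllowed(List<String> completedInstants, String target) {
    String latest = completedInstants.stream().max(String::compareTo).orElseThrow();
    if (latest.compareTo(target) > 0) {
      throw new IllegalStateException(
          "Found commits after time :" + target + ", please rollback greater commits first");
    }
  }

  public static void main(String[] args) {
    List<String> timeline = List.of("20220131214354", "20220131215051", "20220131215514");
    checkRollbackAllowed(timeline, "20220131215514"); // latest instant: allowed
    try {
      checkRollbackAllowed(timeline, "20220131214354"); // older instant: blocked
    } catch (IllegalStateException e) {
      System.out.println("blocked: " + e.getMessage());
    }
  }
}
```

This is why step 3 in the report fails: 20220131214354 has newer completed instants (including the replacecommit) still on the timeline, and those would have to be rolled back first.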
[jira] [Assigned] (HUDI-3362) Hudi 0.8.0 cannot rollback MoR table
[ https://issues.apache.org/jira/browse/HUDI-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-3362: - Assignee: sivabalan narayanan > Hudi 0.8.0 cannot rollback MoR table > > > Key: HUDI-3362 > URL: https://issues.apache.org/jira/browse/HUDI-3362 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Istvan Darvas >Assignee: sivabalan narayanan >Priority: Blocker -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] zhangyue19921010 commented on pull request #4346: [HUDI-3045] New clustering regex match config to choose partitions when building clustering plan
zhangyue19921010 commented on pull request #4346: URL: https://github.com/apache/hudi/pull/4346#issuecomment-1028557537 Sure, pick it up -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot commented on pull request #4721: URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028556982 ## CI report: * 891d9658daa099eb50560741086aac23924e3600 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669) * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot removed a comment on pull request #4721: URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028555830 ## CI report: * 891d9658daa099eb50560741086aac23924e3600 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669) * 86b335a77fb92cf34cb5e653694be626f6c7eba4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
zhangyue19921010 commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798177169 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## @@ -0,0 +1,458 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.async.HoodieAsyncService; +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.FileSlice; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.view.FileSystemViewManager; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaSparkContext; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Comparator; +import java.util.List; +import java.util.Objects; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.stream.Collectors; + +/** + * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem + * + * You can specify the running mode of the validator through `--mode`. + * There are 2 modes of the {@link HoodieMetadataTableValidator}: + * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default). 
+ * + * Example command: + * ``` + * spark-submit \ + * --class org.apache.hudi.utilities.HoodieMetadataTableValidator \ + * --packages org.apache.spark:spark-avro_2.11:2.4.4 \ + * --master spark://:7077 \ + * --driver-memory 1g \ + * --executor-memory 1g \ + * $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \ + * --base-path basePath \ + * --min-validate-interval-seconds 60 \ + * --mode CONTINUOUS + * ``` + * + * + * You can specify the running mode of the validator through `--mode`. + * There are 2 modes of the {@link HoodieMetadataTableValidator}: + * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once. + * + * Example command: + * ``` + * spark-submit \ + * --class org.apache.hudi.utilities.HoodieMetadataTableValidator \ + * --packages org.apache.spark:spark-avro_2.11:2.4.4 \ + * --master spark://:7077 \ + * --driver-memory 1g \ + * --executor-memory 1g \ + * $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \ + * --base-path basePath \ + * --min-validate-interval-seconds 60 \ + * --mode ONCE + * ``` + * + */ +public class HoodieMetadataTableValidator { + + private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class); + + // Spark context + private transient JavaSparkContext jsc; + // config + private Config cfg; + // Properties with source, hoodie client, key generator etc. + private TypedProperties props; + + private HoodieTableMetaClient metaClient; + + protected transient Option asyncMetadataTableValidateService; + + public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) { +this.metaClient = metaClient; + } + + public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) { +this.jsc = jsc; +this.cfg = cfg; + +this.props = cfg.propsFilePath == null +?
[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot commented on pull request #4721: URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028555830 ## CI report: * 891d9658daa099eb50560741086aac23924e3600 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669) * 86b335a77fb92cf34cb5e653694be626f6c7eba4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator
hudi-bot removed a comment on pull request #4721: URL: https://github.com/apache/hudi/pull/4721#issuecomment-1027541190 ## CI report: * 891d9658daa099eb50560741086aac23924e3600 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
zhangyue19921010 commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798176659 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ##
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
zhangyue19921010 commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798176560 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## Review comment: Sure, changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
zhangyue19921010 commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798176463 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## Review comment: Changed. Just two kinds of example now: 1. --continuous + --min-validate-interval-seconds. 2. default, which will validate once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
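At its core, the validator under review compares two listings — one read from the metadata table, one from a direct filesystem scan — and raises a validation error on any mismatch. The following is a minimal, self-contained Java sketch of that comparison only; the class and method names are illustrative stand-ins, not Hudi's actual API, and the real validator throws HoodieValidationException rather than a JDK exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ListingValidationSketch {

    // Compare two partition (or file) listings irrespective of order;
    // throw if they diverge, analogous to the validator's failure path.
    static void validateListings(List<String> fromMetadataTable, List<String> fromFileSystem) {
        List<String> mdt = new ArrayList<>(fromMetadataTable);
        List<String> fs = new ArrayList<>(fromFileSystem);
        Collections.sort(mdt);
        Collections.sort(fs);
        if (!mdt.equals(fs)) {
            throw new IllegalStateException(
                "Validation failed: metadata table lists " + mdt + " but filesystem lists " + fs);
        }
    }

    public static void main(String[] args) {
        List<String> fromMdt = Arrays.asList("2022/01/30", "2022/01/31");
        List<String> fromFs = Arrays.asList("2022/01/31", "2022/01/30");
        // Same partitions in a different order: the sorted comparison passes.
        validateListings(fromMdt, fromFs);
        System.out.println("listings match");
    }
}
```

Per the review discussion, the real tool either re-runs this comparison on a schedule (`--continuous` with `--min-validate-interval-seconds`) or performs it once by default and exits.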
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028547091 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028522181 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (819e801 -> d681824)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from 819e801 [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716) add d681824 [HUDI-3337] Fixing Parquet Column Range metadata extraction (#4705) No new revisions were added by this update. Summary of changes: .../index/columnstats/ColumnStatsIndexHelper.java | 11 +- .../common/model/HoodieColumnRangeMetadata.java| 10 +- .../org/apache/hudi/common/util/ParquetUtils.java | 45 +++- ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json | 10 + ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json | 10 + ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json | 10 + ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json | 10 + ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json | 10 + ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json | 10 + ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json | 10 + ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json | 10 + .../index/zorder/z-index-table-merged.json | 16 +- .../test/resources/index/zorder/z-index-table.json | 8 +- .../hudi/functional/TestColumnStatsIndex.scala | 246 - .../hudi/functional/TestLayoutOptimization.scala | 18 -- 15 files changed, 323 insertions(+), 111 deletions(-) create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-0-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-1-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-2-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-3-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json create mode 100644 
hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-0-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-1-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-2-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-3-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json
[GitHub] [hudi] nsivabalan merged pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
nsivabalan merged pull request #4705: URL: https://github.com/apache/hudi/pull/4705 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
nsivabalan commented on a change in pull request #4556: URL: https://github.com/apache/hudi/pull/4556#discussion_r798155468 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java ## @@ -65,11 +65,70 @@ import java.util.stream.Collectors; import java.util.stream.Stream; +import static org.apache.hudi.TypeUtils.unsafeCast; +import static org.apache.hudi.common.util.ValidationUtils.checkState; + public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils { private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class); - public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) { + public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException { +if (fileSplits.isEmpty()) { + return new InputSplit[0]; +} + +FileSplit fileSplit = fileSplits.get(0); + +// Pre-process table-config to fetch virtual key info +Path partitionPath = fileSplit.getPath().getParent(); +HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath); + +Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient); + +// NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase} +HoodieInstant latestCommitInstant = + metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get(); + +InputSplit[] finalSplits = fileSplits.stream() + .map(split -> { +// There are 4 types of splits we could have to handle here: +//- {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file, +// but does NOT have any log files appended (convert it to {@code RealtimeBootstrapBaseFileSplit}) +//- {@code RealtimeBootstrapBaseFileSplit}: in case base file does have associated bootstrap file +// and does have log files appended +//- {@code BaseFileWithLogsSplit}: in case base file does NOT have associated bootstrap file +// and does have log files appended; +//- {@code FileSplit}: in case Hive passed down non-Hudi path +if (split instanceof RealtimeBootstrapBaseFileSplit) { + return split; +} else if (split instanceof BootstrapBaseFileSplit) { + BootstrapBaseFileSplit bootstrapBaseFileSplit = unsafeCast(split); + return createRealtimeBoostrapBaseFileSplit( + bootstrapBaseFileSplit, + metaClient.getBasePath(), + Collections.emptyList(), + latestCommitInstant.getTimestamp(), + false); +} else if (split instanceof BaseFileWithLogsSplit) { + BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split); Review comment: Will the maxCommitTime in baseFileSplit be in sync with the latestCommitInstant computed at L89? Prior to this patch, we used the latestCommitInstant computed here, whereas now we just reuse the one that comes from BaseFileWithLogsSplit. Just wanted to confirm, as this code is new to me. ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java ## @@ -65,11 +65,70 @@ import java.util.stream.Collectors; import java.util.stream.Stream; +import static org.apache.hudi.TypeUtils.unsafeCast; +import static org.apache.hudi.common.util.ValidationUtils.checkState; + public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils { private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class); - public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) { + public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException { Review comment: This refactoring makes total sense, assuming each FileSplit corresponds to one FileSlice and there won't be a case where multiple FileSplits can store info about a single FileSlice. Thanks for doing this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
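The four-way split dispatch quoted above hinges on testing the more specific type first. The following standalone sketch uses hypothetical stand-in classes (the real Hudi split types carry file, offset, and log-file state) and assumes, as the order of the checks implies, that `RealtimeBootstrapBaseFileSplit` extends `BootstrapBaseFileSplit`:

```java
// Hypothetical stand-ins for the Hudi split types quoted above; the real
// classes live under org.apache.hudi.hadoop and carry much more state.
class FileSplit {}
class BootstrapBaseFileSplit extends FileSplit {}
// Assumed subtype relationship, implied by the order of the instanceof checks.
class RealtimeBootstrapBaseFileSplit extends BootstrapBaseFileSplit {}
class BaseFileWithLogsSplit extends FileSplit {}

class SplitDispatch {
  static String classify(FileSplit split) {
    // The most specific subtype must be tested first: every
    // RealtimeBootstrapBaseFileSplit is also a BootstrapBaseFileSplit, so
    // swapping the first two branches would make the first one unreachable.
    if (split instanceof RealtimeBootstrapBaseFileSplit) {
      return "bootstrap + log files";    // passed through unchanged
    } else if (split instanceof BootstrapBaseFileSplit) {
      return "bootstrap, no log files";  // converted to a realtime split
    } else if (split instanceof BaseFileWithLogsSplit) {
      return "log files, no bootstrap";
    } else {
      return "non-Hudi path from Hive";  // plain FileSplit
    }
  }
}
```

Reversing the first two `instanceof` branches is the classic bug this ordering avoids; the compiler does not warn about the unreachable subtype branch.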
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028522181 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028513791 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
alexeykudinkin commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028521365 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot commented on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028521054 ## CI report: * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot removed a comment on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028469893 ## CI report: * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690) * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028462344 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689) * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028513791 ## CI report: * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests
Ethan Guo created HUDI-3366: --- Summary: Remove unnecessary hardcoded logic of disabling metadata table in tests Key: HUDI-3366 URL: https://issues.apache.org/jira/browse/HUDI-3366 Project: Apache Hudi Issue Type: Task Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests
[ https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-3366: Fix Version/s: 0.11.0 > Remove unnecessary hardcoded logic of disabling metadata table in tests > --- > > Key: HUDI-3366 > URL: https://issues.apache.org/jira/browse/HUDI-3366 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests
[ https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-3366: Priority: Blocker (was: Major) > Remove unnecessary hardcoded logic of disabling metadata table in tests > --- > > Key: HUDI-3366 > URL: https://issues.apache.org/jira/browse/HUDI-3366 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests
[ https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-3366: --- Assignee: Ethan Guo > Remove unnecessary hardcoded logic of disabling metadata table in tests > --- > > Key: HUDI-3366 > URL: https://issues.apache.org/jira/browse/HUDI-3366 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot commented on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028504428 ## CI report: * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot removed a comment on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028459357 ## CI report: * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
yihua commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798125748 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## @@ -0,0 +1,458 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.async.HoodieAsyncService; +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.FileSlice; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.view.FileSystemViewManager; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaSparkContext; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Comparator; +import java.util.List; +import java.util.Objects; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.stream.Collectors; + +/** + * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem + * + * You can specify the running mode of the validator through `--mode`. + * There are 2 modes of the {@link HoodieMetadataTableValidator}: + * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default). 
+ * + * Example command: + * ``` + * spark-submit \ + * --class org.apache.hudi.utilities.HoodieMetadataTableValidator \ + * --packages org.apache.spark:spark-avro_2.11:2.4.4 \ + * --master spark://:7077 \ + * --driver-memory 1g \ + * --executor-memory 1g \ + * $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \ + * --base-path basePath \ + * --min-validate-interval-seconds 60 \ + * --mode CONTINUOUS + * ``` + * + * + * You can specify the running mode of the validator through `--mode`. + * There are 2 modes of the {@link HoodieMetadataTableValidator}: + * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once. + * + * Example command: + * ``` + * spark-submit \ + * --class org.apache.hudi.utilities.HoodieMetadataTableValidator \ + * --packages org.apache.spark:spark-avro_2.11:2.4.4 \ + * --master spark://:7077 \ + * --driver-memory 1g \ + * --executor-memory 1g \ + * $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \ + * --base-path basePath \ + * --min-validate-interval-seconds 60 \ + * --mode ONCE + * ``` + * + */ +public class HoodieMetadataTableValidator { + + private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class); + + // Spark context + private transient JavaSparkContext jsc; + // config + private Config cfg; + // Properties with source, hoodie client, key generator etc. + private TypedProperties props; + + private HoodieTableMetaClient metaClient; + + protected transient Option asyncMetadataTableValidateService; + + public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) { +this.metaClient = metaClient; + } + + public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) { +this.jsc = jsc; +this.cfg = cfg; + +this.props = cfg.propsFilePath == null +? UtilHelpers.buildProperties(cfg.configs) +
[GitHub] [hudi] yihua commented on issue #4666: [SUPPORT] Table downgrade fails to delete non-existing file
yihua commented on issue #4666: URL: https://github.com/apache/hudi/issues/4666#issuecomment-1028487024 > hey @ganczarek : looks like there is a bug https://issues.apache.org/jira/browse/HUDI-3346. we will work on the fix. should be straightforward. At least in this case, manually deleting the commit meta file just for this instant should be fine. Here is what could have resulted in this. > > Just before downgrade, a commit was started, but before going into inflight or before a single marker file could be created, the process crashed. And so no marker dir was created for this commit. The downgrade code missed checking for existence in one place (but there are other places where this check is made) and so it failed. > > I have created a tracking [ticket](https://issues.apache.org/jira/browse/HUDI-3346) here. We are good to close this. Sorry, I missed this. There should be a check for the marker directory before trying to delete it, which was missing before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
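The fix described above amounts to guarding the delete with an existence check. A minimal sketch of that guard, using `java.nio.file` in place of Hadoop's `FileSystem` API (the class and method names here are hypothetical, not Hudi's):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: delete a commit's marker directory only if it exists,
// so a commit that crashed before writing any marker does not fail downgrade.
class MarkerDirCleaner {
  // Returns true if the directory was deleted, false if it was absent.
  static boolean deleteMarkerDirIfPresent(Path markerDir) throws IOException {
    if (!Files.exists(markerDir)) {
      return false; // crashed before any marker was created; nothing to do
    }
    // Delete children first: Files.delete fails on a non-empty directory.
    // (Assumes a flat marker directory for the sake of the sketch.)
    try (DirectoryStream<Path> children = Files.newDirectoryStream(markerDir)) {
      for (Path child : children) {
        Files.delete(child);
      }
    }
    Files.delete(markerDir);
    return true;
  }
}
```

With Hadoop's `FileSystem` the same pattern is `fs.exists(path)` before `fs.delete(path, true)`, which is what the missing check in the downgrade path comes down to.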
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739: URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028432638 ## CI report: * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739: URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028486554 ## CI report: * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
alexeykudinkin commented on a change in pull request #4739: URL: https://github.com/apache/hudi/pull/4739#discussion_r798117376 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java ## @@ -87,40 +89,58 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont * @return a list of metadata table records */ public static List convertMetadataToRecords(HoodieCommitMetadata commitMetadata, String instantTime) { -List records = new LinkedList<>(); -List allPartitions = new LinkedList<>(); -commitMetadata.getPartitionToWriteStats().forEach((partitionStatName, writeStats) -> { - final String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName; - allPartitions.add(partition); - - Map newFiles = new HashMap<>(writeStats.size()); - writeStats.forEach(hoodieWriteStat -> { -String pathWithPartition = hoodieWriteStat.getPath(); -if (pathWithPartition == null) { - // Empty partition - LOG.warn("Unable to find path in write stat to update metadata table " + hoodieWriteStat); - return; -} - -int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1; -String filename = pathWithPartition.substring(offset); -long totalWriteBytes = newFiles.containsKey(filename) -? 
newFiles.get(filename) + hoodieWriteStat.getTotalWriteBytes() -: hoodieWriteStat.getTotalWriteBytes(); -newFiles.put(filename, totalWriteBytes); - }); - // New files added to a partition - HoodieRecord record = HoodieMetadataPayload.createPartitionFilesRecord( - partition, Option.of(newFiles), Option.empty()); - records.add(record); -}); +List records = new ArrayList<>(commitMetadata.getPartitionToWriteStats().size()); + +// Add record bearing partitions list +ArrayList partitionsList = new ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet()); + + records.add(HoodieMetadataPayload.createPartitionListRecord(partitionsList)); + +// New files added to a partition +List> updatedFilesRecords = +commitMetadata.getPartitionToWriteStats().entrySet() +.stream() +.map(entry -> { + String partitionStatName = entry.getKey(); + List writeStats = entry.getValue(); + + String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName; + + HashMap updatedFilesToSizesMapping = + writeStats.stream().reduce(new HashMap<>(writeStats.size()), + (map, stat) -> { +String pathWithPartition = stat.getPath(); +if (pathWithPartition == null) { + // Empty partition + LOG.warn("Unable to find path in write stat to update metadata table " + stat); + return map; +} + +int offset = partition.equals(NON_PARTITIONED_NAME) +? (pathWithPartition.startsWith("/") ? 
1 : 0) +: partition.length() + 1; +String filename = pathWithPartition.substring(offset); + +// Since write-stats are coming in no particular order, if the same +// file have previously been appended to w/in the txn, we simply pick max +// of the sizes as reported after every write, since file-sizes are +// monotonically increasing (ie file-size never goes down, unless deleted) +map.merge(filename, stat.getFileSizeInBytes(), Math::max); Review comment: It does -- only case where we might provide something other than the file-size is `AppendHandle`, and it does set this to the full file size (it's a contract of this API) https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java#L417 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
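The `map.merge(filename, stat.getFileSizeInBytes(), Math::max)` line discussed above is the crux of the reduction: write-stats arrive in no particular order, but log-file sizes grow monotonically within a transaction, so the largest reported size is the most recent. A self-contained sketch of that rule (the `reconcile` helper and its input shape are illustrative, not Hudi's API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the reduction described above: when the same log file
// is reported by several write-stats within one transaction, keep the largest
// reported size, since a file's size only grows until the file is deleted.
class FileSizeReconciler {
  // Input entries are (filename, reported size) pairs in arbitrary order.
  static Map<String, Long> reconcile(List<Map.Entry<String, Long>> stats) {
    Map<String, Long> sizes = new HashMap<>();
    for (Map.Entry<String, Long> stat : stats) {
      // merge(): insert if absent, otherwise combine old and new with Math::max
      sizes.merge(stat.getKey(), stat.getValue(), Math::max);
    }
    return sizes;
  }
}
```

Note this replaces the earlier summing behavior: adding the sizes of two reports for the same file would double-count, whereas taking the max is order-independent and idempotent.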
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
alexeykudinkin commented on a change in pull request #4739: URL: https://github.com/apache/hudi/pull/4739#discussion_r798116440 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java ## @@ -215,19 +215,16 @@ public HoodieMetadataPayload preCombine(HoodieMetadataPayload previousRecord) { if (filesystemMetadata != null) { filesystemMetadata.forEach((filename, fileInfo) -> { -// If the filename wasnt present then we carry it forward -if (!combinedFileInfo.containsKey(filename)) { - combinedFileInfo.put(filename, fileInfo); +if (fileInfo.getIsDeleted()) { + combinedFileInfo.remove(filename); } else { - if (fileInfo.getIsDeleted()) { -// file deletion -combinedFileInfo.remove(filename); - } else { -// file appends. -combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) -> { - return new HoodieMetadataFileInfo(oldFileInfo.getSize() + newFileInfo.getSize(), false); -}); - } + // NOTE: There are 2 possible cases here: + //- New file is created: in that case we're simply adding its info + //- File is appended to (only log-files of MOR tables on supported FS): in that case + // we simply pick the info w/ largest file-size as the most recent one, since file's + // sizes are increasing monotonically (meaning that the larger file-size is more recent one) + combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) -> Review comment: Correct -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
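The preCombine rule confirmed above can be summarized as: a deletion tombstone removes the entry, otherwise the record with the larger (hence more recent) size wins. A minimal sketch with a simplified stand-in for `HoodieMetadataFileInfo` (field and class names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for HoodieMetadataFileInfo: just a size and a flag.
class FileInfo {
  final long size;
  final boolean deleted;
  FileInfo(long size, boolean deleted) { this.size = size; this.deleted = deleted; }
}

class MetadataCombiner {
  // Folds `incoming` file info into `combined`, mirroring the merge rule above:
  // deletions drop the entry; otherwise the larger size is kept as more recent.
  static void combine(Map<String, FileInfo> combined, Map<String, FileInfo> incoming) {
    incoming.forEach((filename, info) -> {
      if (info.deleted) {
        combined.remove(filename);
      } else {
        combined.merge(filename, info,
            (oldInfo, newInfo) -> newInfo.size > oldInfo.size ? newInfo : oldInfo);
      }
    });
  }
}
```

As with the max-size reduction in the previous change, this combine step no longer adds sizes together, so replaying the same update twice yields the same result.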
[GitHub] [hudi] yihua commented on pull request #4679: [HUDI-3315] RFC-35 Make Flink writer stream friendly
yihua commented on pull request #4679: URL: https://github.com/apache/hudi/pull/4679#issuecomment-1028475241 > cc @yihua could you also please review this from the angle of making the write client abstractions more friendly

I'll review this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin closed pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`
alexeykudinkin closed pull request #4560: URL: https://github.com/apache/hudi/pull/4560 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`
alexeykudinkin commented on pull request #4560: URL: https://github.com/apache/hudi/pull/4560#issuecomment-1028472992 Given that we're on a path to eventually deprecate `HoodieRecordPayload` this is unnecessary -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot removed a comment on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028457685 ## CI report: * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690) * e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot commented on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028469893 ## CI report: * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690) * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning
[ https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486166#comment-17486166 ]

Alexey Kudinkin commented on HUDI-1296:
---

I've done high-level scoping of this effort; at a high level we'd need to:

1. Implement a base HFile (Spark-compatible) reader
   - Similar to ParquetFileFormat.buildReaderWithPartitionValues
   - Used in MergeOnRead{Snapshot|Incremental}Relation, passed to HoodieMergeOnReadRDD
2. Modify MergeOnReadSnapshotRelation to not assume the base file format and instead deduce it based on the file extension

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
> Issue Type: Task
> Components: spark
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0

--
This message was sent by Atlassian Jira (v8.20.1#820001)
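Point 2 of the scoping comment above — deducing the base file format from the file extension instead of assuming it — can be sketched roughly as follows. This is a minimal, self-contained illustration; the class, enum, and method names are made up for the example and are not the actual Hudi API.

```java
// Hypothetical sketch of resolving a base file format from a path's
// extension (illustrative names, not the real Hudi classes).
public class BaseFileFormatResolver {

    public enum BaseFileFormat { PARQUET, HFILE, ORC }

    public static BaseFileFormat fromPath(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("No extension in path: " + path);
        }
        // Match on the lower-cased extension rather than assuming Parquet
        switch (path.substring(dot + 1).toLowerCase()) {
            case "parquet":
                return BaseFileFormat.PARQUET;
            case "hfile":
                return BaseFileFormat.HFILE;
            case "orc":
                return BaseFileFormat.ORC;
            default:
                throw new IllegalArgumentException("Unsupported base file format: " + path);
        }
    }
}
```

A relation could then pick the matching reader (Parquet vs. HFile) based on the resolved format instead of hard-coding one.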
[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning
[ https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-1296:
--

Story Points: 6 (was: 4)

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
> Issue Type: Task
> Components: spark
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0
[GitHub] [hudi] yihua commented on issue #4600: [SUPPORT]When hive queries Hudi data, the query path is wrong
yihua commented on issue #4600: URL: https://github.com/apache/hudi/issues/4600#issuecomment-1028469418

As @xiarixiaoyao mentions, compaction should compact log files into base file formats like parquet, which can then be read by Hive. There are different ways to trigger compaction, e.g., through inline/sync compaction, the standalone HoodieCompactor, or hudi-cli commands. @danny0405 does Flink SQL and writer support compaction?

@gubinjie When you refer to the Kafka connector, do you mean the Flink Kafka connector? In Hudi, we also provide a Kafka Connect Sink for Hudi.
[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication
hudi-bot commented on pull request #4559: URL: https://github.com/apache/hudi/pull/4559#issuecomment-1028468345

## CI report:

* 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
* 6d402f17d668a5a2d8bec5b8094fad6e997407b8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5688)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication
hudi-bot removed a comment on pull request #4559: URL: https://github.com/apache/hudi/pull/4559#issuecomment-1028380183

## CI report:

* 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
* c18bd337517c5842f2db1ee0075df19c05fafe91 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5487)
* 6d402f17d668a5a2d8bec5b8094fad6e997407b8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5688)
[GitHub] [hudi] nsivabalan commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator
nsivabalan commented on a change in pull request #4721: URL: https://github.com/apache/hudi/pull/4721#discussion_r798108001

## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java

````diff
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ *
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ *   - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes (default).
+ *
+ * Example command:
+ * ```
+ * spark-submit \
+ *   --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *   --master spark://:7077 \
+ *   --driver-memory 1g \
+ *   --executor-memory 1g \
+ *   $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *   --base-path basePath \
+ *   --min-validate-interval-seconds 60 \
+ *   --mode CONTINUOUS
+ * ```
+ *
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ *   - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
+ *
+ * Example command:
+ * ```
+ * spark-submit \
+ *   --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *   --master spark://:7077 \
+ *   --driver-memory 1g \
+ *   --executor-memory 1g \
+ *   $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *   --base-path basePath \
+ *   --min-validate-interval-seconds 60 \
+ *   --mode ONCE
+ * ```
+ */
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+    this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+    this.jsc = jsc;
+    this.cfg = cfg;
+
+    this.props = cfg.propsFilePath == null
+        ? UtilHelpers.buildProperties(cfg.configs)
+
````

(the quoted diff is truncated here in the archive)
[GitHub] [hudi] nsivabalan closed pull request #4734: [DO NOT MERGE] Testing CI 1
nsivabalan closed pull request #4734: URL: https://github.com/apache/hudi/pull/4734
[GitHub] [hudi] nsivabalan closed pull request #4732: [DO_NOT_MERGE][WIP] debugging some tests for metadata restore
nsivabalan closed pull request #4732: URL: https://github.com/apache/hudi/pull/4732
[GitHub] [hudi] nsivabalan closed pull request #4736: [DO NOT MERGE] Testing CI 3
nsivabalan closed pull request #4736: URL: https://github.com/apache/hudi/pull/4736
[GitHub] [hudi] nsivabalan closed pull request #4735: [DO NOT MERGE] Testing CI 2
nsivabalan closed pull request #4735: URL: https://github.com/apache/hudi/pull/4735
[GitHub] [hudi] nsivabalan closed pull request #4737: [DO NOT MERGE] Testing CI 4
nsivabalan closed pull request #4737: URL: https://github.com/apache/hudi/pull/4737
[GitHub] [hudi] nsivabalan closed pull request #4738: [DO NOT MERGE] Testing CI 5
nsivabalan closed pull request #4738: URL: https://github.com/apache/hudi/pull/4738
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028457463

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
* 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028462344

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
* 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
[GitHub] [hudi] nsivabalan commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
nsivabalan commented on a change in pull request #4739: URL: https://github.com/apache/hudi/pull/4739#discussion_r798098988

## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

```diff
@@ -215,19 +215,16 @@ public HoodieMetadataPayload preCombine(HoodieMetadataPayload previousRecord) {
     if (filesystemMetadata != null) {
       filesystemMetadata.forEach((filename, fileInfo) -> {
-        // If the filename wasnt present then we carry it forward
-        if (!combinedFileInfo.containsKey(filename)) {
-          combinedFileInfo.put(filename, fileInfo);
+        if (fileInfo.getIsDeleted()) {
+          combinedFileInfo.remove(filename);
         } else {
-          if (fileInfo.getIsDeleted()) {
-            // file deletion
-            combinedFileInfo.remove(filename);
-          } else {
-            // file appends.
-            combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) -> {
-              return new HoodieMetadataFileInfo(oldFileInfo.getSize() + newFileInfo.getSize(), false);
-            });
-          }
+          // NOTE: There are 2 possible cases here:
+          //   - New file is created: in that case we're simply adding its info
+          //   - File is appended to (only log-files of MOR tables on supported FS): in that case
+          //     we simply pick the info w/ largest file-size as the most recent one, since file's
+          //     sizes are increasing monotonically (meaning that the larger file-size is more recent one)
+          combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) ->
```

Review comment: merge func takes care of adding an entry for the first time and hence remove L219 and 220 ?
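The merge semantics discussed in this review thread can be modeled with a small, self-contained sketch. This is illustrative only (it uses a plain `Map<String, Long>` rather than the real `HoodieMetadataFileInfo`): a deletion drops the entry, and for non-deletes `Map.merge` both inserts first-seen files and keeps the larger size for files seen twice, since log-file sizes grow monotonically — which is why the explicit `containsKey` check in the old code is redundant.

```java
import java.util.Map;

// Minimal model of the preCombine merge logic above (not the actual
// HoodieMetadataPayload API): size stands in for the full file info.
public class FileInfoMerge {

    public static void combine(Map<String, Long> combined,
                               String filename, long size, boolean isDeleted) {
        if (isDeleted) {
            // File deletion: drop whatever we knew about this file
            combined.remove(filename);
        } else {
            // Map.merge inserts the entry when absent, and otherwise keeps
            // the larger (i.e. more recent) size for appended log files.
            combined.merge(filename, size, Math::max);
        }
    }
}
```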
## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

```diff
@@ -87,40 +89,58 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
    * @return a list of metadata table records
    */
   public static List convertMetadataToRecords(HoodieCommitMetadata commitMetadata, String instantTime) {
-    List records = new LinkedList<>();
-    List allPartitions = new LinkedList<>();
-    commitMetadata.getPartitionToWriteStats().forEach((partitionStatName, writeStats) -> {
-      final String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName;
-      allPartitions.add(partition);
-
-      Map newFiles = new HashMap<>(writeStats.size());
-      writeStats.forEach(hoodieWriteStat -> {
-        String pathWithPartition = hoodieWriteStat.getPath();
-        if (pathWithPartition == null) {
-          // Empty partition
-          LOG.warn("Unable to find path in write stat to update metadata table " + hoodieWriteStat);
-          return;
-        }
-
-        int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
-        String filename = pathWithPartition.substring(offset);
-        long totalWriteBytes = newFiles.containsKey(filename)
-            ? newFiles.get(filename) + hoodieWriteStat.getTotalWriteBytes()
-            : hoodieWriteStat.getTotalWriteBytes();
-        newFiles.put(filename, totalWriteBytes);
-      });
-      // New files added to a partition
-      HoodieRecord record = HoodieMetadataPayload.createPartitionFilesRecord(
-          partition, Option.of(newFiles), Option.empty());
-      records.add(record);
-    });
+    List records = new ArrayList<>(commitMetadata.getPartitionToWriteStats().size());
+
+    // Add record bearing partitions list
+    ArrayList partitionsList = new ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet());
+
+    records.add(HoodieMetadataPayload.createPartitionListRecord(partitionsList));
+
+    // New files added to a partition
+    List> updatedFilesRecords =
+        commitMetadata.getPartitionToWriteStats().entrySet()
+            .stream()
+            .map(entry -> {
+              String partitionStatName = entry.getKey();
+              List writeStats = entry.getValue();
+
+              String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName;
+
+              HashMap updatedFilesToSizesMapping =
+                  writeStats.stream().reduce(new HashMap<>(writeStats.size()),
+                      (map, stat) -> {
+                        String pathWithPartition = stat.getPath();
+                        if (pathWithPartition == null) {
+                          // Empty partition
+                          LOG.warn("Unable to find path in write stat to update metadata table " + stat);
+                          return map;
+                        }
+
+                        int offset = partition.equals(NON_PARTITIONED_NAME)
+                            ? (pathWithPartition.startsWith("/") ? 1 : 0)
+                            :
```

(the quoted diff is truncated here in the archive)
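The per-partition folding shown in the diff above — strip the partition prefix from each write-stat path, then accumulate total bytes written per file — can be sketched in isolation like this. The sketch is simplified and hypothetical: it uses `String[]` pairs of `{pathWithPartition, totalWriteBytes}` in place of the real `HoodieWriteStat` objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the write-stat folding (not the actual Hudi code).
public class WriteStatFold {

    // Stand-in for Hudi's NON_PARTITIONED_NAME constant (assumed value)
    static final String NON_PARTITIONED = ".";

    public static Map<String, Long> filesToSizes(String partition, List<String[]> stats) {
        Map<String, Long> map = new HashMap<>();
        for (String[] stat : stats) {
            String pathWithPartition = stat[0];
            if (pathWithPartition == null) {
                // Empty partition; the real code logs a warning here
                continue;
            }
            // Skip the "partition/" prefix (or a leading "/" for
            // non-partitioned tables) to recover the bare file name
            int offset = partition.equals(NON_PARTITIONED)
                ? (pathWithPartition.startsWith("/") ? 1 : 0)
                : partition.length() + 1;
            String filename = pathWithPartition.substring(offset);
            // Multiple stats for the same file sum their written bytes
            map.merge(filename, Long.parseLong(stat[1]), Long::sum);
        }
        return map;
    }
}
```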
[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot commented on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028459357

## CI report:

* 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot removed a comment on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028457602

## CI report:

* 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot commented on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028457685

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
* e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot removed a comment on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028455703

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
* e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028457463

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
* 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot commented on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028457602

## CI report:

* 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028455516

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
* 34b94b96b03109555201092bfabce21793add437 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
* 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type
hudi-bot removed a comment on pull request #4659: URL: https://github.com/apache/hudi/pull/4659#issuecomment-1019743003

## CI report:

* 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
[GitHub] [hudi] yihua commented on pull request #3866: [HUDI-1430] SparkDataFrameWriteClient
yihua commented on pull request #3866: URL: https://github.com/apache/hudi/pull/3866#issuecomment-1028456039

@xushiyan when this is mostly in shape, we can go through the code again.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot removed a comment on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028430624

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot commented on pull request #4705: URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028455703

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
* e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot commented on pull request #4556: URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028455516

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
* 34b94b96b03109555201092bfabce21793add437 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
* 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028382543

## CI report:

* 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
* 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
* 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
* 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
* c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
* 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
* 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
* f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
* 34b94b96b03109555201092bfabce21793add437 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
alexeykudinkin commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028453307

@yihua yeah, it's rebased on master now and ready for another pass
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798092122

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimeFileSplit.java

```diff
@@ -44,9 +44,7 @@
   private Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo = Option.empty();

-  public HoodieRealtimeFileSplit() {
-    super();
```

Review comment: We don't need to remove it, but there's also no point in keeping it

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

```diff
@@ -144,28 +204,32 @@
     return rtSplits.toArray(new InputSplit[0]);
   }

+  /**
+   * @deprecated will be replaced w/ {@link #getRealtimeSplits(Configuration, List)}
+   */
   // get IncrementalRealtimeSplits
-  public static InputSplit[] getIncrementalRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) throws IOException {
+  public static InputSplit[] getIncrementalRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
+    checkState(fileSplits.stream().allMatch(HoodieRealtimeInputFormatUtils::doesBelongToIncrementalQuery),
+        "All splits have to belong to incremental query");
+
     List<InputSplit> rtSplits = new ArrayList<>();
-    List<FileSplit> fileSplitList = fileSplits.collect(Collectors.toList());
-    Set<Path> partitionSet = fileSplitList.stream().map(f -> f.getPath().getParent()).collect(Collectors.toSet());
+    Set<Path> partitionSet = fileSplits.stream().map(f -> f.getPath().getParent()).collect(Collectors.toSet());
     Map<Path, HoodieTableMetaClient> partitionsToMetaClient = getTableMetaClientByPartitionPath(conf, partitionSet);
     // Pre process tableConfig from first partition to fetch virtual key info
     Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo = Option.empty();
     if (partitionSet.size() > 0) {
       hoodieVirtualKeyInfo = getHoodieVirtualKeyInfo(partitionsToMetaClient.get(partitionSet.iterator().next()));
     }
     Option<HoodieVirtualKeyInfo> finalHoodieVirtualKeyInfo = hoodieVirtualKeyInfo;
-    fileSplitList.stream().forEach(s -> {
+    fileSplits.stream().forEach(s -> {
       // deal with incremental query.
       try {
         if (s instanceof BaseFileWithLogsSplit) {
-          BaseFileWithLogsSplit bs = (BaseFileWithLogsSplit) s;
-          if (bs.getBelongToIncrementalSplit()) {
-            rtSplits.add(new HoodieRealtimeFileSplit(bs, bs.getBasePath(), bs.getDeltaLogFiles(), bs.getMaxCommitTime(), finalHoodieVirtualKeyInfo));
-          }
+          BaseFileWithLogsSplit bs = unsafeCast(s);
+          rtSplits.add(new HoodieRealtimeFileSplit(bs, bs.getBasePath(), bs.getDeltaLogFiles(), bs.getMaxCommitTime(), finalHoodieVirtualKeyInfo));
         } else if (s instanceof RealtimeBootstrapBaseFileSplit) {
-          rtSplits.add(s);
+          RealtimeBootstrapBaseFileSplit bs = unsafeCast(s);
```

Review comment: I see now. Makes sense

## File path: hudi-common/src/main/scala/org/apache/hudi/HoodieTableFileIndexBase.scala

```diff
@@ -87,6 +89,12 @@
   refresh0()

+  /**
+   * Returns latest completed instant as seen by this instance of the file-index
+   */
+  def latestCompletedInstant(): Option[HoodieInstant] =
```

Review comment: It's def closer to former

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

```diff
@@ -65,11 +65,71 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;

+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {

   private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);

-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
+    if (fileSplits.isEmpty()) {
+      return new InputSplit[0];
+    }
+
+    FileSplit fileSplit = fileSplits.get(0);
+
+    // Pre-process table-config to fetch virtual key info
+    Path partitionPath = fileSplit.getPath().getParent();
+    HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath);
+
+    Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient);
+
+    // NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
+    HoodieInstant latestCommitInstant =
+        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();
+
+    InputSplit[] finalSplits = fileSplits.stream()
+        .map(split -> {
+          // There are 4 types of splits we could have to handle here
+          //   - {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file,
+          //     but does NOT have any log files appended (convert it
```
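The `unsafeCast` helper imported in the diff above replaces the instanceof-then-cast boilerplate inside each branch. As a rough, self-contained sketch of that pattern (modeled on `org.apache.hudi.TypeUtils.unsafeCast`; the class name and demo values here are illustrative, not the Hudi source):

```java
// Sketch of a generic unchecked-cast helper in the spirit of
// org.apache.hudi.TypeUtils.unsafeCast; illustrative, not Hudi's code.
public class CastSketch {

  // Centralizes the @SuppressWarnings so call sites stay clean. The cast is
  // only "unsafe" at compile time: due to erasure, a wrong type assumption
  // still fails fast with a ClassCastException where the result is used.
  @SuppressWarnings("unchecked")
  public static <T> T unsafeCast(Object o) {
    return (T) o;
  }

  public static void main(String[] args) {
    Object split = "stand-in-for-a-FileSplit";
    // Caller has already established the runtime type (e.g. via instanceof),
    // so the explicit cast expression can be dropped at the call site.
    String s = unsafeCast(split);
    System.out.println(s.length());
  }
}
```

This pairs naturally with the `checkState(...)` precondition in the diff: the caller proves the invariant once up front, and the cast then just records that proof for the compiler.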
[GitHub] [hudi] yihua commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s
yihua commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028442601

@alexeykudinkin is this PR ready for another look, or are you still addressing comments?
[GitHub] [hudi] hudi-bot commented on pull request #4667: [HUDI-3276][Stacked on 4559] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat`
hudi-bot commented on pull request #4667:
URL: https://github.com/apache/hudi/pull/4667#issuecomment-1028440018

## CI report:

* ed1df9c2c6a5c79a5b450cf37e783fddfe861d35 UNKNOWN
* 29733a0d437997485b21327a6c256233e35c4d3b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5686)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4667: [HUDI-3276][Stacked on 4559] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat`
hudi-bot removed a comment on pull request #4667:
URL: https://github.com/apache/hudi/pull/4667#issuecomment-1028372936

## CI report:

* ed1df9c2c6a5c79a5b450cf37e783fddfe861d35 UNKNOWN
* b504aa798fa399e7b162203627490f9090656a32 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5495)
* 29733a0d437997485b21327a6c256233e35c4d3b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5686)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on pull request #4078: [HUDI-2833][Design] Merge small archive files instead of expanding indefinitely.
yihua commented on pull request #4078:
URL: https://github.com/apache/hudi/pull/4078#issuecomment-1028439247

cc @vinothchandar this PR adds new functionality in the archived timeline behind a feature flag, plus a piece of error-handling logic that cannot be feature-flagged. You may want to take another look.
[GitHub] [hudi] hudi-bot commented on pull request #4669: [HUDI-3239][Stacked on 4667] Convert `BaseHoodieTableFileIndex` to Java
hudi-bot commented on pull request #4669:
URL: https://github.com/apache/hudi/pull/4669#issuecomment-1028436169

## CI report:

* 54e68ec73aa7292556d5f00c90859a776acc109f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5687)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4669: [HUDI-3239][Stacked on 4667] Convert `BaseHoodieTableFileIndex` to Java
hudi-bot removed a comment on pull request #4669:
URL: https://github.com/apache/hudi/pull/4669#issuecomment-1028372962

## CI report:

* 851e61cdb31748501d28d638bc192ccc4955d665 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5494)
* 54e68ec73aa7292556d5f00c90859a776acc109f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5687)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on pull request #4346: [HUDI-3045] New clustering regex match config to choose partitions when building clustering plan
yihua commented on pull request #4346:
URL: https://github.com/apache/hudi/pull/4346#issuecomment-1028433650

> @yihua I feel it would be better to add a new option in `ClusteringPlanPartitionFilterMode` rather than doing regex in place.

Yes, that could allow more flexible filtering. @zhangyue19921010 @YuweiXiao do either of you want to take a stab at this before the 0.11.0 release?
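For context, the filter-mode approach suggested in this thread would slot regex matching into the existing `ClusteringPlanPartitionFilterMode` dispatch rather than doing regex handling in place. A minimal, hypothetical sketch of what such a mode could compute (the class, method, and pattern here are illustrative assumptions, not Hudi's API):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of a regex-based partition filter for clustering
// planning; names are illustrative, not part of Hudi.
public class ClusteringPartitionRegexFilter {

  // Keeps only the partition paths that fully match the configured pattern,
  // e.g. "2022/01/.*" to cluster a single month of a date-partitioned table.
  public static List<String> filterByRegex(List<String> partitionPaths, String regex) {
    Pattern pattern = Pattern.compile(regex);
    return partitionPaths.stream()
        .filter(path -> pattern.matcher(path).matches())
        .collect(Collectors.toList());
  }
}
```

Expressing this as a dedicated filter mode keeps the partition-selection strategy pluggable alongside the other existing modes instead of special-casing regex logic in the planner.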
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028430746

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028432638

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028430746

## CI report:

* 5967651c87dc1a020e82a9a92de0f20ebeefb785 UNKNOWN

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028399211

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028430624

## CI report:

* fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
* 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
* 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build