[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028678659


   
   ## CI report:
   
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   * 540459a7e5a71f01b8e052424ccabad5a25b840e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028676981


   
   ## CI report:
   
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   * 540459a7e5a71f01b8e052424ccabad5a25b840e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028676981


   
   ## CI report:
   
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   * 540459a7e5a71f01b8e052424ccabad5a25b840e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028584032


   
   ## CI report:
   
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1028335503


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * d78e61c34acac7e23477f196388076cdd822dd69 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5679)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5681)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1028676669


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * d78e61c34acac7e23477f196388076cdd822dd69 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5679)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5681)
 
   * e48904554517673727ee5a0cb4055579f39e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028601646


   
   ## CI report:
   
   * 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028572870


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   * 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798204981



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -65,11 +65,70 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
 
   private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);
 
-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
+    if (fileSplits.isEmpty()) {
+      return new InputSplit[0];
+    }
+
+    FileSplit fileSplit = fileSplits.get(0);
+
+    // Pre-process table-config to fetch virtual key info
+    Path partitionPath = fileSplit.getPath().getParent();
+    HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath);
+
+    Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient);
+
+    // NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
+    HoodieInstant latestCommitInstant =
+        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();
+
+    InputSplit[] finalSplits = fileSplits.stream()
+      .map(split -> {
+        // There are 4 types of splits could we have to handle here
+        //    - {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file,
+        //      but does NOT have any log files appended (convert it to {@code RealtimeBootstrapBaseFileSplit})
+        //    - {@code RealtimeBootstrapBaseFileSplit}: in case base file does have associated bootstrap file
+        //      and does have log files appended
+        //    - {@code BaseFileWithLogsSplit}: in case base file does NOT have associated bootstrap file
+        //       and does have log files appended;
+        //    - {@code FileSplit}: in case Hive passed down non-Hudi path
+        if (split instanceof RealtimeBootstrapBaseFileSplit) {
+          return split;
+        } else if (split instanceof BootstrapBaseFileSplit) {
+          BootstrapBaseFileSplit bootstrapBaseFileSplit = unsafeCast(split);
+          return createRealtimeBoostrapBaseFileSplit(
+              bootstrapBaseFileSplit,
+              metaClient.getBasePath(),
+              Collections.emptyList(),
+              latestCommitInstant.getTimestamp(),
+              false);
+        } else if (split instanceof BaseFileWithLogsSplit) {
+          BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);
Review comment:
   Yes, it's in sync. However, you brought up a very good point: the 
instant shouldn't actually be set here. This will be cleaned up in subsequent 
PRs, where `HoodieRealtimeFileSplit` will be merged with `BaseFileWithLogsSplit`.
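
The `instanceof` chain in the hunk above can be sketched in isolation. The classes below are hypothetical stand-ins, not the real Hadoop/Hudi types, and the sketch assumes that `RealtimeBootstrapBaseFileSplit` is a subclass of `BootstrapBaseFileSplit`, which is what would make the order of the checks significant:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitDispatchSketch {
  // Hypothetical stand-ins for the split classes named in the diff; not the real types.
  static class FileSplit {}
  static class BootstrapBaseFileSplit extends FileSplit {}
  // Assumption: the realtime variant extends the bootstrap variant.
  static class RealtimeBootstrapBaseFileSplit extends BootstrapBaseFileSplit {}
  static class BaseFileWithLogsSplit extends FileSplit {}

  // Returns a label naming the branch that handles the given split.
  static String classify(FileSplit split) {
    // The subclass is tested before its parent; otherwise the parent branch would capture it.
    if (split instanceof RealtimeBootstrapBaseFileSplit) {
      return "realtime-bootstrap: returned as-is";
    } else if (split instanceof BootstrapBaseFileSplit) {
      return "bootstrap: converted, no log files";
    } else if (split instanceof BaseFileWithLogsSplit) {
      return "base-file-with-logs";
    } else {
      return "plain: non-Hudi path";
    }
  }

  public static void main(String[] args) {
    List<FileSplit> splits = Arrays.asList(
        new RealtimeBootstrapBaseFileSplit(),
        new BootstrapBaseFileSplit(),
        new BaseFileWithLogsSplit(),
        new FileSplit());
    List<String> labels = splits.stream()
        .map(SplitDispatchSketch::classify)
        .collect(Collectors.toList());
    labels.forEach(System.out::println);
  }
}
```

Swapping the first two branches in this sketch would send every realtime bootstrap split down the conversion path, which is one reason to test the most specific type first.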








[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798204981



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -65,11 +65,70 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
 
   private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);
 
-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
+    if (fileSplits.isEmpty()) {
+      return new InputSplit[0];
+    }
+
+    FileSplit fileSplit = fileSplits.get(0);
+
+    // Pre-process table-config to fetch virtual key info
+    Path partitionPath = fileSplit.getPath().getParent();
+    HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath);
+
+    Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient);
+
+    // NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
+    HoodieInstant latestCommitInstant =
+        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();
+
+    InputSplit[] finalSplits = fileSplits.stream()
+      .map(split -> {
+        // There are 4 types of splits could we have to handle here
+        //    - {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file,
+        //      but does NOT have any log files appended (convert it to {@code RealtimeBootstrapBaseFileSplit})
+        //    - {@code RealtimeBootstrapBaseFileSplit}: in case base file does have associated bootstrap file
+        //      and does have log files appended
+        //    - {@code BaseFileWithLogsSplit}: in case base file does NOT have associated bootstrap file
+        //       and does have log files appended;
+        //    - {@code FileSplit}: in case Hive passed down non-Hudi path
+        if (split instanceof RealtimeBootstrapBaseFileSplit) {
+          return split;
+        } else if (split instanceof BootstrapBaseFileSplit) {
+          BootstrapBaseFileSplit bootstrapBaseFileSplit = unsafeCast(split);
+          return createRealtimeBoostrapBaseFileSplit(
+              bootstrapBaseFileSplit,
+              metaClient.getBasePath(),
+              Collections.emptyList(),
+              latestCommitInstant.getTimestamp(),
+              false);
+        } else if (split instanceof BaseFileWithLogsSplit) {
+          BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);

Review comment:
   It does








[jira] [Closed] (HUDI-3337) ParquetUtils fails extracting Parquet Column Range Metadata

2022-02-02 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-3337.
-
Resolution: Fixed

> ParquetUtils fails extracting Parquet Column Range Metadata
> ---
>
> Key: HUDI-3337
> URL: https://issues.apache.org/jira/browse/HUDI-3337
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> [~manojpec] discovered the following issue while testing MT flows, with 
> {{TestHoodieBackedMetadata#testTableOperationsWithMetadataIndex}} failing 
> with:
>  
> {code:java}
> 17400 [Executor task launch worker for task 240] ERROR 
> org.apache.hudi.metadata.HoodieTableMetadataUtil  - Failed to read column 
> stats for 
> /var/folders/t7/kr69rlvx5rdd824m61zjqkjrgn/T/junit2402861080324269156/dataset/2016/03/15/44396fda-48db-4d10-9f47-275c39317115-0_0-101-234_003.parquet
> java.lang.ClassCastException: 
> org.apache.parquet.io.api.Binary$ByteArrayBackedBinary cannot be cast to 
> java.lang.Integer
>   at 
> org.apache.hudi.common.util.ParquetUtils.convertToNativeJavaType(ParquetUtils.java:369)
>   at 
> org.apache.hudi.common.util.ParquetUtils.lambda$null$2(ParquetUtils.java:305)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readRangeFromParquetMetadata(ParquetUtils.java:313)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getColumnStats(HoodieTableMetadataUtil.java:878)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.translateWriteStatToColumnStats(HoodieTableMetadataUtil.java:858)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$createColumnStatsFromWriteStats$7e2376a$1(HoodieTableMetadataUtil.java:819)
>   at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:134)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> 
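
The `ClassCastException` in the trace above is the kind of failure an unchecked cast produces when a value's runtime class differs from the one the code assumes. The sketch below reproduces that failure mode in plain Java and contrasts it with a conversion that inspects the runtime type first. `BinaryValue`, `unsafeToInt`, and `toNative` are hypothetical names invented for illustration; this is not the Parquet API, and it is not claimed to be the change that resolved HUDI-3337.

```java
import java.nio.charset.StandardCharsets;

public class StatValueCastSketch {
  // Hypothetical stand-in for a binary-backed statistics value; not the Parquet class.
  static final class BinaryValue {
    final byte[] bytes;

    BinaryValue(byte[] bytes) {
      this.bytes = bytes;
    }
  }

  // Unchecked conversion: throws ClassCastException when the value is not an Integer.
  static Integer unsafeToInt(Object value) {
    return (Integer) value;
  }

  // Guarded conversion: checks the runtime type before converting.
  static Object toNative(Object value) {
    if (value == null) {
      return null;
    } else if (value instanceof Integer) {
      return value;
    } else if (value instanceof BinaryValue) {
      // Assumption for this sketch: binary values hold UTF-8 text.
      return new String(((BinaryValue) value).bytes, StandardCharsets.UTF_8);
    }
    throw new IllegalArgumentException("Unsupported value type: " + value.getClass().getName());
  }

  public static void main(String[] args) {
    Object stat = new BinaryValue("abc".getBytes(StandardCharsets.UTF_8));
    try {
      unsafeToInt(stat);
    } catch (ClassCastException e) {
      System.out.println("unchecked cast failed: " + e.getMessage());
    }
    System.out.println("guarded conversion: " + toNative(stat));
  }
}
```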

[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028584032


   
   ## CI report:
   
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028556982


   
   ## CI report:
   
   * 891d9658daa099eb50560741086aac23924e3600 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
 
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


yihua commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798194069



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS
+ * ```
+ *
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode ONCE
+ * ```
+ *
+ */
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+    this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+    this.jsc = jsc;
+    this.cfg = cfg;
+
+    this.props = cfg.propsFilePath == null
+        ? UtilHelpers.buildProperties(cfg.configs)
+   
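
As a rough illustration of the comparison the Javadoc above describes (listings from the metadata table versus listings from the filesystem), the following sketch diffs two lists of names in both directions. It is not the validator's code: `missingFrom` and `validate` are hypothetical helpers, and plain strings stand in for partition paths or file names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ListingComparisonSketch {
  // Returns entries present in `left` but absent from `right`, sorted for stable output.
  static Set<String> missingFrom(List<String> left, List<String> right) {
    Set<String> diff = new TreeSet<>(left);
    diff.removeAll(right);
    return diff;
  }

  // Throws if the two listings disagree in either direction; listing order is ignored.
  static void validate(List<String> fromMetadataTable, List<String> fromFileSystem) {
    Set<String> onlyInMetadata = missingFrom(fromMetadataTable, fromFileSystem);
    Set<String> onlyOnFileSystem = missingFrom(fromFileSystem, fromMetadataTable);
    if (!onlyInMetadata.isEmpty() || !onlyOnFileSystem.isEmpty()) {
      throw new IllegalStateException("Listing mismatch: only in metadata table " + onlyInMetadata
          + ", only on filesystem " + onlyOnFileSystem);
    }
  }

  public static void main(String[] args) {
    // Made-up partition names used purely as sample input.
    List<String> metadataListing = Arrays.asList("2016/03/15", "2016/03/16");
    List<String> fileSystemListing = Arrays.asList("2016/03/16", "2016/03/15", "2016/03/17");
    try {
      validate(metadataListing, fileSystemListing);
      System.out.println("listings match");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Checking both directions matters in this sketch because an entry missing from the metadata table and a stale entry left in it are different kinds of mismatch.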

[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028572870


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   * 071c6180b5023f782da229552a6d3f63d1e4a67b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5697)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028571985


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   * 071c6180b5023f782da229552a6d3f63d1e4a67b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028486554


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028571985


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   * 071c6180b5023f782da229552a6d3f63d1e4a67b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Commented] (HUDI-3362) Hudi 0.8.0 cannot roleback MoR table

2022-02-02 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486197#comment-17486197
 ] 

sivabalan narayanan commented on HUDI-3362:
---

[~vinish_jail97]: we might need to test savepoint restore w/ clustering 
sooner; there may be some gaps. 

 

> Hudi 0.8.0 cannot roleback MoR table
> 
>
> Key: HUDI-3362
> URL: https://issues.apache.org/jira/browse/HUDI-3362
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Istvan Darvas
>Priority: Blocker
> Attachments: hoodie.zip, rollback_20220131215514.txt, 
> rollback_log.txt, rollback_log_v2.txt
>
>
> Hi Guys,
>  
> Environment: AWS EMR 6.4 / Hudi v0.8.0
> Problem: I have a MoR table which is ingested by DeltaStreamer (batch style: 
> every 5 minutes from Kafka), and after a certain time, DeltaStreamer stops 
> working with a message like this:
>  
> {{diagnostics: User class threw exception: 
> org.apache.hudi.exception.HoodieRollbackException: Found commits after time 
> :20220131215051, please rollback greater commits first}}
>  
> It is usually a replace commit; I am fairly sure of this.
> I have these commits in the timeline:
>  
> 20220131214354<-before
> 20220131215051<-error message
> 20220131215514<-after
>  
> So, as I was advised, I tried to roll back with the following steps in 
> hudi-cli:
> 1.) connect --path s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep / 
> SUCCESS
> 2.) savepoint create --commit 20220131214354 --sparkMaster local[2] / SUCCESS
> 3.) savepoint rollback --savepoint 20220131214354 --sparkMaster local[2] / 
> FAILED
> 4.) savepoint create --commit 20220131215514 --sparkMaster local[2] / SUCCESS
> 5.) savepoint rollback --savepoint 20220131215514 --sparkMaster local[2] / 
> FAILED
>  
> In short, when I run into a situation like this, I am not able to resolve it 
> with the known methods ;) My use case is a work in progress, but I cannot go 
> to production with an issue like this.
>  
> My question: what would be the right steps / commands to resolve an issue like 
> this and be able to restart DeltaStreamer again?
>  
> This table does not have dimension data, so I am happy to share the whole 
> table if someone is curious (if that is needed or would be helpful, let's talk 
> by private mail / Slack about the sharing). It is ~15GB ;) It stopped after a 
> few runs, actually after the first clustering.
>  
> I use this clustering config in DeltaStreamer:
> hoodie.clustering.inline=true
> hoodie.clustering.inline.enabled=true
> hoodie.clustering.inline.max.commits=36
> hoodie.clustering.plan.strategy.sort.columns=correlation_id
> hoodie.clustering.plan.strategy.daybased.lookback.partitions=7
> hoodie.clustering.plan.strategy.target.file.max.bytes=268435456
> hoodie.clustering.plan.strategy.small.file.limit=134217728
> hoodie.clustering.plan.strategy.max.bytes.per.group=671088640
>  
> I hope someone can help me tackle this, because if I am able 
> to solve this manually, I would be confident to go to production.
> So thanks in advance,
> Darvi
> Slack Hudi: istvan darvas / U02NTACPHPU
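
The exception quoted in the report asks for commits later than the named instant to be rolled back first. The instants in the report are fixed-width 14-digit timestamps, so plain string comparison orders them chronologically. The following is only a minimal sketch of that ordering check, assuming a hypothetical helper `instantsAfter` and class `LaterInstants`; it is not Hudi's implementation and does not touch a real timeline.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Hudi.
class LaterInstants {

  // Returns the instants strictly later than `target`, newest first.
  // Fixed-width digit strings compare lexicographically in time order.
  static List<String> instantsAfter(List<String> timeline, String target) {
    return timeline.stream()
        .filter(instant -> instant.compareTo(target) > 0)
        .sorted(Comparator.reverseOrder())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // The three instants listed in the report.
    List<String> timeline =
        Arrays.asList("20220131214354", "20220131215051", "20220131215514");
    System.out.println(instantsAfter(timeline, "20220131215051"));
  }
}
```

For the three instants in the report, only 20220131215514 is later than the instant named in the error message.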



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3362) Hudi 0.8.0 cannot roleback MoR table

2022-02-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-3362:
-

Assignee: sivabalan narayanan






[GitHub] [hudi] zhangyue19921010 commented on pull request #4346: [HUDI-3045] New clustering regex match config to choose partitions when building clustering plan

2022-02-02 Thread GitBox


zhangyue19921010 commented on pull request #4346:
URL: https://github.com/apache/hudi/pull/4346#issuecomment-1028557537


   Sure, pick it up






[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028556982


   
   ## CI report:
   
   * 891d9658daa099eb50560741086aac23924e3600 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
 
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5696)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028555830


   
   ## CI report:
   
   * 891d9658daa099eb50560741086aac23924e3600 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
 
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 UNKNOWN
   
   




[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798177169



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS
+ * ```
+ *
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode ONCE
+ * ```
+ *
+ */
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+    this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+    this.jsc = jsc;
+    this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? 

[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1028555830


   
   ## CI report:
   
   * 891d9658daa099eb50560741086aac23924e3600 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
 
   * 86b335a77fb92cf34cb5e653694be626f6c7eba4 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1027541190


   
   ## CI report:
   
   * 891d9658daa099eb50560741086aac23924e3600 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5669)
 
   
   




[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798176659




[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798176560



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS

Review comment:
   Sure, changed.
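
   The javadoc in the diff above describes comparing the partition and file listings obtained from the metadata table with those obtained from the filesystem. A minimal sketch of such a comparison, assuming a hypothetical helper `validateSameListing` that works on plain string lists (the PR's actual code works on Hudi types and is not reproduced here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not the validator from the PR.
class ListingComparison {

  // Sorts copies of both listings and throws if they differ.
  static void validateSameListing(List<String> fromMetadataTable, List<String> fromFileSystem) {
    List<String> metadataSorted = new ArrayList<>(fromMetadataTable);
    List<String> fileSystemSorted = new ArrayList<>(fromFileSystem);
    Collections.sort(metadataSorted);
    Collections.sort(fileSystemSorted);
    if (!metadataSorted.equals(fileSystemSorted)) {
      throw new IllegalStateException(
          "Listing mismatch: metadata table=" + metadataSorted
              + ", filesystem=" + fileSystemSorted);
    }
  }

  public static void main(String[] args) {
    // Same partitions in a different order: passes without throwing.
    validateSameListing(
        Arrays.asList("2022/02/01", "2022/02/02"),
        Arrays.asList("2022/02/02", "2022/02/01"));
    System.out.println("listings match");
  }
}
```

   Sorting copies before comparing keeps the check independent of listing order and leaves the caller's lists unmodified.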








[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798176463



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS
+ * ```
+ *
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:

Review comment:
   Changed.
   Just two kinds of example now:
   1. --continuous + --min-validate-interval-seconds.
   2. default which will validate once.
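
   The two cases listed in this comment (validate once by default, or keep validating with `--continuous` and `--min-validate-interval-seconds`) can be pictured with the rough sketch below. The class `ValidationLoop`, the method `run`, and the `maxRounds` cap are assumptions made so the example terminates; they are not taken from the PR.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not the validator from the PR.
class ValidationLoop {

  // Runs `validation` once, or repeatedly when `continuous` is true,
  // waiting until at least `minIntervalSeconds` have passed per round.
  // `maxRounds` is a hypothetical cap so this sketch always terminates.
  static int run(boolean continuous, long minIntervalSeconds, int maxRounds, Runnable validation) {
    int rounds = 0;
    while (true) {
      long startNanos = System.nanoTime();
      validation.run();
      rounds++;
      if (!continuous || rounds >= maxRounds) {
        return rounds;
      }
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      long remainingMillis = TimeUnit.SECONDS.toMillis(minIntervalSeconds) - elapsedMillis;
      if (remainingMillis > 0) {
        try {
          Thread.sleep(remainingMillis);
        } catch (InterruptedException e) {
          // Restore the interrupt flag and stop looping.
          Thread.currentThread().interrupt();
          return rounds;
        }
      }
    }
  }

  public static void main(String[] args) {
    // Default case: a single validation round.
    System.out.println(run(false, 60, 10, () -> System.out.println("validate")));
  }
}
```

   Measuring elapsed time per round means a slow validation does not add the full interval on top of its own duration.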








[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028547091


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028522181


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694)
 
   
   




[hudi] branch master updated (819e801 -> d681824)

2022-02-02 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 819e801  [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716)
 add d681824  [HUDI-3337] Fixing Parquet Column Range metadata extraction (#4705)

No new revisions were added by this update.

Summary of changes:
 .../index/columnstats/ColumnStatsIndexHelper.java  |  11 +-
 .../common/model/HoodieColumnRangeMetadata.java|  10 +-
 .../org/apache/hudi/common/util/ParquetUtils.java  |  45 +++-
 ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json |  10 +
 ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json |  10 +
 ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json |  10 +
 ...-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json |  10 +
 ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json |  10 +
 ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json |  10 +
 ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json |  10 +
 ...-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json |  10 +
 .../index/zorder/z-index-table-merged.json |  16 +-
 .../test/resources/index/zorder/z-index-table.json |   8 +-
 .../hudi/functional/TestColumnStatsIndex.scala | 246 -
 .../hudi/functional/TestLayoutOptimization.scala   |  18 --
 15 files changed, 323 insertions(+), 111 deletions(-)
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-0-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-1-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-2-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/another-input-table-json/part-3-7e680484-e7e1-48b6-8289-1a7c483b530b-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-0-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-1-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-2-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json
 create mode 100644 hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/input-table-json/part-3-4468afca-8a37-4ae8-a150-0c2fd3361080-c000.json


[GitHub] [hudi] nsivabalan merged pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


nsivabalan merged pull request #4705:
URL: https://github.com/apache/hudi/pull/4705


   






[GitHub] [hudi] nsivabalan commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


nsivabalan commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798155468



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -65,11 +65,70 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
 
   private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);
 
-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {
+if (fileSplits.isEmpty()) {
+  return new InputSplit[0];
+}
+
+FileSplit fileSplit = fileSplits.get(0);
+
+// Pre-process table-config to fetch virtual key info
+Path partitionPath = fileSplit.getPath().getParent();
+HoodieTableMetaClient metaClient = getTableMetaClientForBasePathUnchecked(conf, partitionPath);
+
+Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfoOpt = getHoodieVirtualKeyInfo(metaClient);
+
+// NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
+HoodieInstant latestCommitInstant =
+metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();
+
+InputSplit[] finalSplits = fileSplits.stream()
+  .map(split -> {
+// There are 4 types of splits could we have to handle here
+//- {@code BootstrapBaseFileSplit}: in case base file does have associated bootstrap file,
+//  but does NOT have any log files appended (convert it to {@code RealtimeBootstrapBaseFileSplit})
+//- {@code RealtimeBootstrapBaseFileSplit}: in case base file does have associated bootstrap file
+//  and does have log files appended
+//- {@code BaseFileWithLogsSplit}: in case base file does NOT have associated bootstrap file
+//   and does have log files appended;
+//- {@code FileSplit}: in case Hive passed down non-Hudi path
+if (split instanceof RealtimeBootstrapBaseFileSplit) {
+  return split;
+} else if (split instanceof BootstrapBaseFileSplit) {
+  BootstrapBaseFileSplit bootstrapBaseFileSplit = unsafeCast(split);
+  return createRealtimeBoostrapBaseFileSplit(
+  bootstrapBaseFileSplit,
+  metaClient.getBasePath(),
+  Collections.emptyList(),
+  latestCommitInstant.getTimestamp(),
+  false);
+} else if (split instanceof BaseFileWithLogsSplit) {
+  BaseFileWithLogsSplit baseFileWithLogsSplit = unsafeCast(split);

Review comment:
   Will the maxCommitTime in baseFileSplit be in sync with the latestCommitInstant computed at L89? Prior to this patch, we used the latestCommitInstant computed here, whereas now we just reuse the one that comes from BaseFileWithLogsSplit.
   Just wanted to confirm, as this code is new to me.
   

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -65,11 +65,70 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
 
   private static final Logger LOG = LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);
 
-  public static InputSplit[] getRealtimeSplits(Configuration conf, Stream<FileSplit> fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, List<FileSplit> fileSplits) throws IOException {

Review comment:
   This refactoring makes total sense, assuming each FileSplit corresponds to exactly one FileSlice and there is never a case where multiple FileSplits store info about a single FileSlice.
   Thanks for doing this.








[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028522181


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5694)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028513791


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] alexeykudinkin commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


alexeykudinkin commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028521365


   @hudi-bot run azure






[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028521054


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028469893


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
 
   * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028462344


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 34b94b96b03109555201092bfabce21793add437 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
 
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028513791


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Created] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests

2022-02-02 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-3366:
---

 Summary: Remove unnecessary hardcoded logic of disabling metadata table in tests
 Key: HUDI-3366
 URL: https://issues.apache.org/jira/browse/HUDI-3366
 Project: Apache Hudi
  Issue Type: Task
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests

2022-02-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3366:

Fix Version/s: 0.11.0

> Remove unnecessary hardcoded logic of disabling metadata table in tests
> ---
>
> Key: HUDI-3366
> URL: https://issues.apache.org/jira/browse/HUDI-3366
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests

2022-02-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3366:

Priority: Blocker  (was: Major)

> Remove unnecessary hardcoded logic of disabling metadata table in tests
> ---
>
> Key: HUDI-3366
> URL: https://issues.apache.org/jira/browse/HUDI-3366
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>






[jira] [Assigned] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests

2022-02-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-3366:
---

Assignee: Ethan Guo

> Remove unnecessary hardcoded logic of disabling metadata table in tests
> ---
>
> Key: HUDI-3366
> URL: https://issues.apache.org/jira/browse/HUDI-3366
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028504428


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028459357


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


yihua commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798125748



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS
+ * ```
+ *
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode ONCE
+ * ```
+ *
+ */
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option 
asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+   

[GitHub] [hudi] yihua commented on issue #4666: [SUPPORT] Table downgrade fails to delete non-existing file

2022-02-02 Thread GitBox


yihua commented on issue #4666:
URL: https://github.com/apache/hudi/issues/4666#issuecomment-1028487024


   > hey @ganczarek : looks like there is a bug: https://issues.apache.org/jira/browse/HUDI-3346. We will work on the fix; it should be straightforward. At least in this case, manually deleting the commit meta file just for this instant should be fine. Here is what could have resulted in this.
   > 
   > Just before the downgrade, a commit was started, but the process crashed before the commit went inflight or before a single marker file could be created. So no marker dir was ever created for this commit. The downgrade code fails to check for the directory's existence in one place (there are other places where this check is made), and so it failed.
   > 
   > I have created a tracking [ticket](https://issues.apache.org/jira/browse/HUDI-3346) here. We are good to close this.
   
   Sorry, I missed this. There should be a check that the marker directory exists before trying to delete it; that check was missing.
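
   The missing guard can be sketched roughly as follows. This is only an illustration on the local filesystem using `java.nio.file`, not Hudi's actual downgrade code (which goes through Hadoop's `FileSystem`); the class name `MarkerDirCleanup` and the method `deleteMarkerDirIfExists` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class MarkerDirCleanup {

  private MarkerDirCleanup() {
  }

  /**
   * Hypothetical helper: deletes the directory tree rooted at markerDir,
   * but only if it exists. Returns true if something was deleted.
   */
  public static boolean deleteMarkerDirIfExists(Path markerDir) {
    if (!Files.exists(markerDir)) {
      // The commit crashed before any marker was written: nothing to clean up.
      return false;
    }
    try (Stream<Path> paths = Files.walk(markerDir)) {
      // Reverse lexicographic order visits children before their parents.
      List<Path> deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      for (Path path : deepestFirst) {
        Files.delete(path);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return true;
  }
}
```

   With a guard like this, a commit that never created its marker directory becomes a no-op during cleanup instead of a failure.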






[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028432638


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028486554


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


alexeykudinkin commented on a change in pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r798117376



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -87,40 +89,58 @@ public static void deleteMetadataTable(String basePath, 
HoodieEngineContext cont
* @return a list of metadata table records
*/
   public static List 
convertMetadataToRecords(HoodieCommitMetadata commitMetadata, String 
instantTime) {
-List records = new LinkedList<>();
-List allPartitions = new LinkedList<>();
-commitMetadata.getPartitionToWriteStats().forEach((partitionStatName, 
writeStats) -> {
-  final String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) 
? NON_PARTITIONED_NAME : partitionStatName;
-  allPartitions.add(partition);
-
-  Map newFiles = new HashMap<>(writeStats.size());
-  writeStats.forEach(hoodieWriteStat -> {
-String pathWithPartition = hoodieWriteStat.getPath();
-if (pathWithPartition == null) {
-  // Empty partition
-  LOG.warn("Unable to find path in write stat to update metadata table 
" + hoodieWriteStat);
-  return;
-}
-
-int offset = partition.equals(NON_PARTITIONED_NAME) ? 
(pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
-String filename = pathWithPartition.substring(offset);
-long totalWriteBytes = newFiles.containsKey(filename)
-? newFiles.get(filename) + hoodieWriteStat.getTotalWriteBytes()
-: hoodieWriteStat.getTotalWriteBytes();
-newFiles.put(filename, totalWriteBytes);
-  });
-  // New files added to a partition
-  HoodieRecord record = HoodieMetadataPayload.createPartitionFilesRecord(
-  partition, Option.of(newFiles), Option.empty());
-  records.add(record);
-});
+List records = new 
ArrayList<>(commitMetadata.getPartitionToWriteStats().size());
+
+// Add record bearing partitions list
+ArrayList partitionsList = new 
ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet());
+
+records.add(HoodieMetadataPayload.createPartitionListRecord(partitionsList));
+
+// New files added to a partition
+List> updatedFilesRecords =
+commitMetadata.getPartitionToWriteStats().entrySet()
+.stream()
+.map(entry -> {
+  String partitionStatName = entry.getKey();
+  List writeStats = entry.getValue();
+
+  String partition = 
partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : 
partitionStatName;
+
+  HashMap updatedFilesToSizesMapping =
+  writeStats.stream().reduce(new HashMap<>(writeStats.size()),
+  (map, stat) -> {
+String pathWithPartition = stat.getPath();
+if (pathWithPartition == null) {
+  // Empty partition
+  LOG.warn("Unable to find path in write stat to 
update metadata table " + stat);
+  return map;
+}
+
+int offset = partition.equals(NON_PARTITIONED_NAME)
+? (pathWithPartition.startsWith("/") ? 1 : 0)
+: partition.length() + 1;
+String filename = pathWithPartition.substring(offset);
+
+// Since write-stats are coming in no particular order, if the same
+// file have previously been appended to w/in the txn, we simply pick max
+// of the sizes as reported after every write, since file-sizes are
+// monotonically increasing (ie file-size never goes down, unless deleted)
+map.merge(filename, stat.getFileSizeInBytes(), Math::max);

Review comment:
   It does: the only case where we might provide something other than the file size is `AppendHandle`, and it does set this to the full file size (it's a contract of this API).
   
   
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java#L417
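
   To illustrate why taking the max is safe when every write reports the full file size, here is a small standalone sketch. The `SizeReport` class and `latestSizes` method are hypothetical stand-ins for the write stats and the reduce step in the diff; only `Map.merge` with `Math::max` mirrors the actual change.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class FileSizeMergeSketch {

  private FileSizeMergeSketch() {
  }

  /** Hypothetical stand-in for a write stat: file name plus full file size after that write. */
  public static final class SizeReport {
    public final String filename;
    public final long fileSizeInBytes;

    public SizeReport(String filename, long fileSizeInBytes) {
      this.filename = filename;
      this.fileSizeInBytes = fileSizeInBytes;
    }
  }

  /**
   * Collapses reports into one size per file. Because each report carries the
   * full size and sizes only grow, the largest value is the most recent one,
   * regardless of the order in which reports arrive.
   */
  public static Map<String, Long> latestSizes(List<SizeReport> reports) {
    Map<String, Long> sizes = new HashMap<>();
    for (SizeReport report : reports) {
      sizes.merge(report.filename, report.fileSizeInBytes, Math::max);
    }
    return sizes;
  }
}
```

   Summing the reported sizes instead would overcount a log file that is appended to more than once within the same transaction.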








[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


alexeykudinkin commented on a change in pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r798116440



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
##
@@ -215,19 +215,16 @@ public HoodieMetadataPayload preCombine(HoodieMetadataPayload previousRecord) {
 
 if (filesystemMetadata != null) {
   filesystemMetadata.forEach((filename, fileInfo) -> {
-// If the filename wasnt present then we carry it forward
-if (!combinedFileInfo.containsKey(filename)) {
-  combinedFileInfo.put(filename, fileInfo);
+if (fileInfo.getIsDeleted()) {
+  combinedFileInfo.remove(filename);
 } else {
-  if (fileInfo.getIsDeleted()) {
-// file deletion
-combinedFileInfo.remove(filename);
-  } else {
-// file appends.
-combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) -> {
-  return new HoodieMetadataFileInfo(oldFileInfo.getSize() + newFileInfo.getSize(), false);
-});
-  }
+  // NOTE: There are 2 possible cases here:
+  //- New file is created: in that case we're simply adding its info
+  //- File is appended to (only log-files of MOR tables on supported FS): in that case
+  //  we simply pick the info w/ largest file-size as the most recent one, since file's
+  //  sizes are increasing monotonically (meaning that the larger file-size is more recent one)
+  combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) ->

Review comment:
   Correct
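
   For illustration, the combine logic described in the NOTE above might look roughly like the sketch below. `FileInfoSketch` and `PreCombineSketch` are hypothetical stand-ins, not the Hudi classes. The merge lambda is an assumption derived from the NOTE (keep the entry with the larger size), since the quoted diff is cut off before the lambda body. Seeding the result from the previous record's map is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-file metadata: a size and a deletion marker.
final class FileInfoSketch {
  final long size;
  final boolean isDeleted;

  FileInfoSketch(long size, boolean isDeleted) {
    this.size = size;
    this.isDeleted = isDeleted;
  }
}

class PreCombineSketch {

  // Applies the newer record's file infos on top of the previous record's.
  static Map<String, FileInfoSketch> combine(Map<String, FileInfoSketch> previous,
                                             Map<String, FileInfoSketch> current) {
    Map<String, FileInfoSketch> combined = new HashMap<>(previous);
    current.forEach((filename, info) -> {
      if (info.isDeleted) {
        // Deletion: drop the file from the combined view.
        combined.remove(filename);
      } else {
        // New file: merge() inserts it. Appended file: keep the larger size,
        // assuming sizes only grow so the larger one is the more recent.
        combined.merge(filename, info,
            (oldInfo, newInfo) -> oldInfo.size >= newInfo.size ? oldInfo : newInfo);
      }
    });
    return combined;
  }

  public static void main(String[] args) {
    Map<String, FileInfoSketch> previous = new HashMap<>();
    previous.put(".f1.log.1", new FileInfoSketch(100L, false));
    Map<String, FileInfoSketch> current = new HashMap<>();
    current.put(".f1.log.1", new FileInfoSketch(250L, false));
    System.out.println(combine(previous, current).get(".f1.log.1").size);
  }
}
```

   Because `Map.merge` inserts the value when the key is absent, no separate `containsKey` branch is needed for first-time entries.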








[GitHub] [hudi] yihua commented on pull request #4679: [HUDI-3315] RFC-35 Make Flink writer stream friendly

2022-02-02 Thread GitBox


yihua commented on pull request #4679:
URL: https://github.com/apache/hudi/pull/4679#issuecomment-1028475241


   > cc @yihua could you also please review this from the angle of making the write client abstractions more friendly
   
   I'll review this.






[GitHub] [hudi] alexeykudinkin closed pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`

2022-02-02 Thread GitBox


alexeykudinkin closed pull request #4560:
URL: https://github.com/apache/hudi/pull/4560


   






[GitHub] [hudi] alexeykudinkin commented on pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`

2022-02-02 Thread GitBox


alexeykudinkin commented on pull request #4560:
URL: https://github.com/apache/hudi/pull/4560#issuecomment-1028472992


   Given that we're on a path to eventually deprecate `HoodieRecordPayload`, this is unnecessary.






[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028457685


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
   * e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028469893


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
   * e6c57d02768a5561537546c4380ed141a4a497e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5693)
   
   
   






[jira] [Commented] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-02-02 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486166#comment-17486166
 ] 

Alexey Kudinkin commented on HUDI-1296:
---

I've done high-level scoping of this effort. At a high level, we'd need to:
 # Implement base HFile (Spark-compatible) reader
 ## Similar to ParquetFileFormat.buildReaderWithPartitionValues
 ## Used in MergeOnRead\{Snapshot|Incremental}Relation, passed to HoodieMergeOnReadRDD
 # Modify MergeOnReadSnapshotRelation to not assume the base file format and instead deduce it based on extension
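
As a rough illustration of the last step (deducing the base file format from the file extension), a sketch could look like the following. `BaseFileFormatSketch`, its `Format` enum, and `fromPath` are hypothetical names, not Hudi or Spark APIs, and the extension strings are assumptions. Only the two formats mentioned above are covered.

```java
import java.util.Locale;

class BaseFileFormatSketch {

  // Hypothetical enum; only the formats mentioned in the comment above.
  enum Format { PARQUET, HFILE }

  // Deduces the format from the extension of the file name in the given path.
  static Format fromPath(String path) {
    // Only look at the file name so dots in directory names are ignored.
    String fileName = path.substring(path.lastIndexOf('/') + 1);
    int dot = fileName.lastIndexOf('.');
    if (dot < 0 || dot == fileName.length() - 1) {
      throw new IllegalArgumentException("No file extension in: " + path);
    }
    String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    switch (extension) {
      case "parquet":
        return Format.PARQUET;
      case "hfile":
        return Format.HFILE;
      default:
        throw new IllegalArgumentException("Unsupported base file extension: " + extension);
    }
  }

  public static void main(String[] args) {
    System.out.println(fromPath("2022/02/02/f1_1-0-1_001.parquet"));
  }
}
```

A relation could then pick the matching reader from the returned value instead of assuming Parquet.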

 

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-02-02 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-1296:
--
Story Points: 6  (was: 4)

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>






[GitHub] [hudi] yihua commented on issue #4600: [SUPPORT]When hive queries Hudi data, the query path is wrong

2022-02-02 Thread GitBox


yihua commented on issue #4600:
URL: https://github.com/apache/hudi/issues/4600#issuecomment-1028469418


   As @xiarixiaoyao mentions, compaction should compact log files into base files in formats like Parquet, which can then be read by Hive.  There are different ways to trigger compaction, e.g., inline/sync compaction, the standalone HoodieCompactor, or hudi-cli commands.  @danny0405 do Flink SQL and the Flink writer support compaction?
   
   @gubinjie When you refer to the Kafka connector, do you mean the Flink Kafka connector?  Hudi also provides a Kafka Connect Sink.






[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1028468345


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * 6d402f17d668a5a2d8bec5b8094fad6e997407b8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5688)
   
   
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1028380183


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * c18bd337517c5842f2db1ee0075df19c05fafe91 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5487)
   * 6d402f17d668a5a2d8bec5b8094fad6e997407b8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5688)
   
   
   






[GitHub] [hudi] nsivabalan commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-02-02 Thread GitBox


nsivabalan commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r798108001



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,458 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieValidationException;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+/**
+ * A validator with spark-submit to compare list partitions and list files between metadata table and filesystem
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - CONTINUOUS : This validator will compare the result of listing partitions/listing files between metadata table and filesystem every 10 minutes(default).
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode CONTINUOUS
+ * ```
+ *
+ * 
+ * You can specify the running mode of the validator through `--mode`.
+ * There are 2 modes of the {@link HoodieMetadataTableValidator}:
+ * - ONCE : This validator will compare the result of listing partitions/listing files between metadata table and filesystem only once.
+ * 
+ * Example command:
+ * ```
+ * spark-submit \
+ *  --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
+ *  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
+ *  --master spark://:7077 \
+ *  --driver-memory 1g \
+ *  --executor-memory 1g \
+ *  $HUDI_DIR/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.11.0-SNAPSHOT.jar \
+ *  --base-path basePath \
+ *  --min-validate-interval-seconds 60 \
+ *  --mode ONCE
+ * ```
+ *
+ */
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+  

[GitHub] [hudi] nsivabalan closed pull request #4734: [DO NOT MERGE] Testing CI 1

2022-02-02 Thread GitBox


nsivabalan closed pull request #4734:
URL: https://github.com/apache/hudi/pull/4734


   






[GitHub] [hudi] nsivabalan closed pull request #4732: [DO_NOT_MERGE][WIP] debugging some tests for metadata restore

2022-02-02 Thread GitBox


nsivabalan closed pull request #4732:
URL: https://github.com/apache/hudi/pull/4732


   






[GitHub] [hudi] nsivabalan closed pull request #4736: [DO NOT MERGE] Testing CI 3

2022-02-02 Thread GitBox


nsivabalan closed pull request #4736:
URL: https://github.com/apache/hudi/pull/4736


   






[GitHub] [hudi] nsivabalan closed pull request #4735: [DO NOT MERGE] Testing CI 2

2022-02-02 Thread GitBox


nsivabalan closed pull request #4735:
URL: https://github.com/apache/hudi/pull/4735


   






[GitHub] [hudi] nsivabalan closed pull request #4737: [DO NOT MERGE] Testing CI 4

2022-02-02 Thread GitBox


nsivabalan closed pull request #4737:
URL: https://github.com/apache/hudi/pull/4737


   






[GitHub] [hudi] nsivabalan closed pull request #4738: [DO NOT MERGE] Testing CI 5

2022-02-02 Thread GitBox


nsivabalan closed pull request #4738:
URL: https://github.com/apache/hudi/pull/4738


   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028457463


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
   
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028462344


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5692)
   
   
   






[GitHub] [hudi] nsivabalan commented on a change in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


nsivabalan commented on a change in pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r798098988



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
##
@@ -215,19 +215,16 @@ public HoodieMetadataPayload preCombine(HoodieMetadataPayload previousRecord) {
 
 if (filesystemMetadata != null) {
   filesystemMetadata.forEach((filename, fileInfo) -> {
-// If the filename wasnt present then we carry it forward
-if (!combinedFileInfo.containsKey(filename)) {
-  combinedFileInfo.put(filename, fileInfo);
+if (fileInfo.getIsDeleted()) {
+  combinedFileInfo.remove(filename);
 } else {
-  if (fileInfo.getIsDeleted()) {
-// file deletion
-combinedFileInfo.remove(filename);
-  } else {
-// file appends.
-combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) -> {
-  return new HoodieMetadataFileInfo(oldFileInfo.getSize() + newFileInfo.getSize(), false);
-});
-  }
+  // NOTE: There are 2 possible cases here:
+  //- New file is created: in that case we're simply adding its info
+  //- File is appended to (only log-files of MOR tables on supported FS): in that case
+  //  we simply pick the info w/ largest file-size as the most recent one, since file's
+  //  sizes are increasing monotonically (meaning that the larger file-size is more recent one)
+  combinedFileInfo.merge(filename, fileInfo, (oldFileInfo, newFileInfo) ->

Review comment:
   The merge function takes care of adding an entry the first time, hence the removal of L219 and L220?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -87,40 +89,58 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
* @return a list of metadata table records
*/
   public static List convertMetadataToRecords(HoodieCommitMetadata commitMetadata, String instantTime) {
-List records = new LinkedList<>();
-List allPartitions = new LinkedList<>();
-commitMetadata.getPartitionToWriteStats().forEach((partitionStatName, writeStats) -> {
-  final String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName;
-  allPartitions.add(partition);
-
-  Map newFiles = new HashMap<>(writeStats.size());
-  writeStats.forEach(hoodieWriteStat -> {
-String pathWithPartition = hoodieWriteStat.getPath();
-if (pathWithPartition == null) {
-  // Empty partition
-  LOG.warn("Unable to find path in write stat to update metadata table " + hoodieWriteStat);
-  return;
-}
-
-int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) : partition.length() + 1;
-String filename = pathWithPartition.substring(offset);
-long totalWriteBytes = newFiles.containsKey(filename)
-? newFiles.get(filename) + hoodieWriteStat.getTotalWriteBytes()
-: hoodieWriteStat.getTotalWriteBytes();
-newFiles.put(filename, totalWriteBytes);
-  });
-  // New files added to a partition
-  HoodieRecord record = HoodieMetadataPayload.createPartitionFilesRecord(
-  partition, Option.of(newFiles), Option.empty());
-  records.add(record);
-});
+List records = new ArrayList<>(commitMetadata.getPartitionToWriteStats().size());
+
+// Add record bearing partitions list
+ArrayList partitionsList = new ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet());
+
+records.add(HoodieMetadataPayload.createPartitionListRecord(partitionsList));
+
+// New files added to a partition
+List> updatedFilesRecords =
+commitMetadata.getPartitionToWriteStats().entrySet()
+.stream()
+.map(entry -> {
+  String partitionStatName = entry.getKey();
+  List writeStats = entry.getValue();
+
+  String partition = partitionStatName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionStatName;
+
+  HashMap updatedFilesToSizesMapping =
+  writeStats.stream().reduce(new HashMap<>(writeStats.size()),
+  (map, stat) -> {
+String pathWithPartition = stat.getPath();
+if (pathWithPartition == null) {
+  // Empty partition
+  LOG.warn("Unable to find path in write stat to update metadata table " + stat);
+  return map;
+}
+
+int offset = partition.equals(NON_PARTITIONED_NAME)
+? (pathWithPartition.startsWith("/") ? 1 : 0)
+: 

[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028459357


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
   
   
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028457602


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c UNKNOWN
   
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028457685


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
   * e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
   
   
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028455703


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
   * e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
   
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028457463


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * 34b94b96b03109555201092bfabce21793add437 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
   
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1028457602


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028455516


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
 
   * 34b94b96b03109555201092bfabce21793add437 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
 
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4659: [HUDI-3091] Making SIMPLE index as the default index type

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4659:
URL: https://github.com/apache/hudi/pull/4659#issuecomment-1019743003


   
   ## CI report:
   
   * 647c6a4d9dad7e517c48d70857f9ebd2faf5c57c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on pull request #3866: [HUDI-1430] SparkDataFrameWriteClient

2022-02-02 Thread GitBox


yihua commented on pull request #3866:
URL: https://github.com/apache/hudi/pull/3866#issuecomment-1028456039


   @xushiyan when this is mostly in shape, we can go through the code again.






[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028430624


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028455703


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
 
   * e6c57d02768a5561537546c4380ed141a4a497e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028455516


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
 
   * 34b94b96b03109555201092bfabce21793add437 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
 
   * 7f793744421a5ee304d5dff89d23e1e925bfd1cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028382543


   
   ## CI report:
   
   * 77d11131baabd1c4e3cc2050337daca4df5f6427 UNKNOWN
   * 3d9c2ae28da858d1e8476052c99391015effb7db UNKNOWN
   * 31b0669d7b638bd65a17b22a2ceb772f2627512c UNKNOWN
   * 28a5a4f537544d35dfcd8700a7b97fb7216682ce UNKNOWN
   * c09e228f7cce78a7dbbc394e93b3cf8c6c3d4d5f UNKNOWN
   * 5b8f5819fff8fec34864eb409fd429b95be17b9b UNKNOWN
   * 5d37935bc8bb33260735d782ca560fd59e02f321 UNKNOWN
   * f911d869f50009e5cd9f3fd341c83c732d7531ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5634)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5657)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5659)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5665)
 
   * 34b94b96b03109555201092bfabce21793add437 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5689)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] alexeykudinkin commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


alexeykudinkin commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028453307


   @yihua yeah, it's rebased on master now and ready for another pass






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


alexeykudinkin commented on a change in pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#discussion_r798092122



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimeFileSplit.java
##
@@ -44,9 +44,7 @@
 
   private Option hoodieVirtualKeyInfo = Option.empty();
 
-  public HoodieRealtimeFileSplit() {
-super();

Review comment:
   We don't need to remove it, but there's also no point in keeping it

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -144,28 +204,32 @@
 return rtSplits.toArray(new InputSplit[0]);
   }
 
+  /**
+   * @deprecated will be replaced w/ {@link #getRealtimeSplits(Configuration, 
List)}
+   */
   // get IncrementalRealtimeSplits
-  public static InputSplit[] getIncrementalRealtimeSplits(Configuration conf, 
Stream fileSplits) throws IOException {
+  public static InputSplit[] getIncrementalRealtimeSplits(Configuration conf, 
List fileSplits) throws IOException {
+
checkState(fileSplits.stream().allMatch(HoodieRealtimeInputFormatUtils::doesBelongToIncrementalQuery),
+"All splits have to belong to incremental query");
+
 List rtSplits = new ArrayList<>();
-List fileSplitList = fileSplits.collect(Collectors.toList());
-Set partitionSet = fileSplitList.stream().map(f -> 
f.getPath().getParent()).collect(Collectors.toSet());
+Set partitionSet = fileSplits.stream().map(f -> 
f.getPath().getParent()).collect(Collectors.toSet());
 Map partitionsToMetaClient = 
getTableMetaClientByPartitionPath(conf, partitionSet);
 // Pre process tableConfig from first partition to fetch virtual key info
 Option hoodieVirtualKeyInfo = Option.empty();
 if (partitionSet.size() > 0) {
   hoodieVirtualKeyInfo = 
getHoodieVirtualKeyInfo(partitionsToMetaClient.get(partitionSet.iterator().next()));
 }
 Option finalHoodieVirtualKeyInfo = 
hoodieVirtualKeyInfo;
-fileSplitList.stream().forEach(s -> {
+fileSplits.stream().forEach(s -> {
   // deal with incremental query.
   try {
 if (s instanceof BaseFileWithLogsSplit) {
-  BaseFileWithLogsSplit bs = (BaseFileWithLogsSplit)s;
-  if (bs.getBelongToIncrementalSplit()) {
-rtSplits.add(new HoodieRealtimeFileSplit(bs, bs.getBasePath(), 
bs.getDeltaLogFiles(), bs.getMaxCommitTime(), finalHoodieVirtualKeyInfo));
-  }
+  BaseFileWithLogsSplit bs = unsafeCast(s);
+  rtSplits.add(new HoodieRealtimeFileSplit(bs, bs.getBasePath(), 
bs.getDeltaLogFiles(), bs.getMaxCommitTime(), finalHoodieVirtualKeyInfo));
 } else if (s instanceof RealtimeBootstrapBaseFileSplit) {
-  rtSplits.add(s);
+  RealtimeBootstrapBaseFileSplit bs = unsafeCast(s);

Review comment:
   I see now. Makes sense
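
For illustration only, the pattern in the hunk above can be sketched in a self-contained form: validate the whole list once with a `checkState`-style precondition, then convert every element without a per-element guard. The `Split` class, the local `checkState` helper, and `toRealtimePaths` are hypothetical stand-ins, not Hudi's actual classes. Taking a `List` instead of a `Stream` matters here because the input is traversed twice, which a `Stream` would not allow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalSplitSketch {

  // Hypothetical stand-in for a file split; only the path and the flag matter here.
  static final class Split {
    final String path;
    final boolean incremental;

    Split(String path, boolean incremental) {
      this.path = path;
      this.incremental = incremental;
    }
  }

  // Hypothetical analogue of a checkState-style precondition helper.
  static void checkState(boolean condition, String message) {
    if (!condition) {
      throw new IllegalStateException(message);
    }
  }

  // Validates the whole input up front, then converts every split unconditionally.
  static List<String> toRealtimePaths(List<Split> splits) {
    // First traversal: fail fast if any split does not belong to an incremental query.
    checkState(splits.stream().allMatch(s -> s.incremental),
        "All splits have to belong to incremental query");

    // Second traversal: no per-element filtering is needed after the check above.
    List<String> out = new ArrayList<>();
    for (Split s : splits) {
      out.add(s.path);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Split> splits = Arrays.asList(
        new Split("/tbl/p1/f1", true),
        new Split("/tbl/p1/f2", true));
    System.out.println(toRealtimePaths(splits));
  }
}
```

With this shape, a caller that passes a non-incremental split gets an `IllegalStateException` immediately instead of having that split silently skipped.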

##
File path: 
hudi-common/src/main/scala/org/apache/hudi/HoodieTableFileIndexBase.scala
##
@@ -87,6 +89,12 @@ abstract class HoodieTableFileIndexBase(engineContext: 
HoodieEngineContext,
 
   refresh0()
 
+  /**
+   * Returns latest completed instant as seen by this instance of the 
file-index
+   */
+  def latestCompletedInstant(): Option[HoodieInstant] =

Review comment:
   It's definitely closer to the former

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##
@@ -65,11 +65,71 @@
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.TypeUtils.unsafeCast;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
 public class HoodieRealtimeInputFormatUtils extends HoodieInputFormatUtils {
 
   private static final Logger LOG = 
LogManager.getLogger(HoodieRealtimeInputFormatUtils.class);
 
-  public static InputSplit[] getRealtimeSplits(Configuration conf, 
Stream fileSplits) {
+  public static InputSplit[] getRealtimeSplits(Configuration conf, 
List fileSplits) throws IOException {
+if (fileSplits.isEmpty()) {
+  return new InputSplit[0];
+}
+
+FileSplit fileSplit = fileSplits.get(0);
+
+// Pre-process table-config to fetch virtual key info
+Path partitionPath = fileSplit.getPath().getParent();
+HoodieTableMetaClient metaClient = 
getTableMetaClientForBasePathUnchecked(conf, partitionPath);
+
+Option hoodieVirtualKeyInfoOpt = 
getHoodieVirtualKeyInfo(metaClient);
+
+// NOTE: This timeline is kept in sync w/ {@code HoodieTableFileIndexBase}
+HoodieInstant latestCommitInstant =
+
metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().lastInstant().get();
+
+InputSplit[] finalSplits = fileSplits.stream()
+  .map(split -> {
+// There are 4 types of splits could we have to handle here
+//- {@code BootstrapBaseFileSplit}: in case base file does have 
associated bootstrap file,
+//  but does NOT have any log files appended (convert it 

[GitHub] [hudi] yihua commented on pull request #4556: [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s

2022-02-02 Thread GitBox


yihua commented on pull request #4556:
URL: https://github.com/apache/hudi/pull/4556#issuecomment-1028442601


   @alexeykudinkin is this PR ready for another look, or are you still
addressing comments?






[GitHub] [hudi] hudi-bot commented on pull request #4667: [HUDI-3276][Stacked on 4559] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat`

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4667:
URL: https://github.com/apache/hudi/pull/4667#issuecomment-1028440018


   
   ## CI report:
   
   * ed1df9c2c6a5c79a5b450cf37e783fddfe861d35 UNKNOWN
   * 29733a0d437997485b21327a6c256233e35c4d3b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4667: [HUDI-3276][Stacked on 4559] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat`

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4667:
URL: https://github.com/apache/hudi/pull/4667#issuecomment-1028372936


   
   ## CI report:
   
   * ed1df9c2c6a5c79a5b450cf37e783fddfe861d35 UNKNOWN
   * b504aa798fa399e7b162203627490f9090656a32 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5495)
 
   * 29733a0d437997485b21327a6c256233e35c4d3b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5686)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on pull request #4078: [HUDI-2833][Design] Merge small archive files instead of expanding indefinitely.

2022-02-02 Thread GitBox


yihua commented on pull request #4078:
URL: https://github.com/apache/hudi/pull/4078#issuecomment-1028439247


   cc @vinothchandar this PR adds new functionality in archived timeline with a 
feature flag and a piece of error handling logic which cannot be feature 
flagged.  You may want to take another look.






[GitHub] [hudi] hudi-bot commented on pull request #4669: [HUDI-3239][Stacked on 4667] Convert `BaseHoodieTableFileIndex` to Java

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4669:
URL: https://github.com/apache/hudi/pull/4669#issuecomment-1028436169


   
   ## CI report:
   
   * 54e68ec73aa7292556d5f00c90859a776acc109f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4669: [HUDI-3239][Stacked on 4667] Convert `BaseHoodieTableFileIndex` to Java

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4669:
URL: https://github.com/apache/hudi/pull/4669#issuecomment-1028372962


   
   ## CI report:
   
   * 851e61cdb31748501d28d638bc192ccc4955d665 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5494)
 
   * 54e68ec73aa7292556d5f00c90859a776acc109f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5687)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yihua commented on pull request #4346: [HUDI-3045] New clustering regex match config to choose partitions when building clustering plan

2022-02-02 Thread GitBox


yihua commented on pull request #4346:
URL: https://github.com/apache/hudi/pull/4346#issuecomment-1028433650


   > @yihua I feel it would be better to add a new option in 
`ClusteringPlanPartitionFilterMode` rather than doing regex in place.
   
   Yes, that could allow more flexible filtering.  @zhangyue19921010 @YuweiXiao 
do either of you want to take a stab at this before 0.11.0 release?






[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028430746


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028432638


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5691)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1028430746


   
   ## CI report:
   
   * 5967651c87dc1a020e82a9a92de0f20ebeefb785 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot removed a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028399211


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-02-02 Thread GitBox


hudi-bot commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1028430624


   
   ## CI report:
   
   * fc6f1f4af2201fb5541aeae70c745e7b6dc3981e UNKNOWN
   * 2fa66c4290ee2555973e29934ca7ecb8e4a0e709 UNKNOWN
   * 2ad2dcddf911dea6b73d6fb7a394dbbb5297 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5684)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5690)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





