[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049580652


   
   ## CI report:
   
   * 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049544024


   
   ## CI report:
   
   * 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049556727


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   * d8cb63488cb5331624fdefe82e4e88d5af1c18a9 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049580454


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   * d8cb63488cb5331624fdefe82e4e88d5af1c18a9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6267)
 
   
   




[GitHub] [hudi] vinothchandar commented on a change in pull request #4789: [HUDI-1296] Support Metadata Table in Spark Datasource

2022-02-23 Thread GitBox


vinothchandar commented on a change in pull request #4789:
URL: https://github.com/apache/hudi/pull/4789#discussion_r813615713



##
File path: 
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieAvroDeserializerTrait.scala
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro
+
+/**
+ * Deserializes Avro payload into Catalyst object
+ *
+ * NOTE: This is low-level component operating on Spark internal data-types (comprising [[InternalRow]]).
+ *       If you're looking to convert Avro into "deserialized" [[Row]] (comprised of Java native types),
+ *       please check [[AvroConversionUtils]]
+ */
+trait HoodieAvroDeserializerTrait {
+  final def deserialize(data: Any): Option[Any] =
+    doDeserialize(data) match {
+      case opt: Option[_] => opt    // As of Spark 3.1, this will return data wrapped with Option, so we fetch the data

Review comment:
   If there is code specific to Spark versions, can you move it into the 
adapters? How does this work with Spark 2.x?

##
File path: 
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##
@@ -1,380 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi
-
-import java.nio.ByteBuffer
-import java.sql.{Date, Timestamp}
-import java.time.Instant
-
-import org.apache.avro.Conversions.DecimalConversion
-import org.apache.avro.LogicalTypes.{TimestampMicros, TimestampMillis}
-import org.apache.avro.Schema.Type._
-import org.apache.avro.generic.GenericData.{Fixed, Record}
-import org.apache.avro.generic.{GenericData, GenericFixed, GenericRecord}
-import org.apache.avro.{LogicalTypes, Schema}
-
-import org.apache.spark.sql.Row
-import org.apache.spark.sql.avro.SchemaConverters
-import org.apache.spark.sql.catalyst.expressions.GenericRow
-import org.apache.spark.sql.catalyst.util.DateTimeUtils
-import org.apache.spark.sql.types._
-
-import org.apache.hudi.AvroConversionUtils._
-import org.apache.hudi.exception.HoodieIncompatibleSchemaException
-
-import scala.collection.JavaConverters._
-
-object AvroConversionHelper {
-
-  private def createDecimal(decimal: java.math.BigDecimal, precision: Int, scale: Int): Decimal = {
-    if (precision <= Decimal.MAX_LONG_DIGITS) {
-      // Constructs a `Decimal` with an unscaled `Long` value if possible.
-      Decimal(decimal.unscaledValue().longValue(), precision, scale)
-    } else {
-      // Otherwise, resorts to an unscaled `BigInteger` instead.
-      Decimal(decimal, precision, scale)
-    }
-  }
-
-  /**
-    *
-    * Returns a converter function to convert row in avro format to GenericRow of catalyst.
-    *
-    * @param sourceAvroSchema Source schema before conversion inferred from avro file by passed in
-    *                         by user.
-    * @param targetSqlType    Target catalyst sql type after the conversion.
-    * @return returns a converter function to convert row in avro format to GenericRow of catalyst.
-    */
-  def createConverterToRow(sourceAvroSchema: Schema,

Review comment:
   What happens to all this code? Has it been consolidated? If so, are we sure 
there are no subtle differences between the different conversion implementations?

##
File path: 
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##
@@ -18,20 +18,105 @@
 
 package org.apache.hudi
 
-imp

[GitHub] [hudi] LinMingQiang commented on pull request #4724: [HUDI-2815] add partial overwrite payload to support partial overwrit…

2022-02-23 Thread GitBox


LinMingQiang commented on pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#issuecomment-1049573640


   Do I need to modify the preCombine in the 
HoodieMergedLogRecordScanner.processNextRecord method? What I understand is 
that when we read a log file, we need to do deduplication and also call 
preCombine.
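
   To make the question concrete, here is a minimal sketch of what "deduplicate and call preCombine" could look like while scanning log records. It is not Hudi's actual `HoodieMergedLogRecordScanner` code: the `Rec` class, its `preCombine` rule (larger ordering value wins, ties keep the newer record) and the `scan` helper are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreCombineSketch {

  /** Hypothetical record: a key, an ordering value, and a payload. */
  static final class Rec {
    final String key;
    final long orderingVal;
    final String value;

    Rec(String key, long orderingVal, String value) {
      this.key = key;
      this.orderingVal = orderingVal;
      this.value = value;
    }

    /** Assumed rule: the larger ordering value wins; on a tie the newer record (this) wins. */
    Rec preCombine(Rec older) {
      return older.orderingVal > this.orderingVal ? older : this;
    }
  }

  /** Deduplicates records by key, resolving each collision through preCombine. */
  static Map<String, Rec> scan(List<Rec> logRecords) {
    Map<String, Rec> merged = new LinkedHashMap<>();
    for (Rec next : logRecords) {
      Rec existing = merged.get(next.key);
      merged.put(next.key, existing == null ? next : next.preCombine(existing));
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Rec> merged = scan(Arrays.asList(
        new Rec("k1", 1L, "a"),
        new Rec("k1", 3L, "b"),
        new Rec("k1", 2L, "c"),
        new Rec("k2", 1L, "d")));
    System.out.println(merged.get("k1").value); // prints "b"
    System.out.println(merged.size());          // prints 2
  }
}
```

   In this sketch a late-arriving record with a smaller ordering value ("c") does not overwrite the one already kept ("b"), which is the behavior the question is about.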






[GitHub] [hudi] Guanpx closed issue #4658: [SUPPORT] Data lose with Flink write COW insert table, Flink web UI show Records Received was different with HIVE count(1)

2022-02-23 Thread GitBox


Guanpx closed issue #4658:
URL: https://github.com/apache/hudi/issues/4658


   






[GitHub] [hudi] Guanpx commented on issue #4658: [SUPPORT] Data lose with Flink write COW insert table, Flink web UI show Records Received was different with HIVE count(1)

2022-02-23 Thread GitBox


Guanpx commented on issue #4658:
URL: https://github.com/apache/hudi/issues/4658#issuecomment-1049571213


   I should have used 'write.operation' = 'insert',
   but my code set 'hoodie.datasource.write.operation' = 'insert', so Hudi fell 
back to the default config, 'write.operation' = 'upsert'. Closing this issue.
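
   The effect described above can be reproduced with a plain option lookup. This is only a sketch: the key `write.operation` and its default `upsert` come from the comment, while the class and method names are hypothetical and do not represent Hudi's real option-parsing code.

```java
import java.util.HashMap;
import java.util.Map;

public class WriteOperationOptionSketch {

  // Key and default value as stated in the comment above.
  static final String WRITE_OPERATION_KEY = "write.operation";
  static final String DEFAULT_WRITE_OPERATION = "upsert";

  /** Hypothetical lookup: an option under any other key is simply never read. */
  static String resolveWriteOperation(Map<String, String> tableOptions) {
    return tableOptions.getOrDefault(WRITE_OPERATION_KEY, DEFAULT_WRITE_OPERATION);
  }

  public static void main(String[] args) {
    Map<String, String> wrongKey = new HashMap<>();
    wrongKey.put("hoodie.datasource.write.operation", "insert");

    Map<String, String> rightKey = new HashMap<>();
    rightKey.put("write.operation", "insert");

    System.out.println(resolveWriteOperation(wrongKey)); // prints "upsert"
    System.out.println(resolveWriteOperation(rightKey)); // prints "insert"
  }
}
```

   With the wrong key the job silently upserts, so records sharing a key are merged, which would explain a lower count(1) than the number of records received.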
   






[jira] [Closed] (HUDI-1869) Upgrading Spark3 To 3.1

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-1869.


> Upgrading Spark3 To 3.1
> ---
>
>                Key: HUDI-1869
>                URL: https://issues.apache.org/jira/browse/HUDI-1869
>            Project: Apache Hudi
>         Issue Type: Task
>         Components: spark
>           Reporter: pengzhiwei
>           Assignee: Yann Byron
>           Priority: Blocker
>             Labels: pull-request-available
>            Fix For: 0.10.0
>
>
> Spark 3.1 changed the behavior of some internal classes and interfaces in 
> both the spark-sql and spark-core modules.
> Hudi currently fails to compile against Spark 3.1. We need to add SQL 
> support for Spark 3.1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-1832) Support Hoodie CLI Command In Spark SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-1832.

Resolution: Duplicate

These will be supported by the call procedure command.

> Support Hoodie CLI Command In Spark SQL
> ---
>
>                Key: HUDI-1832
>                URL: https://issues.apache.org/jira/browse/HUDI-1832
>            Project: Apache Hudi
>         Issue Type: Sub-task
>         Components: spark
>           Reporter: pengzhiwei
>           Assignee: Yann Byron
>           Priority: Major
>
> Move the Hoodie CLI commands to Spark SQL. The syntax is as follows:
> {code:java}
> CLI_COMMAND [ (param_key1 = value1, param_key2 = value2...) ]
> {code}
> e.g.
> {code:java}
> commits show
> commit showfiles (commit = ‘20210114221306’, limit = 10)
> show rollbacks
> savepoint create (commit = ‘20210114221306’)
> {code}





[GitHub] [hudi] hudi-bot removed a comment on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049507780


   
   ## CI report:
   
   * 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049556954


   
   ## CI report:
   
   * 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049465789


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049556727


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   * d8cb63488cb5331624fdefe82e4e88d5af1c18a9 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049525991


   
   ## CI report:
   
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   * 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049554997


   
   ## CI report:
   
   * 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4898: rdd unpersist optimization

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049549978


   
   ## CI report:
   
   * d8b342602971c50db10d090945f8d2e757951852 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6266)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4898: rdd unpersist optimization

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049548486


   
   ## CI report:
   
   * d8b342602971c50db10d090945f8d2e757951852 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4898: rdd unpersist optimization

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049548486


   
   ## CI report:
   
   * d8b342602971c50db10d090945f8d2e757951852 UNKNOWN
   
   




[jira] [Closed] (HUDI-2482) Support drop partitions SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2482.


> Support drop partitions SQL
> ---
>
>                Key: HUDI-2482
>                URL: https://issues.apache.org/jira/browse/HUDI-2482
>            Project: Apache Hudi
>         Issue Type: Task
>         Components: spark
>           Reporter: Yann Byron
>           Assignee: Yann Byron
>           Priority: Major
>             Labels: features, pull-request-available
>            Fix For: 0.10.0
>
>






[jira] [Closed] (HUDI-2456) Support show partitions SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2456.


> Support show partitions SQL
> ---
>
>                Key: HUDI-2456
>                URL: https://issues.apache.org/jira/browse/HUDI-2456
>            Project: Apache Hudi
>         Issue Type: Task
>         Components: spark
>           Reporter: Yann Byron
>           Assignee: Yann Byron
>           Priority: Major
>             Labels: features, pull-request-available
>            Fix For: 0.10.0
>
>
> Spark SQL supports the following syntax to show a Hudi table's partitions.
> {code:java}
> SHOW PARTITIONS tableIdentifier partitionSpec?{code}
>  





[GitHub] [hudi] scxwhite opened a new pull request #4898: rdd unpersist optimization

2022-02-23 Thread GitBox


scxwhite opened a new pull request #4898:
URL: https://github.com/apache/hudi/pull/4898


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   When we start multiple Hudi jobs in one SparkSession, or cache some RDDs 
before a Hudi job starts, SparkRDDWriteClient#releaseResources releases all 
persisted RDDs, forcing Spark to recompute them.
   So I think we should add a switch that controls whether all RDDs are released.
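
   As a rough illustration of the proposed switch, the sketch below models the session-wide set of persisted RDDs as a plain set of ids, so it runs without Spark. Everything here is hypothetical: the `releaseAll` flag name, the id sets, and in particular the assumption that with the switch off only the RDDs persisted by the current job are released. The pull request text does not specify that behavior.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ReleaseResourcesSketch {

  /**
   * Removes ids from the session-wide persisted set.
   * releaseAll == true mirrors the behavior described above (everything is released);
   * releaseAll == false is an assumed alternative that releases only this job's own ids.
   */
  static void releaseResources(Set<Integer> sessionPersisted,
                               Set<Integer> persistedByThisJob,
                               boolean releaseAll) {
    if (releaseAll) {
      sessionPersisted.clear();
    } else {
      sessionPersisted.removeAll(persistedByThisJob);
    }
  }

  public static void main(String[] args) {
    // Ids 1 and 2 stand for RDDs cached by the user, id 3 for one cached by the job.
    Set<Integer> session = new LinkedHashSet<>(Arrays.asList(1, 2, 3));
    Set<Integer> job = new LinkedHashSet<>(Arrays.asList(3));

    releaseResources(session, job, false);
    System.out.println(session); // prints [1, 2]

    releaseResources(session, job, true);
    System.out.println(session); // prints []
  }
}
```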
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[jira] [Closed] (HUDI-2538) Persist configs to hoodie.properties on the first write

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2538.


> Persist configs to hoodie.properties on the first write
> ---
>
>                Key: HUDI-2538
>                URL: https://issues.apache.org/jira/browse/HUDI-2538
>            Project: Apache Hudi
>         Issue Type: Improvement
>         Components: writer-core
>           Reporter: Yann Byron
>           Assignee: Yann Byron
>           Priority: Major
>             Labels: pull-request-available
>            Fix For: 0.10.0
>
>
> Some configs, like `keygenerator.class`, `hive_style_partitioning`, and 
> `partitionpath.urlencode`, should be persisted to hoodie.properties when data 
> is written for the first time. Otherwise, inconsistent behavior can occur. 
> Subsequent write operations then do not need to provide these configs. If the 
> provided configs don't match the existing ones, an exception is raised. 
> This is also useful for resolving some of the keyGenerator discrepancies 
> between the DataFrame writer and SQL.
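
The persist-then-validate behavior described in this issue can be sketched as follows. This is an assumption-laden illustration, not Hudi's implementation: a `java.util.Properties` object stands in for hoodie.properties, the key names are the abbreviated ones used in the description, and `PersistedConfigSketch` and `resolve` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class PersistedConfigSketch {

  // Abbreviated key names as they appear in the issue description.
  static final List<String> PERSISTED_KEYS = Arrays.asList(
      "keygenerator.class", "hive_style_partitioning", "partitionpath.urlencode");

  /**
   * Persists tracked configs on the first write, fills them in when a later
   * write omits them, and rejects a later write that contradicts them.
   */
  static Map<String, String> resolve(Properties tableProps, Map<String, String> provided) {
    Map<String, String> effective = new HashMap<>(provided);
    for (String key : PERSISTED_KEYS) {
      String stored = tableProps.getProperty(key);
      String given = provided.get(key);
      if (stored == null) {
        if (given != null) {
          tableProps.setProperty(key, given); // first write: persist the value
        }
      } else if (given == null) {
        effective.put(key, stored);           // later write: reuse the persisted value
      } else if (!stored.equals(given)) {
        throw new IllegalArgumentException(
            "Config " + key + " is '" + given + "' but table was created with '" + stored + "'");
      }
    }
    return effective;
  }

  public static void main(String[] args) {
    Properties tableProps = new Properties(); // stands in for hoodie.properties

    Map<String, String> firstWrite = new HashMap<>();
    firstWrite.put("hive_style_partitioning", "true");
    resolve(tableProps, firstWrite);

    // A later write that omits the config picks up the persisted value.
    System.out.println(resolve(tableProps, new HashMap<>()).get("hive_style_partitioning")); // prints "true"

    Map<String, String> conflicting = new HashMap<>();
    conflicting.put("hive_style_partitioning", "false");
    try {
      resolve(tableProps, conflicting);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```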





[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4878: [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator

2022-02-23 Thread GitBox


zhangyue19921010 commented on a change in pull request #4878:
URL: https://github.com/apache/hudi/pull/4878#discussion_r813588000



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -500,19 +578,179 @@ public int compare(FileSlice o1, FileSlice o2) {
 }
   }
 
-  public static class HoodieBaseFileCompactor implements Comparator<HoodieBaseFile>, Serializable {
+  public static class HoodieBaseFileComparator implements Comparator<HoodieBaseFile>, Serializable {
 
 @Override
 public int compare(HoodieBaseFile o1, HoodieBaseFile o2) {
   return o1.getPath().compareTo(o2.getPath());
 }
   }
 
-  public static class HoodieFileGroupCompactor implements Comparator<HoodieFileGroup>, Serializable {
+  public static class HoodieFileGroupComparator implements Comparator<HoodieFileGroup>, Serializable {
 
 @Override
 public int compare(HoodieFileGroup o1, HoodieFileGroup o2) {
   return o1.getFileGroupId().compareTo(o2.getFileGroupId());
 }
   }
-}
\ No newline at end of file
+
+  public static class HoodieColumnRangeMetadataComparator
+  implements Comparator>, Serializable {
+
+@Override
+public int compare(HoodieColumnRangeMetadata o1, 
HoodieColumnRangeMetadata o2) {
+  return o1.toString().compareTo(o2.toString());
+}
+  }
+
+  /**
+   * Class for storing relevant information for metadata table validation.
+   * 
+   * If metadata table is disabled, the APIs provide the information, e.g., 
file listing,
+   * index, from the file system and base files.  If metadata table is 
enabled, the APIs
+   * provide the information from the metadata table.  The same API is 
expected to return
+   * the same information regardless of whether metadata table is enabled, 
which is
+   * verified in the {@link HoodieMetadataTableValidator}.
+   */
+  private static class HoodieMetadataValidationContext implements Serializable 
{
+private HoodieTableMetaClient metaClient;
+private HoodieTableFileSystemView fileSystemView;
+private HoodieTableMetadata tableMetadata;
+private boolean enableMetadataTable;
+private List allColumnNameList;
+
+public HoodieMetadataValidationContext(
+HoodieEngineContext engineContext, Config cfg, HoodieTableMetaClient 
metaClient,
+boolean enableMetadataTable) {
+  this.metaClient = metaClient;
+  this.enableMetadataTable = enableMetadataTable;
+  HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
+  .enable(enableMetadataTable)
+  .withMetadataIndexBloomFilter(enableMetadataTable)
+  .withMetadataIndexColumnStats(enableMetadataTable)
+  .withMetadataIndexForAllColumns(enableMetadataTable)
+  .withAssumeDatePartitioning(cfg.assumeDatePartitioning)
+  .build();
+  this.fileSystemView = 
FileSystemViewManager.createInMemoryFileSystemView(engineContext,
+  metaClient, metadataConfig);
+  this.tableMetadata = HoodieTableMetadata.create(engineContext, 
metadataConfig, metaClient.getBasePath(),
+  FileSystemViewStorageConfig.SPILLABLE_DIR.defaultValue());
+  if 
(metaClient.getCommitsTimeline().filterCompletedInstants().countInstants() > 0) 
{
+this.allColumnNameList = getAllColumnNames();
+  }
+}
+
+public List getSortedLatestBaseFileList(String 
partitionPath) {
+  return fileSystemView.getLatestBaseFiles(partitionPath)
+  .sorted(new HoodieBaseFileComparator()).collect(Collectors.toList());
+}
+
+public List getSortedLatestFileSliceList(String partitionPath) {
+  return fileSystemView.getLatestFileSlices(partitionPath)
+  .sorted(new FileSliceComparator()).collect(Collectors.toList());
+}
+
+public List getSortedAllFileGroupList(String 
partitionPath) {
+  return fileSystemView.getAllFileGroups(partitionPath)
+  .sorted(new 
HoodieFileGroupComparator()).collect(Collectors.toList());
+}
+
+public List> getSortedColumnStatsList(
+String partitionPath, List baseFileNameList) {
+  LOG.info("All column names for getting column stats: " + 
allColumnNameList);
+  if (enableMetadataTable) {
+List> partitionFileNameList = 
baseFileNameList.stream()
+.map(filename -> Pair.of(partitionPath, 
filename)).collect(Collectors.toList());
+return allColumnNameList.stream()
+.flatMap(columnName ->
+tableMetadata.getColumnStats(partitionFileNameList, 
columnName).values().stream()
+.map(stats -> new HoodieColumnRangeMetadata<>(
+stats.getFileName(),
+columnName,
+stats.getMinValue(),
+stats.getMaxValue(),
+stats.getNullCount(),
+stats.getValueCount(),
+stats.getTotalSize(),
+stats.getTotalUncompressedSize()))
+.collect(Collectors.

[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4878: [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator

2022-02-23 Thread GitBox


zhangyue19921010 commented on a change in pull request #4878:
URL: https://github.com/apache/hudi/pull/4878#discussion_r813589004



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -438,27 +481,62 @@ private void 
validateLatestBaseFiles(HoodieTableFileSystemView metaFsView, Hoodi
   /**
* Compare getLatestFileSlices between metadata table and fileSystem.
*/
-  private void validateLatestFileSlices(HoodieTableFileSystemView metaFsView, 
HoodieTableFileSystemView fsView, String partitionPath) {
+  private void validateLatestFileSlices(
+  HoodieMetadataValidationContext metadataTableBasedContext,
+  HoodieMetadataValidationContext fsBasedContext, String partitionPath) {
 
-List latestFileSlicesFromMetadataTable = 
metaFsView.getLatestFileSlices(partitionPath).sorted(new 
FileSliceCompactor()).collect(Collectors.toList());
-List latestFileSlicesFromFS = 
fsView.getLatestFileSlices(partitionPath).sorted(new 
FileSliceCompactor()).collect(Collectors.toList());
+List latestFileSlicesFromMetadataTable = 
metadataTableBasedContext.getSortedLatestFileSliceList(partitionPath);
+List latestFileSlicesFromFS = 
fsBasedContext.getSortedLatestFileSliceList(partitionPath);
 
-LOG.info("Latest file list from metadata: " + 
latestFileSlicesFromMetadataTable + ". For partition " + partitionPath);
-LOG.info("Latest file list from direct listing: " + latestFileSlicesFromFS 
+ ". For partition " + partitionPath);
+LOG.debug("Latest file list from metadata: " + 
latestFileSlicesFromMetadataTable + ". For partition " + partitionPath);
+LOG.debug("Latest file list from direct listing: " + 
latestFileSlicesFromFS + ". For partition " + partitionPath);
 
-validateFileSlice(latestFileSlicesFromMetadataTable, 
latestFileSlicesFromFS, partitionPath);
+validate(latestFileSlicesFromMetadataTable, latestFileSlicesFromFS, 
partitionPath, "file slices");
 LOG.info("Validation of getLatestFileSlices succeeded for partition " + 
partitionPath);
   }
 
-  private HoodieTableFileSystemView 
createHoodieTableFileSystemView(HoodieSparkEngineContext engineContext, boolean 
enableMetadataTable) {
+  private void validateAllColumnStats(
+  HoodieMetadataValidationContext metadataTableBasedContext,
+  HoodieMetadataValidationContext fsBasedContext, String partitionPath) {
+List latestBaseFilenameList = 
fsBasedContext.getSortedLatestBaseFileList(partitionPath)
+.stream().map(BaseFile::getFileName).collect(Collectors.toList());
+List> metadataBasedColStats = 
metadataTableBasedContext
+.getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
+List> fsBasedColStats = fsBasedContext
+.getSortedColumnStatsList(partitionPath, latestBaseFilenameList);
 
-HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
-.enable(enableMetadataTable)
-.withAssumeDatePartitioning(cfg.assumeDatePartitioning)
-.build();
+validate(metadataBasedColStats, fsBasedColStats, partitionPath, "column 
stats");
 
-return FileSystemViewManager.createInMemoryFileSystemView(engineContext,
-metaClient, metadataConfig);
+LOG.info("Validation of column stats succeeded for partition " + 
partitionPath);
+  }
+
+  private void validateBloomFilters(
+  HoodieMetadataValidationContext metadataTableBasedContext,
+  HoodieMetadataValidationContext fsBasedContext, String partitionPath) {
+List latestBaseFilenameList = 
fsBasedContext.getSortedLatestBaseFileList(partitionPath)
+.stream().map(BaseFile::getFileName).collect(Collectors.toList());
+List metadataBasedBloomFilters = metadataTableBasedContext

Review comment:
   Same question as before regarding `latestBaseFilenameList`.
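
The diff above routes every comparison through one `validate(...)` call that takes the metadata-table-based list, the file-system-based list, the partition path and a label. A minimal sketch of that pattern follows, assuming both lists are already sorted the same way; the `ValidationError` class and the message format are hypothetical and are not taken from the PR.

```python
from typing import Sequence


class ValidationError(Exception):
    """Hypothetical exception type; the PR may use a different one."""


def validate(from_metadata: Sequence, from_fs: Sequence,
             partition_path: str, label: str) -> None:
    """Raise if the two pre-sorted lists differ in length or content."""
    if list(from_metadata) != list(from_fs):
        raise ValidationError(
            f"Validation of {label} failed for partition {partition_path}: "
            f"metadata table={list(from_metadata)}, file system={list(from_fs)}")
```

Keeping the label as a parameter lets the same check report on file slices, column stats or bloom filters without duplicating the comparison logic.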


[jira] [Comment Edited] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209
 ] 

Yann Byron edited comment on HUDI-3201 at 2/24/22, 6:49 AM:


h4. `hoodie.datasource.write.partitionpath.urlencode` can affect this behavior.


was (Author: biyan900...@gmail.com):
h4. `hoodie.datasource.write.partitionpath.urlencode can affect this behavior.

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209
 ] 

Yann Byron commented on HUDI-3201:
--

h4. `hoodie.datasource.write.partitionpath.urlencode can affect this behavior.

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>






[jira] [Closed] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3201.

Resolution: Fixed

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>






[jira] [Closed] (HUDI-3202) Add keygen to support partition discovery

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3202.

Resolution: Won't Do

It is not necessary to add another keygen for this. The behavior can be controlled 
by `hoodie.datasource.write.partitionpath.urlencode`. Enabling it by 
default can work.
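
As a rough illustration of why URL-encoding the partition path can matter here, the sketch below builds a hive-style partition directory name from a field value. The helper `partition_path` and the sample value are hypothetical, and Python's `urllib.parse.quote` only approximates the Java-side encoding Hudi applies, so treat the exact escaped form as indicative only.

```python
from urllib.parse import quote


def partition_path(field: str, value: str, url_encode: bool) -> str:
    """Build a hive-style partition directory name (hypothetical helper)."""
    # Without encoding, a value containing '/' adds extra directory levels.
    encoded = quote(value, safe="") if url_encode else value
    return f"{field}={encoded}"


raw = partition_path("org_id", "2022/02/23", url_encode=False)
enc = partition_path("org_id", "2022/02/23", url_encode=True)
print(raw)
print(enc)
```

With encoding on, the value stays within a single directory level, which is what a `field=value` style partition layout expects.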

> Add keygen to support partition discovery
> -
>
> Key: HUDI-3202
> URL: https://issues.apache.org/jira/browse/HUDI-3202
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>






[hudi] branch master updated (62605be -> 943b997)

2022-02-23 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 62605be  [HUDI-3480][HUDI-3481] Enchancements to integ test suite 
(#4884)
 add 943b997  [HUDI-3488] The flink small file list should exclude file 
slices with pending compaction (#4893)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[jira] [Comment Edited] (HUDI-3488) The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497156#comment-17497156
 ] 

Danny Chen edited comment on HUDI-3488 at 2/24/22, 6:45 AM:


Fixed via master branch: 943b99775b7b0cf4340a9af76b6dd235cf91e350


was (Author: danny0405):
Fixed via master branch: 3102cf7bbb8f81bc5cc92a01b4f65061c945deea

> The flink small file list should exclude file slices with pending compaction
> 
>
> Key: HUDI-3488
> URL: https://issues.apache.org/jira/browse/HUDI-3488
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yanenze
>Priority: Blocker
>  Labels: flink, hudi, pull-request-available
> Fix For: 0.11.0
>
>
> When async compaction is used with Flink, the small file list found by the 
> bucket assigner misses files that are in pending compaction, so the total 
> size is calculated only as (log file size * compress ratio (0.35)).
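
The size estimate described above can be sketched as follows. This is a simplified model under stated assumptions, not the Flink writer's code: the `FileSlice` fields, `estimated_size` and `small_file_candidates` are hypothetical names, and only the 0.35 ratio comes from the issue description.

```python
from dataclasses import dataclass
from typing import List, Optional

LOG_COMPRESS_RATIO = 0.35  # ratio quoted in the issue description


@dataclass
class FileSlice:
    file_id: str
    base_file_size: Optional[int]  # None when no base file is visible
    log_file_size: int
    pending_compaction: bool


def estimated_size(s: FileSlice) -> float:
    # Base file size (if any) plus the log size scaled by the compress ratio.
    return (s.base_file_size or 0) + s.log_file_size * LOG_COMPRESS_RATIO


def small_file_candidates(slices: List[FileSlice], limit: int) -> List[str]:
    # Skip slices with a pending compaction: per the issue, their estimate
    # covers only the log files, so they can look smaller than they are.
    return [s.file_id for s in slices
            if not s.pending_compaction and estimated_size(s) < limit]
```

Excluding the pending slices up front avoids assigning new records to a file whose real size is not yet known.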





[GitHub] [hudi] danny0405 merged pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


danny0405 merged pull request #4893:
URL: https://github.com/apache/hudi/pull/4893


   






[GitHub] [hudi] yanenze edited a comment on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


yanenze edited a comment on pull request #4879:
URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049543930


   > Hello, can you re-submit the PR against the master branch? I didn't notice that you 
submitted the PR against release-0.10.1.
   
   Hello, I have re-submitted it as PR #4893.






[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049544024


   
   ## CI report:
   
   * 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049542836


   
   ## CI report:
   
   * 3e08c375dd84084f1cf54fd35417deec4602ba1d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] yanenze commented on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


yanenze commented on pull request #4879:
URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049543930


   > Hello, can you re-submit the PR against the master branch? I didn't notice that you 
submitted the PR against release-0.10.1.
   
   Hello, I have committed it in PR #4893.






[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049542836


   
   ## CI report:
   
   * 3e08c375dd84084f1cf54fd35417deec4602ba1d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Updated] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x

2022-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3341:
-
Labels: HUDI-bug pull-request-available  (was: HUDI-bug)

> Investigate that metadata table cannot be read for hadoop-aws 2.7.x
> ---
>
> Key: HUDI-3341
> URL: https://issues.apache.org/jira/browse/HUDI-3341
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug, pull-request-available
> Fix For: 0.11.0
>
>
> Environment: spark 2.4.4 + aws-java-sdk-1.7.4 + hadoop-aws-2.7.4, Hudi 
> 0.11.0-SNAPSHOT, metadata table enabled
> On the write path, the ingestion is successful with metadata table updated.  
> When trying to read the metadata table for listing, e.g., using hudi-cli, the 
> operation fails with the following exception.
> {code:java}
> Failed to retrieve list of partition from metadata
> org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of 
> partition from metadata
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:110)
>     at 
> org.apache.hudi.cli.commands.MetadataCommand.listPartitions(MetadataCommand.java:208)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>     at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>     at 
> org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
>     at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
> log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:334)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:179)
>     at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:103)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:71)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:51)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:246)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:376)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$4(HoodieBackedTableMetadata.java:292)
>     at 
> java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:282)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$0(HoodieBackedTableMetadata.java:138)
>     at java.util.HashMap.forEach(HashMap.java:1289)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:137)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:127)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:275)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:108)
>     ... 12 more
> Caused by: org.apache.hudi.exception.HoodieIOException: IOException when 
> reading logblock from log file 
> HoodieLogFile{pathStr='s3a://hudi-testing/metadata_test_table_2/.hoodie/metadata/files/.files-_00.log.1_0-0-0',
>  fileLen=-1}
>     at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:375)
>     at 
> org.apache.hudi.common.table.log.HoodieLogFormatReader.next(HoodieLogFormatReader.java:120)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:211)
>     ... 27 more
> Caused by: java.io.IOException: Attempted read on closed str

[GitHub] [hudi] yihua opened a new pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x

2022-02-23 Thread GitBox


yihua opened a new pull request #4897:
URL: https://github.com/apache/hudi/pull/4897


   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049499769


   
   ## CI report:
   
   * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049536716


   
   ## CI report:
   
   * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4894:
URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049535340


   
   ## CI report:
   
   * 7643d40678709453d546d59836d6ac4fee21779a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4894:
URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049506313


   
   ## CI report:
   
   * 7643d40678709453d546d59836d6ac4fee21779a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049525991


   
   ## CI report:
   
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   * 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049524299


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   * 04cab32dafc945234ad9876b940fa27aebb3f69f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049524299


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   * 04cab32dafc945234ad9876b940fa27aebb3f69f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049518789


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] Gatsby-Lee opened a new issue #4896: [SUPPORT] Metadata Table causes missing data.

2022-02-23 Thread GitBox


Gatsby-Lee opened a new issue #4896:
URL: https://github.com/apache/hudi/issues/4896


   **Describe the problem you faced**
   
   Regardless of the table type (CoW, MoR), I notice missing data when the 
Metadata Table is enabled.
   
   For example, if I ingest 100,000 records (no duplicates) with a batch size 
of 10,000, the number of records ingested into Hudi is not 100,000.
   
   I checked the number of records through Amazon Athena and 
double-checked the count by running a Spark job.
   
   **Full Configuration**
   
   ```
   {
     'className': 'org.apache.hudi',
     'hoodie.datasource.hive_sync.database': 'hudi_exp',
     'hoodie.datasource.hive_sync.enable': 'true',
     'hoodie.datasource.hive_sync.support_timestamp': 'true',
     'hoodie.datasource.hive_sync.table': 'hudi_etl_exp',
     'hoodie.datasource.hive_sync.use_jdbc': 'false',
     'hoodie.datasource.write.hive_style_partitioning': 'true',
     'hoodie.datasource.write.partitionpath.field': 'org_id',
     'hoodie.datasource.write.recordkey.field': 'obj_id',
     'hoodie.table.name': 'hudi_etl_exp',
     'hoodie.bulkinsert.shuffle.parallelism': '24',
     'hoodie.delete.shuffle.parallelism': '24',
     'hoodie.insert.shuffle.parallelism': '24',
     'hoodie.upsert.shuffle.parallelism': '24',
     'hoodie.index.type': 'BLOOM',
     'hoodie.bloom.index.prune.by.ranges': 'true',
     'hoodie.datasource.clustering.async.enable': 'false',
     'hoodie.datasource.clustering.inline.enable': 'false',
     'hoodie.datasource.compaction.async.enable': 'false',
     'hoodie.clean.automatic': 'true',
     'hoodie.clean.async': 'true',
     'hoodie.keep.max.commits': 40,
     'hoodie.keep.min.commits': 30,
     'hoodie.cleaner.commits.retained': 20,
     'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
     'hoodie.compact.inline': 'false',
     'hoodie.clustering.async.enabled': 'false',
     'hoodie.clustering.async.max.commits': 4,
     'hoodie.clustering.inline': 'false',
     'hoodie.metadata.clean.async': 'true',
     'hoodie.cleaner.policy.failed.writes': 'LAZY',
     'hoodie.write.concurrency.mode': 'OPTIMISTIC_CONCURRENCY_CONTROL',
     'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
     'hoodie.write.lock.zookeeper.port': '2181',
     'hoodie.write.lock.zookeeper.url': 'zookeeper_url',
     'hoodie.write.lock.zookeeper.base_path': 'zookeeper_base_path',
     'hoodie.write.lock.zookeeper.lock_key': 'hudi_etl_exp',
     'path': 's3://hello-hudi/hudi_exp/hudi_etl_exp',
     'hoodie.datasource.write.precombine.field': '_etl_cluster_ts',
     'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
     'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
     'hoodie.datasource.hive_sync.partition_fields': 'org_id',
     'hoodie.combine.before.upsert': 'true',
     'hoodie.datasource.write.operation': 'upsert',
     'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
     'hoodie.table.type': 'COPY_ON_WRITE',
     'hoodie.metadata.enable': 'true'
   }
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Generate 100 random records
   2. Ingest 10 records per batch
   3. Count the number of ingested records after each batch (10, 20, 30)
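
   The steps above can be sketched as a small harness that tracks the expected cumulative count after each batch. This is only an outline under assumptions: `make_records`, `batches` and `run` are hypothetical helpers, and the in-memory dict stands in for the Hudi table, so the real write and count calls would have to be plugged in through the two callbacks.

```python
import uuid
from typing import Callable, Dict, Iterator, List, Tuple

Record = Dict[str, str]


def make_records(n: int) -> List[Record]:
    # Unique obj_id per record, so upserts should never collapse rows.
    return [{"obj_id": uuid.uuid4().hex, "org_id": f"org_{i % 3}"}
            for i in range(n)]


def batches(records: List[Record], size: int) -> Iterator[List[Record]]:
    for start in range(0, len(records), size):
        yield records[start:start + size]


def run(records: List[Record], size: int,
        write_batch: Callable[[List[Record]], None],
        count_rows: Callable[[], int]) -> List[Tuple[int, int]]:
    """Write each batch, then record (expected, observed) row counts."""
    results, expected = [], 0
    for batch in batches(records, size):
        write_batch(batch)
        expected += len(batch)
        results.append((expected, count_rows()))
    return results


# Stand-in for the table: a dict keyed by record key, mimicking upserts.
table: Dict[str, Record] = {}
report = run(make_records(100), 10,
             write_batch=lambda b: table.update({r["obj_id"]: r for r in b}),
             count_rows=lambda: len(table))
print(report[-1])
```

   Any batch where the observed count falls behind the expected count pinpoints when records went missing.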
   
   
   **Expected behavior**
   
   All 100 records should be present in the Hudi table.
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.1-amzn-0
   
   * Hive version : 2.3.7-amzn-4
   
   * Hadoop version : 3.2.1-amzn-3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049517406


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049518789


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049489672


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049517406


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   * 3dd778d4de48d9728846db6264d0e0fc7720d6cb UNKNOWN
   
   




[GitHub] [hudi] Gatsby-Lee commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue

2022-02-23 Thread GitBox


Gatsby-Lee commented on issue #4873:
URL: https://github.com/apache/hudi/issues/4873#issuecomment-1049512318


   @nsivabalan I have a question.
   The reported config contains three fields.
   Do all three fields need to have "timestamp characteristics"?






[GitHub] [hudi] Gatsby-Lee commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue

2022-02-23 Thread GitBox


Gatsby-Lee commented on issue #4873:
URL: https://github.com/apache/hudi/issues/4873#issuecomment-1049511235


   @cafelo-pfdrive it is something that increases incrementally.






[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049507780


   
   ## CI report:
   
   * 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049506331


   
   ## CI report:
   
   * 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4894:
URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049504911


   
   ## CI report:
   
   * 7643d40678709453d546d59836d6ac4fee21779a UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4894:
URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049506313


   
   ## CI report:
   
   * 7643d40678709453d546d59836d6ac4fee21779a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049506331


   
   ## CI report:
   
   * 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 UNKNOWN
   
   




[GitHub] [hudi] BruceKellan commented on issue #4892: [SUPPORT] Rollback files not deleted using spark

2022-02-23 Thread GitBox


BruceKellan commented on issue #4892:
URL: https://github.com/apache/hudi/issues/4892#issuecomment-1049505772


   @nsivabalan Thanks for your reply.
   Do you mean that the archival requirement counts only rollback instants, 
rather than rollback and commit instants together?






[jira] [Updated] (HUDI-3483) Add insert overwrite tests for spark DS

2022-02-23 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-3483:
-
Labels: pull-request-available  (was: )

> Add insert overwrite tests for spark DS
> ---
>
> Key: HUDI-3483
> URL: https://issues.apache.org/jira/browse/HUDI-3483
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] nsivabalan opened a new pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups

2022-02-23 Thread GitBox


nsivabalan opened a new pull request #4895:
URL: https://github.com/apache/hudi/pull/4895


   ## What is the purpose of the pull request
   
   - Added insert override nodes and yamls to integ test suite
   - Minor clean ups
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4894:
URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049504911


   
   ## CI report:
   
   * 7643d40678709453d546d59836d6ac4fee21779a UNKNOWN
   
   




[jira] [Updated] (HUDI-3493) Not table to get execution plan

2022-02-23 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-3493:
-
Labels: pull-request-available  (was: )

> Not table to get execution plan
> ---
>
> Key: HUDI-3493
> URL: https://issues.apache.org/jira/browse/HUDI-3493
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
>
> link to this question
> https://github.com/apache/hudi/issues/4859





[GitHub] [hudi] XuQianJin-Stars opened a new pull request #4894: [HUDI-3493] Not table to get execution plan

2022-02-23 Thread GitBox


XuQianJin-Stars opened a new pull request #4894:
URL: https://github.com/apache/hudi/pull/4894


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   link: [HUDI-3493](https://issues.apache.org/jira/browse/HUDI-3493)
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049498405


   
   ## CI report:
   
   * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049499769


   
   ## CI report:
   
   * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259)
 
   
   




[GitHub] [hudi] nsivabalan commented on issue #4892: [SUPPORT] Rollback files not deleted using spark

2022-02-23 Thread GitBox


nsivabalan commented on issue #4892:
URL: https://github.com/apache/hudi/issues/4892#issuecomment-1049498963


   Rollback files don't get deleted immediately; they have to meet the 
requirement for archival. "hoodie.keep.max.commits" comes into play here: once 
your rollback instant count reaches that threshold, the rollback instants get 
archived.
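
As a rough illustration of the archival settings mentioned above, here is a minimal Python sketch. The three keys are Hudi write configs, but the values are only examples, the helper names (`check_archival_options`, `reaches_archival_threshold`) are hypothetical, and the ordering constraint reflects my understanding of Hudi's validation rather than anything stated in this thread.

```python
# Example values only; tune them for your own table.
archival_options = {
    "hoodie.cleaner.commits.retained": 10,
    "hoodie.keep.min.commits": 20,
    "hoodie.keep.max.commits": 30,
}


def check_archival_options(opts):
    """Check retained < keep.min < keep.max (assumed Hudi constraint)."""
    retained = int(opts["hoodie.cleaner.commits.retained"])
    keep_min = int(opts["hoodie.keep.min.commits"])
    keep_max = int(opts["hoodie.keep.max.commits"])
    if not retained < keep_min < keep_max:
        raise ValueError(
            "expected cleaner.commits.retained < keep.min.commits < keep.max.commits"
        )
    return True


def reaches_archival_threshold(instant_count, opts):
    """Follow the wording of the comment: archive once the count reaches
    hoodie.keep.max.commits. The exact comparison inside Hudi may differ."""
    return instant_count >= int(opts["hoodie.keep.max.commits"])
```

With PySpark, a dictionary like `archival_options` would typically be passed to the writer through `.options(**archival_options)` alongside the other Hudi settings.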
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049498405


   
   ## CI report:
   
   * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 UNKNOWN
   
   




[GitHub] [hudi] yanenze commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


yanenze commented on pull request #4893:
URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049497783


   @danny0405 Hello, I have re-submitted the PR against the master branch.






[GitHub] [hudi] yanenze opened a new pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


yanenze opened a new pull request #4893:
URL: https://github.com/apache/hudi/pull/4893


   …ompaction
   
   # This happens when async compaction has been configured
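
To make the idea in the PR title concrete, the following Python sketch shows one way a small-file list could skip file slices that have a pending compaction. It is not the Flink writer's actual code: `FileSlice`, `small_file_candidates`, and the size limit are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSlice:
    file_id: str
    base_file_size: int  # size in bytes


def small_file_candidates(slices, pending_compaction_ids, small_file_limit):
    """Return slices below the size limit that have no pending compaction."""
    return [
        s for s in slices
        if s.base_file_size < small_file_limit
        and s.file_id not in pending_compaction_ids
    ]


slices = [
    FileSlice("f1", 10 * 1024 * 1024),
    FileSlice("f2", 20 * 1024 * 1024),
    FileSlice("f3", 500 * 1024 * 1024),
]
# f2 is small, but a compaction is already scheduled for it.
candidates = small_file_candidates(slices, {"f2"}, 100 * 1024 * 1024)
```

Only `f1` remains a candidate here: `f2` is excluded because of its pending compaction, and `f3` is above the size limit.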
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] VIKASPATID commented on issue #4635: [SUPPORT] Bulk write failing due to hudi timeline archive exception

2022-02-23 Thread GitBox


VIKASPATID commented on issue #4635:
URL: https://github.com/apache/hudi/issues/4635#issuecomment-1049491206


   Hi @nsivabalan,
   Here is the reproducible code
   
pyspark script 
   
   ```
   
   from pyspark.context import SparkContext
   from pyspark.sql.session import SparkSession
   from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
   from pyspark.sql.types import *
   import time
   from pyspark.sql.functions import lit
   from pyspark.sql.functions import col, when, expr
   import argparse
   import threading
   
   spark = (SparkSession.builder
            .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
            .config('spark.sql.hive.convertMetastoreParquet', 'false')
            .getOrCreate())
   sc = spark.sparkContext
   
   table_name = None
   table_path = None
   
   header = [
       ["A0", "STRING"], ["A1", "STRING"], ["A2", "STRING"], ["A3", "STRING"],
       ["A4", "STRING"], ["A5", "INTEGER"], ["A6", "INTEGER"], ["A7", "SHORT"],
       ["A8", "INTEGER"], ["A9", "LONG"], ["A10", "DOUBLE"], ["A11", "INTEGER"],
       ["A12", "LONG"], ["A13", "DOUBLE"], ["A14", "LONG"], ["A15", "DOUBLE"],
       ["A16", "DOUBLE"], ["A17", "INTEGER"], ["A18", "SHORT"], ["A19", "DOUBLE"],
       ["A20", "INTEGER"], ["A21", "SHORT"], ["A22", "DOUBLE"], ["A23", "STRING"],
       ["A24", "STRING"], ["A25", "INTEGER"], ["A26", "INTEGER"], ["A27", "STRING"],
       ["A28", "INTEGER"], ["A29", "INTEGER"], ["A30", "STRING"], ["A31", "DOUBLE"],
       ["A32", "DOUBLE"], ["A33", "STRING"], ["A34", "DOUBLE"], ["A35", "INTEGER"],
       ["A36", "SHORT"], ["A37", "STRING"], ["A38", "DOUBLE"], ["A39", "STRING"],
       ["A40", "STRING"], ["A41", "STRING"], ["A42", "STRING"], ["A43", "STRING"],
       ["A44", "INTEGER"], ["A45", "LONG"], ["A46", "LONG"], ["A47", "LONG"],
       ["A48", "LONG"], ["A49", "LONG"], ["A50", "LONG"], ["A51", "INTEGER"],
       ["A52", "INTEGER"], ["A53", "INTEGER"], ["A54", "INTEGER"], ["A55", "INTEGER"],
       ["A56", "DOUBLE"], ["A57", "DOUBLE"], ["A58", "DOUBLE"], ["A59", "DOUBLE"],
       ["A60", "LONG"], ["A61", "STRING"], ["A62", "DOUBLE"], ["A63", "STRING"],
       ["A64", "DOUBLE"], ["A65", "DOUBLE"], ["A66", "LONG"], ["A67", "LONG"],
   ]
   
   common_config = {
   'className' : 'org.apache.hudi',
   'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
   'hoodie.write.lock.provider':'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
   'hoodie.cleaner.policy.failed.writes':'LAZY',
   'hoodie.write.lock.zookeeper.url':'xxx',
   'hoodie.write.lock.zookeeper.port':'2181',
   'hoodie.write.lock.zookeeper.lock_key': f"{table_name}",
   'hoodie.write.lock.zookeeper.base_path':'/hudi',
   'hoodie.datasource.write.row.writer.enable': 'false',
   'hoodie.table.name': table_name,
   'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
   'hoodie.datasource.write.recordkey.field': 'A1,A9',
   'hoodie.datasource.write.partitionpath.field': 'A2,A5',
   'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
   'hoodie.datasource.write.precombine.field': "A5",
   'hoodie.datasource.hive_sync.use_jdbc': 'false',
   'hoodie.datasource.hive_sync.enable': 'false',
   'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
   'hoodie.datasource.hive_sync.table': f"{table_name}",
   'hoodie.datasource.hive_sync.partition_fields': 'A2,A5',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
   'hoodie.copyonwrite.record.size.estimate': 256,
   'hoodie.write.lock.client.wait_time_ms': 1000,
   'hoodie.write.lock.client.num_retries': 50
   }
   
   init_load_config = {
   'hoodie.parquet.max.file.size': 1024*1024*1024,
   'hoodie.bulkinsert.shuffle.parallelism': 10,
   'compactionSmallFileSize': 100*1024*1024,
   'hoodie.datasource.write.operation': 'bulk_insert',
   'hoodie.write.markers.type': "DIRECT"
   #'hoodie.compact.inline': True
   # 'hoodie.datasource.write.insert.drop.duplicates' : 'true'
   }
   
   increamental_config = {
   'hoodie.upsert.shuffle.parallelism': 1,
   'hoodie.insert.shuffle.parallelism': 1,
   'hoodie.cleaner.commits.retained': 1,
   'hoodie.clean.automatic': True
   }
   
   def get_parameters():
       parser = argparse.ArgumentParser(
           description='Usage: --table_path= --table_name=')
       parser.add_argument('--table_path', help='table_path', required=True)
       parser.add_argument('--table_name', help='table_name', required=True)
       (args, unknown) = parser.parse_known_args()
       return args
   
   def main():
       global table_path
       global table_name
   
       params = get_parameters()
       table_path = params.table_path
       table_name = params.table_name
       common_config['hoodie.table.name'] = table_name
       common_config['hoodie.datasource.hive_syn

[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049467912


   
   ## CI report:
   
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049489672


   
   ## CI report:
   
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1049456121


   
   ## CI report:
   
   * 11f1b688459ab9017ebde2a38d1645e0f59b50c3 UNKNOWN
   * c243f70d774b7ecb059dad4bb03870b2c2d4436b UNKNOWN
   * e1771f831ae0a59baf39497b337fe304be901149 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6253)
 
   * 2790e24a229e808602113c7ed80932b09e56c8fd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6255)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4739:
URL: https://github.com/apache/hudi/pull/4739#issuecomment-1049487198


   
   ## CI report:
   
   * 11f1b688459ab9017ebde2a38d1645e0f59b50c3 UNKNOWN
   * c243f70d774b7ecb059dad4bb03870b2c2d4436b UNKNOWN
   * 2790e24a229e808602113c7ed80932b09e56c8fd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6255)
 
   
   




[hudi] branch release-0.10.1 updated (3102cf7 -> 84fb390)

2022-02-23 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch release-0.10.1
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 3102cf7  [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)

This update removed existing revisions from the reference, leaving the
reference pointing at a previous point in the repository history.

 * -- * -- N   refs/heads/release-0.10.1 (84fb390)
\
 O -- O -- O   (3102cf7)

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[GitHub] [hudi] danny0405 commented on a change in pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


danny0405 commented on a change in pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#discussion_r812620201



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
##
@@ -475,12 +475,12 @@ private void writeToBuffer(HoodieRecord record) {
 }
 Option indexedRecord = getIndexedRecord(record);
 if (indexedRecord.isPresent()) {
-  // Skip the Ignore Record.
+  // Skip the ignored record.
   if (!indexedRecord.get().equals(IGNORE_RECORD)) {
 recordList.add(indexedRecord.get());
   }
 } else {
-  keysToDelete.add(record.getKey());
+  keysToDelete.add(DeleteKey.create(record.getKey(), 
record.getData().getOrderingVal()));
 }

Review comment:
   Hello @vinothchandar, can you take a look when you have time? The only 
concern is that the new encoding/decoding breaks compatibility, and I don't 
yet have a good idea for keeping it compatible.
   
   But I want to point out that, before this patch, the handle may cause data 
loss. I think correctness is more important than compatibility.
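
The correctness concern above can be illustrated with a minimal, self-contained sketch. This is not Hudi's implementation: the class `OrderedDeleteSketch`, its `Event` type, and the `merge` method are hypothetical names made up for illustration. It only assumes that a late-arriving delete should not win over an upsert with a larger ordering value, which is what carrying the ordering value alongside the delete key makes possible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hudi code: shows why a delete that carries only the
// record key can break event-time ordering when events arrive out of order.
final class OrderedDeleteSketch {

  // An upsert (value != null) or a delete (value == null) with an ordering value.
  static final class Event {
    final String key;
    final String value;
    final long orderingVal;

    Event(String key, String value, long orderingVal) {
      this.key = key;
      this.value = value;
      this.orderingVal = orderingVal;
    }

    boolean isDelete() {
      return value == null;
    }
  }

  // Merges events in arrival order and returns the surviving key/value pairs.
  // When deletesCarryOrdering is false, a delete always wins (key-only delete);
  // when true, a delete is applied only if its ordering value is not older.
  static Map<String, String> merge(List<Event> events, boolean deletesCarryOrdering) {
    Map<String, Event> latest = new HashMap<>();
    for (Event e : events) {
      Event current = latest.get(e.key);
      boolean ignoreOrdering = e.isDelete() && !deletesCarryOrdering;
      if (current == null || ignoreOrdering || e.orderingVal >= current.orderingVal) {
        latest.put(e.key, e);
      }
    }
    Map<String, String> result = new HashMap<>();
    for (Event e : latest.values()) {
      if (!e.isDelete()) {
        result.put(e.key, e.value);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // The upsert with ordering value 2 arrives first; the delete with
    // ordering value 1 arrives late.
    List<Event> events = Arrays.asList(
        new Event("k1", "v2", 2L),
        new Event("k1", null, 1L));
    System.out.println("key-only delete:       " + merge(events, false));
    System.out.println("delete with ordering:  " + merge(events, true));
  }
}
```

In this sketch the key-only delete drops `k1` even though the upsert is newer in event time, while the delete that carries an ordering value is ignored as stale.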








[GitHub] [hudi] danny0405 commented on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


danny0405 commented on pull request #4879:
URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049472059


   Hello, can you re-submit the PR against the master branch? I didn't notice 
that you had submitted it against release-0.10.1.






[hudi] branch release-0.10.1 updated: [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)

2022-02-23 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch release-0.10.1
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.10.1 by this push:
 new 3102cf7  [HUDI-3488] The flink small file list should exclude file 
slices with pending compaction (#4879)
3102cf7 is described below

commit 3102cf7bbb8f81bc5cc92a01b4f65061c945deea
Author: yanenze <34880077+yane...@users.noreply.github.com>
AuthorDate: Thu Feb 24 11:58:26 2022 +0800

[HUDI-3488] The flink small file list should exclude file slices with 
pending compaction (#4879)

* [HUDI-3488] The flink small file list should exclude file slices with 
pending compaction
# this happen when the async-compaction has been configured

Co-authored-by: yanenze 
---
 .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
 
b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
index 97b6b23..aad775a 100644
--- 
a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
+++ 
b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
@@ -59,7 +59,7 @@ public class DeltaWriteProfile extends WriteProfile {
   List allSmallFileSlices = new ArrayList<>();
   // If we can index log files, we can add more inserts to log files for 
fileIds including those under
   // pending compaction.
-  List allFileSlices = 
fsView.getLatestFileSlicesBeforeOrOn(partitionPath, 
latestCommitTime.getTimestamp(), true)
+  List allFileSlices = 
fsView.getLatestFileSlicesBeforeOrOn(partitionPath, 
latestCommitTime.getTimestamp(), false)
   .collect(Collectors.toList());
   for (FileSlice fileSlice : allFileSlices) {
 if (isSmallFile(fileSlice)) {


[jira] [Commented] (HUDI-3488) The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497156#comment-17497156
 ] 

Danny Chen commented on HUDI-3488:
--

Fixed via master branch: 3102cf7bbb8f81bc5cc92a01b4f65061c945deea

> The flink small file list should exclude file slices with pending compaction
> 
>
> Key: HUDI-3488
> URL: https://issues.apache.org/jira/browse/HUDI-3488
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yanenze
>Priority: Blocker
>  Labels: flink, hudi, pull-request-available
> Fix For: 0.11.0
>
>
> When we use async compaction with Flink, the small file list found by the 
> bucketAssigner is missing the file that is in pendingCompaction, so the total 
> size is only calculated as (log file size * compressratio (0.35))
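
The size estimate mentioned in the description can be sketched as follows. This is a hypothetical illustration, not Hudi code: the class and method names, the assumed formula (base file bytes plus scaled log file bytes), and the 100 MB small-file limit are all assumptions; only the 0.35 ratio comes from the issue text. It shows how a file slice whose base file is not counted would be estimated from its log files alone and therefore look small.

```java
// Hypothetical sketch, not Hudi code.
final class SmallFileEstimateSketch {

  // Ratio quoted in the issue description.
  static final double LOG_TO_BASE_RATIO = 0.35;

  // Assumed estimate: base file bytes plus log file bytes scaled by the ratio.
  static long estimateSliceBytes(long baseFileBytes, long logFileBytes) {
    return baseFileBytes + Math.round(logFileBytes * LOG_TO_BASE_RATIO);
  }

  // A slice "looks small" when its estimated size is below the limit.
  static boolean looksSmall(long baseFileBytes, long logFileBytes, long limitBytes) {
    return estimateSliceBytes(baseFileBytes, logFileBytes) < limitBytes;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    long limit = 100 * mb; // assumed small-file limit, for illustration only

    // Slice with a 120 MB base file and 40 MB of logs: not small.
    System.out.println(looksSmall(120 * mb, 40 * mb, limit));

    // Same logs, but the base file is not counted: estimated from logs only,
    // so the slice looks small and would keep receiving new inserts.
    System.out.println(looksSmall(0, 40 * mb, limit));
  }
}
```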



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3488) The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-3488:
-
Fix Version/s: 0.11.0
   (was: 0.10.1)

> The flink small file list should exclude file slices with pending compaction
> 
>
> Key: HUDI-3488
> URL: https://issues.apache.org/jira/browse/HUDI-3488
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yanenze
>Priority: Blocker
>  Labels: flink, hudi, pull-request-available
> Fix For: 0.11.0
>
>
> When we use async compaction with Flink, the small file list found by the 
> bucketAssigner is missing the file that is in pendingCompaction, so the total 
> size is only calculated as (log file size * compressratio (0.35))





[jira] [Commented] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x

2022-02-23 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497155#comment-17497155
 ] 

Ethan Guo commented on HUDI-3341:
-

After trying to avoid seeking the end of file, I hit another exception:
{code:java}
Failed to retrieve list of partition from metadata
org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of 
partition from metadata
    at 
org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:110)
    at 
org.apache.hudi.cli.commands.MetadataCommand.listPartitions(MetadataCommand.java:208)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
    at 
org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
    at 
org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
    at 
org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
    at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
    at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
log file 
    at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:335)
    at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:180)
    at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:104)
    at 
org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:71)
    at 
org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:51)
    at 
org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:246)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:377)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$4(HoodieBackedTableMetadata.java:293)
    at 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:283)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$0(HoodieBackedTableMetadata.java:139)
    at java.util.HashMap.forEach(HashMap.java:1289)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:138)
    at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:128)
    at 
org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:275)
    at 
org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:108)
    ... 12 more
Caused by: org.apache.hudi.exception.HoodieIOException: IOException when 
reading logblock from log file 
HoodieLogFile{pathStr='s3a://hudi-testing/metadata_test_table_2/.hoodie/metadata/files/.files-_00.log.1_0-0-0',
 fileLen=-1}
    at 
org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:376)
    at 
org.apache.hudi.common.table.log.HoodieLogFormatReader.next(HoodieLogFormatReader.java:120)
    at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:212)
    ... 27 more
Caused by: org.apache.http.ConnectionClosedException: Premature end of 
Content-Length delimited message body (expected: 1; received: 0
    at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
    at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
    at 
org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
    at 
org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
    at 
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
    at 
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at 
com.amazonaws.services.s3.

[GitHub] [hudi] danny0405 merged pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction

2022-02-23 Thread GitBox


danny0405 merged pull request #4879:
URL: https://github.com/apache/hudi/pull/4879


   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049466870


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049467912


   
   ## CI report:
   
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] danny0405 commented on issue #4890: [SUPPORT]

2022-02-23 Thread GitBox


danny0405 commented on issue #4890:
URL: https://github.com/apache/hudi/issues/4890#issuecomment-1049467551


   Hello, do you mean the Scala compile problem?






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049466870


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049462749


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049465789


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049438140


   
   ## CI report:
   
   * d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
   * 8e39a758e427838f341b40a482bade2bae6e6af7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6238)
 
   * f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] BruceKellan opened a new issue #4892: [SUPPORT]Rollback files not deleted using spark

2022-02-23 Thread GitBox


BruceKellan opened a new issue #4892:
URL: https://github.com/apache/hudi/issues/4892


   **To Reproduce**
   Steps to reproduce the behavior:
   1. Start a Spark Structured Streaming application using Hudi.
   2. Restart the application manually.
   3. Rollback files are not deleted in the `.hoodie` directory.
   
   https://user-images.githubusercontent.com/13477122/155453783-8a24bb80-5b6b-49d1-b563-38d6e171b42e.png
   
   **Expected behavior**
   The rollback files are deleted when the application starts.
   
   **Environment Description**
   Hudi version : 0.10.0
   Spark version : 3.2.0
   Storage (HDFS/S3/GCS..) : AliyunOSS
   Running on Docker? (yes/no) : no
   






[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049462749


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049461602


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Closed] (HUDI-3492) Not table to get execution plan

2022-02-23 Thread Forward Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu closed HUDI-3492.

Resolution: Duplicate

> Not table to get execution plan
> ---
>
> Key: HUDI-3492
> URL: https://issues.apache.org/jira/browse/HUDI-3492
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>
> link to this question
> https://github.com/apache/hudi/issues/4859





[jira] [Created] (HUDI-3493) Not table to get execution plan

2022-02-23 Thread Forward Xu (Jira)
Forward Xu created HUDI-3493:


 Summary: Not table to get execution plan
 Key: HUDI-3493
 URL: https://issues.apache.org/jira/browse/HUDI-3493
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Forward Xu
Assignee: Forward Xu


link to this question

https://github.com/apache/hudi/issues/4859





[jira] [Created] (HUDI-3492) Not table to get execution plan

2022-02-23 Thread Forward Xu (Jira)
Forward Xu created HUDI-3492:


 Summary: Not table to get execution plan
 Key: HUDI-3492
 URL: https://issues.apache.org/jira/browse/HUDI-3492
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Forward Xu
Assignee: Forward Xu


link to this question

https://github.com/apache/hudi/issues/4859





[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049461602


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   * eea6fd0986803e32c09be1b39b8e0281e62f3b99 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

2022-02-23 Thread GitBox


hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1048583870


   
   ## CI report:
   
   * 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] FelixKJose opened a new issue #4891: Clustering not working on large table and partitions

2022-02-23 Thread GitBox


FelixKJose opened a new issue #4891:
URL: https://github.com/apache/hudi/issues/4891


   I have a large partitioned MOR Hudi table. I tried to perform async 
clustering using the Hudi clustering utility, but it failed without any stack 
trace. Then I tried inline clustering, but the clustering job failed with an 
OOM error. The clustering was performed on 365 partitions; each partition had 
518 million records, and each record (without compression) is 3-4 KB. 
Then I tried clustering on 10 partitions and it worked, but it seems that 
clustering pulls all the data for those partitions into driver memory after 
sorting and then repartitions it back to the worker nodes for writing. 
   
   1. How do people normally perform inline or async clustering on 
partitions with large amounts of data? Is driver memory expected to be 
larger than the size of the data being clustered? 
   2. Which configurations should I use to perform clustering on 
these large tables?
   3. In PROD I will have 1.8 billion records (each record 3-4 KB in memory), 
so is it advisable to perform clustering frequently (every 10 to 20 commits) or 
daily?
   4. Does a MOR table support async clustering with OCC assurance?
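
For a rough sense of scale, the figures quoted above can be turned into a back-of-the-envelope estimate. The sketch below is only arithmetic on the numbers in this report (518 million records per partition, 3-4 KB per record uncompressed), taken at face value; the class and method names are hypothetical, and 1 KB is assumed to mean 1024 bytes.

```java
// Hypothetical sketch: back-of-the-envelope sizing from the figures in the report.
final class ClusteringSizingSketch {

  // Uncompressed bytes for a number of records of a given size in KiB.
  static long uncompressedBytes(long records, long kibPerRecord) {
    return Math.multiplyExact(records, Math.multiplyExact(kibPerRecord, 1024L));
  }

  // Converts bytes to TiB for readability.
  static double toTiB(long bytes) {
    return bytes / (1024.0 * 1024.0 * 1024.0 * 1024.0);
  }

  public static void main(String[] args) {
    long recordsPerPartition = 518_000_000L;
    long low = uncompressedBytes(recordsPerPartition, 3);
    long high = uncompressedBytes(recordsPerPartition, 4);
    System.out.printf("per partition: %.2f to %.2f TiB uncompressed%n",
        toTiB(low), toTiB(high));
  }
}
```

With these inputs a single partition comes to roughly 1.4 to 1.9 TiB uncompressed, which is far more than a 12g Spark memory setting could hold if all of it had to pass through one JVM.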
   
   
   My config:
   "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   "hoodie.datasource.write.precombine.field": "eventDateTime",
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "hoodie.datasource.write.operation": "bulk_insert",
   "hoodie.table.name": "flattened_calculations_mor_awstest_clust",
   "hoodie.datasource.write.recordkey.field": "identifier",
   "hoodie.datasource.hive_sync.table": 
"flattened_calculations_mor_awstest_clust",
   "hoodie.datasource.write.partitionpath.field": 
"observationEndDate",
   "hoodie.datasource.hive_sync.partition_fields": 
"observationEndDate",
   "hoodie.insert.shuffle.parallelism": 7050,
   "hoodie.bulkinsert.shuffle.parallelism": 7050,
   "hoodie.parquet.small.file.limit": 0,
   "hoodie.datasource.clustering.inline.enable": "true",
   "hoodie.clustering.inline.max.commits": 1,
   "hoodie.clustering.plan.strategy.target.file.max.bytes": 
1073741824,
   "hoodie.clustering.plan.strategy.small.file.limit": 629145600,
   "hoodie.cleaner.commits.retained": 1,
   "hoodie.keep.min.commits": 2,
   "hoodie.compact.inline": "true",
   "hoodie.clustering.plan.strategy.sort.columns": 
"patientIdentifier_identifier_value",
   "hoodie.clustering.plan.strategy.daybased.lookback.partitions": 
365
   
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.0
   
   * AWS EMR: 6.5.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
   
   **Additional context**
   
   For async clustering via Hudi Util:
   sudo -s spark-submit \
   --class org.apache.hudi.utilities.HoodieClusteringJob \
   /usr/lib/hudi/hudi-utilities-bundle.jar \
   --props s3://**/aws/config/clusteringjob.properties \
   --mode scheduleAndExecute \
   --base-path 
s3://**/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/
 \
   --table-name flattened_calculations_mor_awstest2_s \
   --spark-memory 12g
   
   
   ==clusteringjob.properties==
   
   hoodie.clustering.async.enabled=true
   hoodie.clustering.async.max.commits=4
   hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   hoodie.clustering.plan.strategy.small.file.limit=629145600
   
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   
hoodie.clustering.plan.strategy.sort.columns=patientIdentifier_identifier_value
   
   I am getting the following error:
   
   **Stacktrace**
   
   ```22/02/08 20:17:52 INFO Javalin: Starting Javalin ...
   22/02/08 20:17:52 INFO Javalin: Listening on http://localhost:46705/
   22/02/08 20:17:52 INFO Javalin: Javalin started in 192ms 💃
   22/02/08 20:17:52 INFO S3NativeFileSystem: Opening 
's3://*/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/.hoodie/hoodie.properties'
 for reading
   22/02/08 20:17:52 INFO Javalin: Stopping Javalin ...
   22/02/08 20:17:52 INFO Javalin: Javalin has stopped
   22/02/08 20:17:52 ERROR HoodieClusteringJob: Clustering with basePath: 
s3://*/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/,
 tableName: flattened_calculations_mor_awstest2_s, runningMode: 
scheduleAndExecute failed
   22/02/08 20:17:52 INFO AbstractConnector: Stopped Spark@5d66941f{HTTP/1.1, 
(http/1.1)}{0.0.0.0:4040}
   22/02/08 20:17:52 INFO SparkUI: Stopped Spark web UI at 
http://ip-10-57-102-186.ec2.internal:4040/
   22/02/08 20:17:52 INFO YarnClientSchedulerBackend: Interrupting m
