[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
hudi-bot commented on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049580652

## CI report:

* 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
hudi-bot removed a comment on pull request #4897:
URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049544024

## CI report:

* 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot removed a comment on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049556727

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
* d8cb63488cb5331624fdefe82e4e88d5af1c18a9 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049580454

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
* d8cb63488cb5331624fdefe82e4e88d5af1c18a9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6267)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vinothchandar commented on a change in pull request #4789: [HUDI-1296] Support Metadata Table in Spark Datasource
vinothchandar commented on a change in pull request #4789: URL: https://github.com/apache/hudi/pull/4789#discussion_r813615713 ## File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieAvroDeserializerTrait.scala ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.avro + +/** + * Deserializes Avro payload into Catalyst object + * + * NOTE: This is low-level component operating on Spark internal data-types (comprising [[InternalRow]]). + * If you're looking to convert Avro into "deserialized" [[Row]] (comprised of Java native types), + * please check [[AvroConversionUtils]] + */ +trait HoodieAvroDeserializerTrait { + final def deserialize(data: Any): Option[Any] = +doDeserialize(data) match { + case opt: Option[_] => opt// As of Spark 3.1, this will return data wrapped with Option, so we fetch the data Review comment: if there is code specific to spark versions, can you move them into the adapters? how does this work with spark 2.x? 
## File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala ## @@ -1,380 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.hudi - -import java.nio.ByteBuffer -import java.sql.{Date, Timestamp} -import java.time.Instant - -import org.apache.avro.Conversions.DecimalConversion -import org.apache.avro.LogicalTypes.{TimestampMicros, TimestampMillis} -import org.apache.avro.Schema.Type._ -import org.apache.avro.generic.GenericData.{Fixed, Record} -import org.apache.avro.generic.{GenericData, GenericFixed, GenericRecord} -import org.apache.avro.{LogicalTypes, Schema} - -import org.apache.spark.sql.Row -import org.apache.spark.sql.avro.SchemaConverters -import org.apache.spark.sql.catalyst.expressions.GenericRow -import org.apache.spark.sql.catalyst.util.DateTimeUtils -import org.apache.spark.sql.types._ - -import org.apache.hudi.AvroConversionUtils._ -import org.apache.hudi.exception.HoodieIncompatibleSchemaException - -import scala.collection.JavaConverters._ - -object AvroConversionHelper { - - private def createDecimal(decimal: java.math.BigDecimal, precision: Int, scale: Int): Decimal = { -if (precision <= Decimal.MAX_LONG_DIGITS) { - // 
Constructs a `Decimal` with an unscaled `Long` value if possible. - Decimal(decimal.unscaledValue().longValue(), precision, scale) -} else { - // Otherwise, resorts to an unscaled `BigInteger` instead. - Decimal(decimal, precision, scale) -} - } - - /** -* -* Returns a converter function to convert row in avro format to GenericRow of catalyst. -* -* @param sourceAvroSchema Source schema before conversion inferred from avro file by passed in -* by user. -* @param targetSqlTypeTarget catalyst sql type after the conversion. -* @return returns a converter function to convert row in avro format to GenericRow of catalyst. -*/ - def createConverterToRow(sourceAvroSchema: Schema, Review comment: what happens to all this code? consolidated? if so, are we sure there are no subtle differences from the different conversion implementations? ## File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala ## @@ -18,20 +18,105 @@ package org.apache.hudi -imp
[GitHub] [hudi] LinMingQiang commented on pull request #4724: [HUDI-2815] add partial overwrite payload to support partial overwrit…
LinMingQiang commented on pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#issuecomment-1049573640

Do I need to modify the preCombine in the HoodieMergedLogRecordScanner.processNextRecord method? My understanding is that when we read a log file, we need to deduplicate records and also call preCombine.
[GitHub] [hudi] Guanpx closed issue #4658: [SUPPORT] Data lose with Flink write COW insert table, Flink web UI show Records Received was different with HIVE count(1)
Guanpx closed issue #4658:
URL: https://github.com/apache/hudi/issues/4658
[GitHub] [hudi] Guanpx commented on issue #4658: [SUPPORT] Data lose with Flink write COW insert table, Flink web UI show Records Received was different with HIVE count(1)
Guanpx commented on issue #4658:
URL: https://github.com/apache/hudi/issues/4658#issuecomment-1049571213

The option should be 'write.operation' = 'insert'. My code set 'hoodie.datasource.write.operation' = 'insert' instead, so Hudi fell back to the default 'write.operation' = 'upsert'. Closing this issue.
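The mix-up above comes from Flink SQL using short option keys while `hoodie.datasource.write.*` keys belong to the Spark datasource. A minimal illustrative Flink DDL (table name, columns, and path are made up for the example; only the option keys matter):

```sql
-- Hypothetical sink table; 'write.operation' is the Flink SQL key.
-- Setting 'hoodie.datasource.write.operation' here would be ignored,
-- leaving the default 'write.operation' = 'upsert' in effect.
CREATE TABLE hudi_cow_sink (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 's3://bucket/hudi_cow_sink',   -- illustrative path
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert'
);
```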
[jira] [Closed] (HUDI-1869) Upgrading Spark3 To 3.1
[ https://issues.apache.org/jira/browse/HUDI-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-1869.
----------------------------

> Upgrading Spark3 To 3.1
> -----------------------
>
>                 Key: HUDI-1869
>                 URL: https://issues.apache.org/jira/browse/HUDI-1869
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark
>            Reporter: pengzhiwei
>            Assignee: Yann Byron
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
> Spark 3.1 has changed some behavior of internal classes and interfaces in
> both the spark-sql and spark-core modules. Currently Hudi does not compile
> successfully under Spark 3.1, so we need to add SQL support for Spark 3.1.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Closed] (HUDI-1832) Support Hoodie CLI Command In Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-1832.
----------------------------
    Resolution: Duplicate

These will be supported by the call procedure command.

> Support Hoodie CLI Command In Spark SQL
> ---------------------------------------
>
>                 Key: HUDI-1832
>                 URL: https://issues.apache.org/jira/browse/HUDI-1832
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: pengzhiwei
>            Assignee: Yann Byron
>            Priority: Major
>
> Move the Hoodie CLI commands to Spark SQL. The syntax looks like the following:
> {code:java}
> CLI_COMMAND [ (param_key1 = value1, param_key2 = value2...) ]
> {code}
> e.g.
> {code:java}
> commits show
> commit showfiles (commit = ‘20210114221306’, limit = 10)
> show rollbacks
> savepoint create (commit = ‘20210114221306’)
> {code}
[GitHub] [hudi] hudi-bot removed a comment on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
hudi-bot removed a comment on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049507780

## CI report:

* 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
hudi-bot commented on pull request #4895:
URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049556954

## CI report:

* 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot removed a comment on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049465789

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4752:
URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049556727

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
* d8cb63488cb5331624fdefe82e4e88d5af1c18a9 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049525991

## CI report:

* 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262)
* 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049554997

## CI report:

* 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4898: rdd unpersist optimization
hudi-bot commented on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049549978

## CI report:

* d8b342602971c50db10d090945f8d2e757951852 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6266)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4898: rdd unpersist optimization
hudi-bot removed a comment on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049548486

## CI report:

* d8b342602971c50db10d090945f8d2e757951852 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4898: rdd unpersist optimization
hudi-bot commented on pull request #4898:
URL: https://github.com/apache/hudi/pull/4898#issuecomment-1049548486

## CI report:

* d8b342602971c50db10d090945f8d2e757951852 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Closed] (HUDI-2482) Support drop partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-2482.
----------------------------

> Support drop partitions SQL
> ---------------------------
>
>                 Key: HUDI-2482
>                 URL: https://issues.apache.org/jira/browse/HUDI-2482
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark
>            Reporter: Yann Byron
>            Assignee: Yann Byron
>            Priority: Major
>              Labels: features, pull-request-available
>             Fix For: 0.10.0
[jira] [Closed] (HUDI-2456) Support show partitions SQL
[ https://issues.apache.org/jira/browse/HUDI-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-2456.
----------------------------

> Support show partitions SQL
> ---------------------------
>
>                 Key: HUDI-2456
>                 URL: https://issues.apache.org/jira/browse/HUDI-2456
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark
>            Reporter: Yann Byron
>            Assignee: Yann Byron
>            Priority: Major
>              Labels: features, pull-request-available
>             Fix For: 0.10.0
>
> Spark SQL supports the following syntax to show a Hudi table's partitions.
> {code:java}
> SHOW PARTITIONS tableIdentifier partitionSpec?
> {code}
[GitHub] [hudi] scxwhite opened a new pull request #4898: rdd unpersist optimization
scxwhite opened a new pull request #4898:
URL: https://github.com/apache/hudi/pull/4898

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

When we start multiple Hudi jobs in one SparkSession, or cache some RDDs before a Hudi job starts, SparkRDDWriteClient#releaseResources releases all persisted RDDs, causing Spark to recompute them. So I think we should add a switch controlling whether to release all RDDs.

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
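The guard this PR argues for can be sketched as a small policy object: unpersist everything only when an explicit flag is set, otherwise touch only the RDDs the write client itself cached. Note this is an illustrative sketch, not actual Hudi code; the flag semantics and the ownership tracking are assumptions.

```java
import java.util.Set;

// Hypothetical helper: decides which cached RDD ids releaseResources()
// may unpersist. The caller would iterate the JVM's map of persisted
// RDDs and unpersist only the ids this policy approves.
public class UnpersistPolicy {
    private final boolean releaseAllCachedRdds;      // assumed new config switch
    private final Set<Integer> rddIdsOwnedByThisWrite; // ids cached by this client

    public UnpersistPolicy(boolean releaseAllCachedRdds, Set<Integer> owned) {
        this.releaseAllCachedRdds = releaseAllCachedRdds;
        this.rddIdsOwnedByThisWrite = owned;
    }

    /** Unpersist only if the global flag is on, or this write cached the RDD. */
    public boolean shouldUnpersist(int rddId) {
        return releaseAllCachedRdds || rddIdsOwnedByThisWrite.contains(rddId);
    }
}
```

With the flag off, RDDs cached by other jobs sharing the SparkSession survive `releaseResources`, which is exactly the recomputation the PR wants to avoid.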
[jira] [Closed] (HUDI-2538) Persist configs to hoodie.properties on the first write
[ https://issues.apache.org/jira/browse/HUDI-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-2538.
----------------------------

> Persist configs to hoodie.properties on the first write
> -------------------------------------------------------
>
>                 Key: HUDI-2538
>                 URL: https://issues.apache.org/jira/browse/HUDI-2538
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Yann Byron
>            Assignee: Yann Byron
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
> Some configs, like `keygenerator.class`, `hive_style_partitioning`, and
> `partitionpath.urlencode`, should be persisted to hoodie.properties when data
> is written for the first time; otherwise, inconsistent behavior can occur.
> Subsequent write operations then do not need to provide these configs, and if
> provided configs don't match the existing ones, exceptions should be raised.
> This is also useful for solving some of the keyGenerator discrepancy issues
> between the DataFrame writer and SQL.
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4878: [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator
zhangyue19921010 commented on a change in pull request #4878: URL: https://github.com/apache/hudi/pull/4878#discussion_r813588000 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## @@ -500,19 +578,179 @@ public int compare(FileSlice o1, FileSlice o2) { } } - public static class HoodieBaseFileCompactor implements Comparator, Serializable { + public static class HoodieBaseFileComparator implements Comparator, Serializable { @Override public int compare(HoodieBaseFile o1, HoodieBaseFile o2) { return o1.getPath().compareTo(o2.getPath()); } } - public static class HoodieFileGroupCompactor implements Comparator, Serializable { + public static class HoodieFileGroupComparator implements Comparator, Serializable { @Override public int compare(HoodieFileGroup o1, HoodieFileGroup o2) { return o1.getFileGroupId().compareTo(o2.getFileGroupId()); } } -} \ No newline at end of file + + public static class HoodieColumnRangeMetadataComparator + implements Comparator>, Serializable { + +@Override +public int compare(HoodieColumnRangeMetadata o1, HoodieColumnRangeMetadata o2) { + return o1.toString().compareTo(o2.toString()); +} + } + + /** + * Class for storing relevant information for metadata table validation. + * + * If metadata table is disabled, the APIs provide the information, e.g., file listing, + * index, from the file system and base files. If metadata table is enabled, the APIs + * provide the information from the metadata table. The same API is expected to return + * the same information regardless of whether metadata table is enabled, which is + * verified in the {@link HoodieMetadataTableValidator}. 
+ */ + private static class HoodieMetadataValidationContext implements Serializable { +private HoodieTableMetaClient metaClient; +private HoodieTableFileSystemView fileSystemView; +private HoodieTableMetadata tableMetadata; +private boolean enableMetadataTable; +private List allColumnNameList; + +public HoodieMetadataValidationContext( +HoodieEngineContext engineContext, Config cfg, HoodieTableMetaClient metaClient, +boolean enableMetadataTable) { + this.metaClient = metaClient; + this.enableMetadataTable = enableMetadataTable; + HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder() + .enable(enableMetadataTable) + .withMetadataIndexBloomFilter(enableMetadataTable) + .withMetadataIndexColumnStats(enableMetadataTable) + .withMetadataIndexForAllColumns(enableMetadataTable) + .withAssumeDatePartitioning(cfg.assumeDatePartitioning) + .build(); + this.fileSystemView = FileSystemViewManager.createInMemoryFileSystemView(engineContext, + metaClient, metadataConfig); + this.tableMetadata = HoodieTableMetadata.create(engineContext, metadataConfig, metaClient.getBasePath(), + FileSystemViewStorageConfig.SPILLABLE_DIR.defaultValue()); + if (metaClient.getCommitsTimeline().filterCompletedInstants().countInstants() > 0) { +this.allColumnNameList = getAllColumnNames(); + } +} + +public List getSortedLatestBaseFileList(String partitionPath) { + return fileSystemView.getLatestBaseFiles(partitionPath) + .sorted(new HoodieBaseFileComparator()).collect(Collectors.toList()); +} + +public List getSortedLatestFileSliceList(String partitionPath) { + return fileSystemView.getLatestFileSlices(partitionPath) + .sorted(new FileSliceComparator()).collect(Collectors.toList()); +} + +public List getSortedAllFileGroupList(String partitionPath) { + return fileSystemView.getAllFileGroups(partitionPath) + .sorted(new HoodieFileGroupComparator()).collect(Collectors.toList()); +} + +public List> getSortedColumnStatsList( +String partitionPath, List baseFileNameList) { + LOG.info("All 
column names for getting column stats: " + allColumnNameList); + if (enableMetadataTable) { +List> partitionFileNameList = baseFileNameList.stream() +.map(filename -> Pair.of(partitionPath, filename)).collect(Collectors.toList()); +return allColumnNameList.stream() +.flatMap(columnName -> +tableMetadata.getColumnStats(partitionFileNameList, columnName).values().stream() +.map(stats -> new HoodieColumnRangeMetadata<>( +stats.getFileName(), +columnName, +stats.getMinValue(), +stats.getMaxValue(), +stats.getNullCount(), +stats.getValueCount(), +stats.getTotalSize(), +stats.getTotalUncompressedSize())) +.collect(Collectors.
[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4878: [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator
zhangyue19921010 commented on a change in pull request #4878: URL: https://github.com/apache/hudi/pull/4878#discussion_r813589004 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java ## @@ -438,27 +481,62 @@ private void validateLatestBaseFiles(HoodieTableFileSystemView metaFsView, Hoodi /** * Compare getLatestFileSlices between metadata table and fileSystem. */ - private void validateLatestFileSlices(HoodieTableFileSystemView metaFsView, HoodieTableFileSystemView fsView, String partitionPath) { + private void validateLatestFileSlices( + HoodieMetadataValidationContext metadataTableBasedContext, + HoodieMetadataValidationContext fsBasedContext, String partitionPath) { -List latestFileSlicesFromMetadataTable = metaFsView.getLatestFileSlices(partitionPath).sorted(new FileSliceCompactor()).collect(Collectors.toList()); -List latestFileSlicesFromFS = fsView.getLatestFileSlices(partitionPath).sorted(new FileSliceCompactor()).collect(Collectors.toList()); +List latestFileSlicesFromMetadataTable = metadataTableBasedContext.getSortedLatestFileSliceList(partitionPath); +List latestFileSlicesFromFS = fsBasedContext.getSortedLatestFileSliceList(partitionPath); -LOG.info("Latest file list from metadata: " + latestFileSlicesFromMetadataTable + ". For partition " + partitionPath); -LOG.info("Latest file list from direct listing: " + latestFileSlicesFromFS + ". For partition " + partitionPath); +LOG.debug("Latest file list from metadata: " + latestFileSlicesFromMetadataTable + ". For partition " + partitionPath); +LOG.debug("Latest file list from direct listing: " + latestFileSlicesFromFS + ". 
For partition " + partitionPath); -validateFileSlice(latestFileSlicesFromMetadataTable, latestFileSlicesFromFS, partitionPath); +validate(latestFileSlicesFromMetadataTable, latestFileSlicesFromFS, partitionPath, "file slices"); LOG.info("Validation of getLatestFileSlices succeeded for partition " + partitionPath); } - private HoodieTableFileSystemView createHoodieTableFileSystemView(HoodieSparkEngineContext engineContext, boolean enableMetadataTable) { + private void validateAllColumnStats( + HoodieMetadataValidationContext metadataTableBasedContext, + HoodieMetadataValidationContext fsBasedContext, String partitionPath) { +List latestBaseFilenameList = fsBasedContext.getSortedLatestBaseFileList(partitionPath) +.stream().map(BaseFile::getFileName).collect(Collectors.toList()); +List> metadataBasedColStats = metadataTableBasedContext +.getSortedColumnStatsList(partitionPath, latestBaseFilenameList); +List> fsBasedColStats = fsBasedContext +.getSortedColumnStatsList(partitionPath, latestBaseFilenameList); -HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder() -.enable(enableMetadataTable) -.withAssumeDatePartitioning(cfg.assumeDatePartitioning) -.build(); +validate(metadataBasedColStats, fsBasedColStats, partitionPath, "column stats"); -return FileSystemViewManager.createInMemoryFileSystemView(engineContext, -metaClient, metadataConfig); +LOG.info("Validation of column stats succeeded for partition " + partitionPath); + } + + private void validateBloomFilters( + HoodieMetadataValidationContext metadataTableBasedContext, + HoodieMetadataValidationContext fsBasedContext, String partitionPath) { +List latestBaseFilenameList = fsBasedContext.getSortedLatestBaseFileList(partitionPath) +.stream().map(BaseFile::getFileName).collect(Collectors.toList()); +List metadataBasedBloomFilters = metadataTableBasedContext Review comment: same question for `latestBaseFilenameList ` mentioned before. 
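The refactor under review collapses the per-object helpers (validateFileSlice, and the new column-stats and bloom-filter checks) into one generic validate(...) that compares a sorted list from the metadata table against a sorted list from direct file-system listing. A minimal sketch of that pattern, with illustrative class and exception names rather than Hudi's actual ones:

```java
import java.util.Arrays;
import java.util.List;

public class ValidateSketch {

  // Illustrative stand-in for the exception the validator would raise.
  static class ValidationException extends RuntimeException {
    ValidationException(String msg) {
      super(msg);
    }
  }

  // Mirrors the refactor in the diff: each context produces a sorted list,
  // and a single generic validate(...) compares them, labeled per object
  // type ("file slices", "column stats", "bloom filters", ...).
  public static <T> void validate(List<T> fromMetadataTable, List<T> fromFs,
                                  String partitionPath, String label) {
    if (!fromMetadataTable.equals(fromFs)) {
      throw new ValidationException(String.format(
          "Validation of %s failed for partition %s: metadata=%s, direct listing=%s",
          label, partitionPath, fromMetadataTable, fromFs));
    }
  }

  public static void main(String[] args) {
    validate(Arrays.asList("slice-1", "slice-2"),
             Arrays.asList("slice-1", "slice-2"),
             "2022/02/24", "file slices");
    System.out.println("Validation of file slices succeeded");
  }
}
```

The sorted-list comparison is what makes a single generic method sufficient: once both sides are canonically ordered, equality of the lists implies equality of the listings.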
[jira] [Comment Edited] (HUDI-3201) Make partition auto discovery configurable
[ https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209 ] Yann Byron edited comment on HUDI-3201 at 2/24/22, 6:49 AM: h4. `hoodie.datasource.write.partitionpath.urlencode` can affect this behavior. was (Author: biyan900...@gmail.com): h4. `hoodie.datasource.write.partitionpath.urlencode can affect this behavior. > Make partition auto discovery configurable > -- > > Key: HUDI-3201 > URL: https://issues.apache.org/jira/browse/HUDI-3201 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Critical > Labels: user-support-issues > Fix For: 0.11.0 > > Original Estimate: 1h > Remaining Estimate: 1h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3201) Make partition auto discovery configurable
[ https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209 ] Yann Byron commented on HUDI-3201: -- h4. `hoodie.datasource.write.partitionpath.urlencode` can affect this behavior. > Make partition auto discovery configurable > -- > > Key: HUDI-3201 > URL: https://issues.apache.org/jira/browse/HUDI-3201 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Critical > Labels: user-support-issues > Fix For: 0.11.0 > > Original Estimate: 1h > Remaining Estimate: 1h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3201) Make partition auto discovery configurable
[ https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yann Byron closed HUDI-3201. Resolution: Fixed > Make partition auto discovery configurable > -- > > Key: HUDI-3201 > URL: https://issues.apache.org/jira/browse/HUDI-3201 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Critical > Labels: user-support-issues > Fix For: 0.11.0 > > Original Estimate: 1h > Remaining Estimate: 1h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3202) Add keygen to support partition discovery
[ https://issues.apache.org/jira/browse/HUDI-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yann Byron closed HUDI-3202. Resolution: Won't Do It is not necessary to add another keygen for this; this behavior can be controlled by `hoodie.datasource.write.partitionpath.urlencode`. Making it enabled by default can work. > Add keygen to support partition discovery > - > > Key: HUDI-3202 > URL: https://issues.apache.org/jira/browse/HUDI-3202 > Project: Apache Hudi > Issue Type: Improvement > Components: spark, writer-core >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Critical > Labels: user-support-issues > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
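The `hoodie.datasource.write.partitionpath.urlencode` option referenced in HUDI-3201/3202 controls whether the raw partition value is percent-encoded before it becomes a storage path segment. A small sketch of the effect (this illustrates the encoding only; it is not Hudi's actual key-generator code):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PartitionPathEncoding {

  // With urlencode=true, a partition value containing '/' is encoded into a
  // single path segment; with urlencode=false the same value would produce
  // nested directories on storage.
  public static String partitionSegment(String value, boolean urlencode) {
    return urlencode ? URLEncoder.encode(value, StandardCharsets.UTF_8) : value;
  }

  public static void main(String[] args) {
    System.out.println(partitionSegment("2022/02/24", true));  // one directory: 2022%2F02%2F24
    System.out.println(partitionSegment("2022/02/24", false)); // three nested directories
  }
}
```

This is why the encoding interacts with partition auto discovery: whether one logical partition value maps to one directory or several changes what a listing-based discovery sees.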
[hudi] branch master updated (62605be -> 943b997)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from 62605be [HUDI-3480][HUDI-3481] Enchancements to integ test suite (#4884) add 943b997 [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4893) No new revisions were added by this update. Summary of changes: .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[jira] [Comment Edited] (HUDI-3488) The flink small file list should exclude file slices with pending compaction
[ https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497156#comment-17497156 ] Danny Chen edited comment on HUDI-3488 at 2/24/22, 6:45 AM: Fixed via master branch: 943b99775b7b0cf4340a9af76b6dd235cf91e350 was (Author: danny0405): Fixed via master branch: 3102cf7bbb8f81bc5cc92a01b4f65061c945deea > The flink small file list should exclude file slices with pending compaction > > > Key: HUDI-3488 > URL: https://issues.apache.org/jira/browse/HUDI-3488 > Project: Apache Hudi > Issue Type: Improvement >Reporter: yanenze >Priority: Blocker > Labels: flink, hudi, pull-request-available > Fix For: 0.11.0 > > > When we use async compaction with Flink, the bucketAssigner finds the small file > list but loses files that are in pendingCompaction, so the total size only > accounts for (log file size * compression ratio (0.35)) -- This message was sent by Atlassian Jira (v8.20.1#820001)
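The one-line fix to DeltaWriteProfile in #4893 excludes file slices under pending compaction from the small-file candidates, because for such slices the size estimate (base file plus log files scaled by the ~0.35 compression ratio cited above) is incomplete. A hedged sketch of that selection logic, with simplified types rather than the actual Hudi classes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SmallFileSelection {

  static final double LOG_COMPRESSION_RATIO = 0.35; // ratio cited in the issue

  static class FileSlice {
    final String fileId;
    final long baseFileBytes;
    final long logFileBytes;
    final boolean pendingCompaction;

    FileSlice(String fileId, long baseFileBytes, long logFileBytes, boolean pendingCompaction) {
      this.fileId = fileId;
      this.baseFileBytes = baseFileBytes;
      this.logFileBytes = logFileBytes;
      this.pendingCompaction = pendingCompaction;
    }

    long estimatedTotalBytes() {
      return baseFileBytes + (long) (logFileBytes * LOG_COMPRESSION_RATIO);
    }
  }

  // The essence of the fix: slices already scheduled for compaction must not
  // be picked as small-file write targets, since their size estimate is
  // incomplete until compaction finishes.
  public static List<String> smallFileIds(List<FileSlice> slices, long smallFileLimitBytes) {
    return slices.stream()
        .filter(fs -> !fs.pendingCompaction)
        .filter(fs -> fs.estimatedTotalBytes() < smallFileLimitBytes)
        .map(fs -> fs.fileId)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileSlice> slices = Arrays.asList(
        new FileSlice("a", 10_000_000, 0, false),
        new FileSlice("b", 10_000_000, 0, true)); // pending compaction: excluded
    System.out.println(smallFileIds(slices, 120_000_000L));
  }
}
```

Without the pending-compaction filter, the bucket assigner would keep routing new records into a slice whose real size it cannot see, which is the mis-sizing the issue describes.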
[GitHub] [hudi] danny0405 merged pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
danny0405 merged pull request #4893: URL: https://github.com/apache/hudi/pull/4893 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanenze edited a comment on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
yanenze edited a comment on pull request #4879: URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049543930 > hello, can you re-submit the PR into master branch: i didn't see that you submit the PR into release-0.10.1 Hello, I have re-submitted it as PR #4893 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
hudi-bot commented on pull request #4897: URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049544024 ## CI report: * 3e08c375dd84084f1cf54fd35417deec4602ba1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6264) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
hudi-bot removed a comment on pull request #4897: URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049542836 ## CI report: * 3e08c375dd84084f1cf54fd35417deec4602ba1d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanenze commented on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
yanenze commented on pull request #4879: URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049543930 > hello, can you re-submit the PR into master branch: i didn't see that you submit the PR into release-0.10.1 Hello, I have committed it in PR #4893 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
hudi-bot commented on pull request #4897: URL: https://github.com/apache/hudi/pull/4897#issuecomment-1049542836 ## CI report: * 3e08c375dd84084f1cf54fd35417deec4602ba1d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x
[ https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3341: - Labels: HUDI-bug pull-request-available (was: HUDI-bug) > Investigate that metadata table cannot be read for hadoop-aws 2.7.x > --- > > Key: HUDI-3341 > URL: https://issues.apache.org/jira/browse/HUDI-3341 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: HUDI-bug, pull-request-available > Fix For: 0.11.0 > > > Environment: spark 2.4.4 + aws-java-sdk-1.7.4 + hadoop-aws-2.7.4, Hudi > 0.11.0-SNAPSHOT, metadata table enabled > On the write path, the ingestion is successful with metadata table updated. > When trying to read the metadata table for listing, e.g., using hudi-cli, the > operation fails with the following exception. > {code:java} > Failed to retrieve list of partition from metadata > org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of > partition from metadata > at > org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:110) > at > org.apache.hudi.cli.commands.MetadataCommand.listPartitions(MetadataCommand.java:208) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) > at > org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68) > at > org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59) > at > org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134) > at > org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) > 
at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieException: Exception when reading > log file > at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:334) > at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:179) > at > org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:103) > at > org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:71) > at > org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:51) > at > org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:246) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:376) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$4(HoodieBackedTableMetadata.java:292) > at > java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:282) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$0(HoodieBackedTableMetadata.java:138) > at java.util.HashMap.forEach(HashMap.java:1289) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:137) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:127) > at > org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:275) > at > org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:108) > ... 
12 more > Caused by: org.apache.hudi.exception.HoodieIOException: IOException when > reading logblock from log file > HoodieLogFile{pathStr='s3a://hudi-testing/metadata_test_table_2/.hoodie/metadata/files/.files-_00.log.1_0-0-0', > fileLen=-1} > at > org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:375) > at > org.apache.hudi.common.table.log.HoodieLogFormatReader.next(HoodieLogFormatReader.java:120) > at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:211) > ... 27 more > Caused by: java.io.IOException: Attempted read on closed str
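The truncated root cause above is a read attempted on a stream that hadoop-aws 2.7.x has already closed underneath the log file reader. As a general illustration of the defensive pattern involved (reopen the source and reposition before retrying the read; this is a generic sketch, not the actual change in PR #4897):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

// Wraps a re-openable source: if a read fails because the underlying stream
// was closed out from under us, reopen it and skip back to the last position.
public class ReopeningInputStream extends InputStream {

  private final Supplier<InputStream> opener;
  private InputStream in;
  private long pos;

  public ReopeningInputStream(Supplier<InputStream> opener) {
    this.opener = opener;
    this.in = opener.get();
  }

  @Override
  public int read() throws IOException {
    try {
      int b = in.read();
      if (b >= 0) {
        pos++;
      }
      return b;
    } catch (IOException closed) {
      // e.g. a "read on closed stream" failure: reopen and resume at pos.
      in = opener.get();
      long toSkip = pos;
      while (toSkip > 0) {
        long skipped = in.skip(toSkip);
        if (skipped <= 0) {
          break;
        }
        toSkip -= skipped;
      }
      int b = in.read();
      if (b >= 0) {
        pos++;
      }
      return b;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hudi".getBytes();
    ReopeningInputStream s = new ReopeningInputStream(() -> new ByteArrayInputStream(data));
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = s.read()) >= 0) {
      sb.append((char) b);
    }
    System.out.println(sb);
  }
}
```

The key design point is tracking the logical position independently of the physical stream, so a reopened stream can be fast-forwarded to where the failed read left off.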
[GitHub] [hudi] yihua opened a new pull request #4897: [WIP][HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x
yihua opened a new pull request #4897: URL: https://github.com/apache/hudi/pull/4897 ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
hudi-bot removed a comment on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049499769 ## CI report: * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
hudi-bot commented on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049536716 ## CI report: * ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan
hudi-bot commented on pull request #4894: URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049535340 ## CI report: * 7643d40678709453d546d59836d6ac4fee21779a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4894: [HUDI-3493] Not table to get execution plan
hudi-bot removed a comment on pull request #4894: URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049506313 ## CI report: * 7643d40678709453d546d59836d6ac4fee21779a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049525991 ## CI report: * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262) * 04cab32dafc945234ad9876b940fa27aebb3f69f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6263) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049524299 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262) * 04cab32dafc945234ad9876b940fa27aebb3f69f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049524299 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262) * 04cab32dafc945234ad9876b940fa27aebb3f69f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049518789 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Gatsby-Lee opened a new issue #4896: [SUPPORT] Metadata Table causes missing data.
Gatsby-Lee opened a new issue #4896: URL: https://github.com/apache/hudi/issues/4896 **Describe the problem you faced** Regardless of the table type ( CoW, MoR ), I notice missing data when the Metadata Table is enabled. For example, if I ingest 100,000 records ( no dups ) with the batch size 10,000, the ingested records in Hudi are not 100,000. I checked the number of records through Amazon Athena and also double-checked the count by running a Spark job. **Full Configuration** ``` { 'className': 'org.apache.hudi' 'hoodie.datasource.hive_sync.database': 'hudi_exp' 'hoodie.datasource.hive_sync.enable': 'true' 'hoodie.datasource.hive_sync.support_timestamp': 'true' 'hoodie.datasource.hive_sync.table': 'hudi_etl_exp' 'hoodie.datasource.hive_sync.use_jdbc': 'false' 'hoodie.datasource.write.hive_style_partitioning': 'true' 'hoodie.datasource.write.partitionpath.field': 'org_id' 'hoodie.datasource.write.recordkey.field': 'obj_id' 'hoodie.table.name': 'hudi_etl_exp' 'hoodie.bulkinsert.shuffle.parallelism': '24' 'hoodie.delete.shuffle.parallelism': '24' 'hoodie.insert.shuffle.parallelism': '24' 'hoodie.upsert.shuffle.parallelism': '24' 'hoodie.index.type': 'BLOOM' 'hoodie.bloom.index.prune.by.ranges': 'true' 'hoodie.datasource.clustering.async.enable': 'false' 'hoodie.datasource.clustering.inline.enable': 'false' 'hoodie.datasource.compaction.async.enable': 'false' 'hoodie.clean.automatic': 'true' 'hoodie.clean.async': 'true' 'hoodie.keep.max.commits': 40 'hoodie.keep.min.commits': 30 'hoodie.cleaner.commits.retained': 20 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS' 'hoodie.compact.inline': 'false' 'hoodie.clustering.async.enabled': 'false' 'hoodie.clustering.async.max.commits': 4 'hoodie.clustering.inline': 'false' 'hoodie.metadata.clean.async': 'true' 'hoodie.cleaner.policy.failed.writes': 'LAZY' 'hoodie.write.concurrency.mode': 'OPTIMISTIC_CONCURRENCY_CONTROL' 'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider' 
'hoodie.write.lock.zookeeper.port': '2181' 'hoodie.write.lock.zookeeper.url': 'zookeeper_url' 'hoodie.write.lock.zookeeper.base_path': 'zookeeper_base_path' 'hoodie.write.lock.zookeeper.lock_key': 'hudi_etl_exp' 'path': 's3://hello-hudi/hudi_exp/hudi_etl_exp' 'hoodie.datasource.write.precombine.field': '_etl_cluster_ts' 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor' 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator' 'hoodie.datasource.hive_sync.partition_fields': 'org_id' 'hoodie.combine.before.upsert': 'true' 'hoodie.datasource.write.operation': 'upsert' 'hoodie.datasource.write.table.type': 'COPY_ON_WRITE' 'hoodie.table.type': 'COPY_ON_WRITE' 'hoodie.metadata.enable': 'true' } ``` **To Reproduce** Steps to reproduce the behavior: 1. Generate 100 random records 2. Ingest 10 records per batch 3. Count the number of ingested records ( 10, 20, 30 ) **Expected behavior** All 100 records have to be in the Hudi table **Environment Description** * Hudi version : 0.9.0 * Spark version : 3.1.1-amzn-0 * Hive version : 2.3.7-amzn-4 * Hadoop version : 3.2.1-amzn-3 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
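The expected-count reasoning in this report follows from upsert semantics: with `hoodie.datasource.write.recordkey.field` set to `obj_id` and no duplicate keys, each upsert either inserts a new row or overwrites an existing one, so the final count must equal the number of distinct keys ingested. A toy model of that invariant (illustrative only, not Hudi code):

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertCountModel {
  public static void main(String[] args) {
    // Record key -> latest payload; an upsert overwrites, never duplicates.
    Map<String, String> table = new HashMap<>();
    int batches = 10;
    int recordsPerBatch = 10;
    for (int batch = 0; batch < batches; batch++) {
      for (int i = 0; i < recordsPerBatch; i++) {
        String recordKey = "obj-" + (batch * recordsPerBatch + i); // unique keys, no dups
        table.put(recordKey, "payload-batch-" + batch);
      }
    }
    // With unique keys, the final count must equal everything ingested;
    // anything less (as in this report) indicates lost data, not dedup.
    System.out.println(table.size());
  }
}
```

Any shortfall against this invariant therefore cannot be explained by precombine/dedup behavior and points at the metadata-table listing path the issue suspects.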
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049517406 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049518789 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6262) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049489672 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049517406 ## CI report: * 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257) * 3dd778d4de48d9728846db6264d0e0fc7720d6cb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Gatsby-Lee commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue
Gatsby-Lee commented on issue #4873: URL: https://github.com/apache/hudi/issues/4873#issuecomment-1049512318 @nsivabalan I have a question. In the reported config, there are three fields. Do all three fields have to be "timestamp characteristics"? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Gatsby-Lee commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue
Gatsby-Lee commented on issue #4873: URL: https://github.com/apache/hudi/issues/4873#issuecomment-1049511235

@cafelo-pfdrive it is something that increases incrementally.
[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
hudi-bot commented on pull request #4895: URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049507780

## CI report:

* 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6261)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
hudi-bot removed a comment on pull request #4895: URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049506331

## CI report:

* 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4894: [HUDI-3493] Not table to get execution plan
hudi-bot removed a comment on pull request #4894: URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049504911

## CI report:

* 7643d40678709453d546d59836d6ac4fee21779a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan
hudi-bot commented on pull request #4894: URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049506313

## CI report:

* 7643d40678709453d546d59836d6ac4fee21779a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6260)
[GitHub] [hudi] hudi-bot commented on pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
hudi-bot commented on pull request #4895: URL: https://github.com/apache/hudi/pull/4895#issuecomment-1049506331

## CI report:

* 4151cdaa42adc5f2ae56b3ca8f397e4ba702bb33 UNKNOWN
[GitHub] [hudi] BruceKellan commented on issue #4892: [SUPPORT] Rollback files not deleted using spark
BruceKellan commented on issue #4892: URL: https://github.com/apache/hudi/issues/4892#issuecomment-1049505772

@nsivabalan Thanks for your reply. Do you mean the requirement for archival counts only rollbacks, not rollbacks and commits together?
[jira] [Updated] (HUDI-3483) Add insert overwrite tests for spark DS
[ https://issues.apache.org/jira/browse/HUDI-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-3483:
---------------------------------
    Labels: pull-request-available  (was: )

> Add insert overwrite tests for spark DS
> ---------------------------------------
>
>                 Key: HUDI-3483
>                 URL: https://issues.apache.org/jira/browse/HUDI-3483
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: tests-ci
>            Reporter: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] nsivabalan opened a new pull request #4895: [HUDI-3483] Adding insert override nodes to integ test suite and few clean ups
nsivabalan opened a new pull request #4895: URL: https://github.com/apache/hudi/pull/4895

## What is the purpose of the pull request

- Added insert override nodes and yamls to integ test suite
- Minor clean ups

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot commented on pull request #4894: [HUDI-3493] Not table to get execution plan
hudi-bot commented on pull request #4894: URL: https://github.com/apache/hudi/pull/4894#issuecomment-1049504911

## CI report:

* 7643d40678709453d546d59836d6ac4fee21779a UNKNOWN
[jira] [Updated] (HUDI-3493) Not table to get execution plan
[ https://issues.apache.org/jira/browse/HUDI-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-3493:
---------------------------------
    Labels: pull-request-available  (was: )

> Not table to get execution plan
> -------------------------------
>
>                 Key: HUDI-3493
>                 URL: https://issues.apache.org/jira/browse/HUDI-3493
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>            Reporter: Forward Xu
>            Assignee: Forward Xu
>            Priority: Major
>              Labels: pull-request-available
>
> link to this question https://github.com/apache/hudi/issues/4859
[GitHub] [hudi] XuQianJin-Stars opened a new pull request #4894: [HUDI-3493] Not table to get execution plan
XuQianJin-Stars opened a new pull request #4894: URL: https://github.com/apache/hudi/pull/4894

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

link: [HUDI-3493](https://issues.apache.org/jira/browse/HUDI-3493)

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
hudi-bot removed a comment on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049498405

## CI report:

* ad3affb39d3bf5c74a78cf4bcf92567a37aad580 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
hudi-bot commented on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049499769

## CI report:

* ad3affb39d3bf5c74a78cf4bcf92567a37aad580 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6259)
[GitHub] [hudi] nsivabalan commented on issue #4892: [SUPPORT] Rollback files not deleted using spark
nsivabalan commented on issue #4892: URL: https://github.com/apache/hudi/issues/4892#issuecomment-1049498963

Rollback files don't get deleted immediately; they have to meet the requirement for archival. `hoodie.keep.max.commits` comes into play here. Once your rollback instant count reaches that threshold, they will get archived.
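To make the archival knobs mentioned in the comment above concrete, here is a minimal sketch of Hudi writer options governing when instants (including rollbacks) become eligible for archival. The option keys are real Hudi write configs, but the table name and the threshold values are hypothetical illustrations, not recommendations:

```python
# Sketch: Hudi archival-related write options (illustrative values).
# Archival typically requires: cleaner.commits.retained < keep.min.commits < keep.max.commits.
hudi_archival_options = {
    "hoodie.table.name": "my_table",          # hypothetical table name
    "hoodie.keep.min.commits": "20",          # archival trims the timeline down to this many instants
    "hoodie.keep.max.commits": "30",          # archival kicks in once instant count exceeds this
    "hoodie.cleaner.commits.retained": "10",  # must stay below keep.min.commits
}

# Typical usage with a Spark DataFrame writer (path is hypothetical):
# df.write.format("hudi").options(**hudi_archival_options).mode("append").save("s3://bucket/my_table")
```

With these settings, instants older than the retained window are moved to the archived timeline rather than deleted outright, which is why rollback instants linger until the threshold is crossed.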
[GitHub] [hudi] hudi-bot commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
hudi-bot commented on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049498405

## CI report:

* ad3affb39d3bf5c74a78cf4bcf92567a37aad580 UNKNOWN
[GitHub] [hudi] yanenze commented on pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
yanenze commented on pull request #4893: URL: https://github.com/apache/hudi/pull/4893#issuecomment-1049497783

@danny0405 hello, I re-submitted the PR against master.
[GitHub] [hudi] yanenze opened a new pull request #4893: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
yanenze opened a new pull request #4893: URL: https://github.com/apache/hudi/pull/4893

…ompaction

# this happen when the async-compaction has been configured

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] VIKASPATID commented on issue #4635: [SUPPORT] Bulk write failing due to hudi timeline archive exception
VIKASPATID commented on issue #4635: URL: https://github.com/apache/hudi/issues/4635#issuecomment-1049491206

Hi @nsivabalan, here is the reproducible PySpark script:

```python
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
from pyspark.sql.types import *
import time
from pyspark.sql.functions import lit
from pyspark.sql.functions import col, when, expr
import argparse
import threading

spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
sc = spark.sparkContext

table_name = None
table_path = None

header = [
    ["A0", "STRING"], ["A1", "STRING"], ["A2", "STRING"], ["A3", "STRING"], ["A4", "STRING"],
    ["A5", "INTEGER"], ["A6", "INTEGER"], ["A7", "SHORT"], ["A8", "INTEGER"], ["A9", "LONG"],
    ["A10", "DOUBLE"], ["A11", "INTEGER"], ["A12", "LONG"], ["A13", "DOUBLE"], ["A14", "LONG"],
    ["A15", "DOUBLE"], ["A16", "DOUBLE"], ["A17", "INTEGER"], ["A18", "SHORT"], ["A19", "DOUBLE"],
    ["A20", "INTEGER"], ["A21", "SHORT"], ["A22", "DOUBLE"], ["A23", "STRING"], ["A24", "STRING"],
    ["A25", "INTEGER"], ["A26", "INTEGER"], ["A27", "STRING"], ["A28", "INTEGER"], ["A29", "INTEGER"],
    ["A30", "STRING"], ["A31", "DOUBLE"], ["A32", "DOUBLE"], ["A33", "STRING"], ["A34", "DOUBLE"],
    ["A35", "INTEGER"], ["A36", "SHORT"], ["A37", "STRING"], ["A38", "DOUBLE"], ["A39", "STRING"],
    ["A40", "STRING"], ["A41", "STRING"], ["A42", "STRING"], ["A43", "STRING"], ["A44", "INTEGER"],
    ["A45", "LONG"], ["A46", "LONG"], ["A47", "LONG"], ["A48", "LONG"], ["A49", "LONG"],
    ["A50", "LONG"], ["A51", "INTEGER"], ["A52", "INTEGER"], ["A53", "INTEGER"], ["A54", "INTEGER"],
    ["A55", "INTEGER"], ["A56", "DOUBLE"], ["A57", "DOUBLE"], ["A58", "DOUBLE"], ["A59", "DOUBLE"],
    ["A60", "LONG"], ["A61", "STRING"], ["A62", "DOUBLE"], ["A63", "STRING"], ["A64", "DOUBLE"],
    ["A65", "DOUBLE"], ["A66", "LONG"], ["A67", "LONG"],
]

common_config = {
    'className': 'org.apache.hudi',
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    'hoodie.write.lock.zookeeper.url': 'xxx',
    'hoodie.write.lock.zookeeper.port': '2181',
    'hoodie.write.lock.zookeeper.lock_key': f"{table_name}",
    'hoodie.write.lock.zookeeper.base_path': '/hudi',
    'hoodie.datasource.write.row.writer.enable': 'false',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'A1,A9',
    'hoodie.datasource.write.partitionpath.field': 'A2,A5',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.precombine.field': "A5",
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.enable': 'false',
    'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
    'hoodie.datasource.hive_sync.table': f"{table_name}",
    'hoodie.datasource.hive_sync.partition_fields': 'A2,A5',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.copyonwrite.record.size.estimate': 256,
    'hoodie.write.lock.client.wait_time_ms': 1000,
    'hoodie.write.lock.client.num_retries': 50
}

init_load_config = {
    'hoodie.parquet.max.file.size': 1024 * 1024 * 1024,
    'hoodie.bulkinsert.shuffle.parallelism': 10,
    'compactionSmallFileSize': 100 * 1024 * 1024,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.write.markers.type': "DIRECT"
    # 'hoodie.compact.inline': True
    # 'hoodie.datasource.write.insert.drop.duplicates': 'true'
}

increamental_config = {
    'hoodie.upsert.shuffle.parallelism': 1,
    'hoodie.insert.shuffle.parallelism': 1,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.clean.automatic': True
}

def get_parameters():
    parser = argparse.ArgumentParser(description='Usage: --table_path= --table_name=')
    parser.add_argument('--table_path', help='table_path', required=True)
    parser.add_argument('--table_name', help='table_name', required=True)
    (args, unknown) = parser.parse_known_args()
    return args

def main():
    global table_path
    global table_name
    params = get_parameters()
    table_path = params.table_path
    table_name = params.table_name
    common_config['hoodie.table.name'] = table_name
    common_config['hoodie.datasource.hive_syn
```
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049467912

## CI report:

* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
* 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049489672

## CI report:

* 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot removed a comment on pull request #4739: URL: https://github.com/apache/hudi/pull/4739#issuecomment-1049456121

## CI report:

* 11f1b688459ab9017ebde2a38d1645e0f59b50c3 UNKNOWN
* c243f70d774b7ecb059dad4bb03870b2c2d4436b UNKNOWN
* e1771f831ae0a59baf39497b337fe304be901149 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6253)
* 2790e24a229e808602113c7ed80932b09e56c8fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6255)
[GitHub] [hudi] hudi-bot commented on pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS
hudi-bot commented on pull request #4739: URL: https://github.com/apache/hudi/pull/4739#issuecomment-1049487198

## CI report:

* 11f1b688459ab9017ebde2a38d1645e0f59b50c3 UNKNOWN
* c243f70d774b7ecb059dad4bb03870b2c2d4436b UNKNOWN
* 2790e24a229e808602113c7ed80932b09e56c8fd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6255)
[hudi] branch release-0.10.1 updated (3102cf7 -> 84fb390)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch release-0.10.1 in repository https://gitbox.apache.org/repos/asf/hudi.git.

    discard 3102cf7 [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)

This update removed existing revisions from the reference, leaving the reference pointing at a previous point in the repository history.

     * -- * -- N   refs/heads/release-0.10.1 (84fb390)
                \
                 O -- O -- O   (3102cf7)

Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever. No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[GitHub] [hudi] danny0405 commented on a change in pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
danny0405 commented on a change in pull request #4880: URL: https://github.com/apache/hudi/pull/4880#discussion_r812620201

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java

```java
@@ -475,12 +475,12 @@ private void writeToBuffer(HoodieRecord record) {
     }
     Option indexedRecord = getIndexedRecord(record);
     if (indexedRecord.isPresent()) {
-      // Skip the Ignore Record.
+      // Skip the ignored record.
       if (!indexedRecord.get().equals(IGNORE_RECORD)) {
         recordList.add(indexedRecord.get());
       }
     } else {
-      keysToDelete.add(record.getKey());
+      keysToDelete.add(DeleteKey.create(record.getKey(), record.getData().getOrderingVal()));
     }
```

Review comment: Hello @vinothchandar, can you take a look if you have time? The only concern is that the new encoding/decoding breaks compatibility, and I have no good idea how to stay compatible yet. But I want to point out that before this patch, the handle may cause data loss. Compared to compatibility, correctness is more important, I think.
[GitHub] [hudi] danny0405 commented on pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
danny0405 commented on pull request #4879: URL: https://github.com/apache/hudi/pull/4879#issuecomment-1049472059

Hello, can you re-submit the PR against the master branch? I didn't notice that you had submitted the PR against release-0.10.1.
[hudi] branch release-0.10.1 updated: [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch release-0.10.1 in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/release-0.10.1 by this push:
     new 3102cf7 [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)

3102cf7 is described below

commit 3102cf7bbb8f81bc5cc92a01b4f65061c945deea
Author: yanenze <34880077+yane...@users.noreply.github.com>
AuthorDate: Thu Feb 24 11:58:26 2022 +0800

    [HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4879)

    * [HUDI-3488] The flink small file list should exclude file slices with pending compaction
    # this happen when the async-compaction has been configured

    Co-authored-by: yanenze

---
 .../org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
index 97b6b23..aad775a 100644
--- a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
+++ b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/DeltaWriteProfile.java
@@ -59,7 +59,7 @@ public class DeltaWriteProfile extends WriteProfile {
     List allSmallFileSlices = new ArrayList<>();
     // If we can index log files, we can add more inserts to log files for fileIds including those under
     // pending compaction.
-    List allFileSlices = fsView.getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitTime.getTimestamp(), true)
+    List allFileSlices = fsView.getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitTime.getTimestamp(), false)
         .collect(Collectors.toList());
     for (FileSlice fileSlice : allFileSlices) {
       if (isSmallFile(fileSlice)) {
[jira] [Commented] (HUDI-3488) The flink small file list should exclude file slices with pending compaction
[ https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497156#comment-17497156 ]

Danny Chen commented on HUDI-3488:
----------------------------------

Fixed via master branch: 3102cf7bbb8f81bc5cc92a01b4f65061c945deea

> The flink small file list should exclude file slices with pending compaction
> ----------------------------------------------------------------------------
>
>                 Key: HUDI-3488
>                 URL: https://issues.apache.org/jira/browse/HUDI-3488
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: yanenze
>            Priority: Blocker
>              Labels: flink, hudi, pull-request-available
>             Fix For: 0.11.0
>
> When async compaction is configured with Flink, the bucketAssigner's small-file list loses the files that are under pending compaction, so the total size is calculated only from (log file size * compression ratio (0.35)).
[jira] [Updated] (HUDI-3488) The flink small file list should exclude file slices with pending compaction
[ https://issues.apache.org/jira/browse/HUDI-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-3488:
-----------------------------
    Fix Version/s: 0.11.0
                       (was: 0.10.1)

> The flink small file list should exclude file slices with pending compaction
> ----------------------------------------------------------------------------
>
>                 Key: HUDI-3488
>                 URL: https://issues.apache.org/jira/browse/HUDI-3488
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: yanenze
>            Priority: Blocker
>              Labels: flink, hudi, pull-request-available
>             Fix For: 0.11.0
>
> When async compaction is configured with Flink, the bucketAssigner's small-file list loses the files that are under pending compaction, so the total size is calculated only from (log file size * compression ratio (0.35)).
[jira] [Commented] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x
[ https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497155#comment-17497155 ]

Ethan Guo commented on HUDI-3341:
---------------------------------

After trying to avoid seeking the end of file, I hit another exception:

{code:java}
Failed to retrieve list of partition from metadata
org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
	at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:110)
	at org.apache.hudi.cli.commands.MetadataCommand.listPartitions(MetadataCommand.java:208)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
	at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
	at org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
	at org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
	at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
	at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:335)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:180)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:104)
	at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:71)
	at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.<init>(HoodieMetadataMergedLogRecordReader.java:51)
	at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:246)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:377)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$4(HoodieBackedTableMetadata.java:293)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:283)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$0(HoodieBackedTableMetadata.java:139)
	at java.util.HashMap.forEach(HashMap.java:1289)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:138)
	at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:128)
	at org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:275)
	at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:108)
	... 12 more
Caused by: org.apache.hudi.exception.HoodieIOException: IOException when reading logblock from log file HoodieLogFile{pathStr='s3a://hudi-testing/metadata_test_table_2/.hoodie/metadata/files/.files-_00.log.1_0-0-0', fileLen=-1}
	at org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:376)
	at org.apache.hudi.common.table.log.HoodieLogFormatReader.next(HoodieLogFormatReader.java:120)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:212)
	... 27 more
Caused by: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1; received: 0)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
	at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
	at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
	at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
	at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at com.amazonaws.services.s3.
{code}
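The final cause is httpclient detecting that S3 delivered fewer body bytes than the `Content-Length` header promised. The check can be sketched as below; this is an illustrative re-implementation, not the actual httpclient or Hudi code, and `readFully` is a hypothetical helper:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ContentLengthCheck {

    // Read exactly `expected` bytes from `in`, mimicking how a
    // Content-Length-delimited reader must fail when the connection
    // delivers fewer bytes than the header promised.
    static byte[] readFully(InputStream in, int expected) throws IOException {
        byte[] buf = new byte[expected];
        int off = 0;
        while (off < expected) {
            int n = in.read(buf, off, expected - off);
            if (n < 0) {
                // Stream ended early: the same failure mode as the
                // ConnectionClosedException in the stack trace above.
                throw new EOFException("Premature end of Content-Length delimited message body"
                        + " (expected: " + expected + "; received: " + off + ")");
            }
            off += n;
        }
        return buf;
    }

    public static void main(String[] args) {
        // The trace shows expected: 1; received: 0 -- a body that should
        // hold one byte but is already at EOF.
        try {
            readFully(new ByteArrayInputStream(new byte[0]), 1);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With `fileLen=-1` the reader does not know the real object length up front, so a seek relative to an assumed end of file can issue a ranged read that the S3 response cannot satisfy, which is consistent with the `expected: 1; received: 0` mismatch above.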
[GitHub] [hudi] danny0405 merged pull request #4879: [HUDI-3488] The flink small file list should exclude file slices with pending compaction
danny0405 merged pull request #4879: URL: https://github.com/apache/hudi/pull/4879 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049466870

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
* 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure`: re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049467912

## CI report:

* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
* 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6257)
[GitHub] [hudi] danny0405 commented on issue #4890: [SUPPORT]
danny0405 commented on issue #4890: URL: https://github.com/apache/hudi/issues/4890#issuecomment-1049467551

Hello, do you mean the Scala compile problem?
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049466870

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
* 20a345bbefea14120a98d51a8a2fe6aaa002a0b9 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049462749

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
[GitHub] [hudi] hudi-bot commented on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4752: URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049465789

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4752: [WIP][HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot removed a comment on pull request #4752: URL: https://github.com/apache/hudi/pull/4752#issuecomment-1049438140

## CI report:

* d5f1fbad92cd451d5ac7cf81f5f8612ff18d85ed UNKNOWN
* 8e39a758e427838f341b40a482bade2bae6e6af7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6238)
* f5de54968cb90d2c6bbbf85394b4181880b46c02 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6254)
[GitHub] [hudi] BruceKellan opened a new issue #4892: [SUPPORT]Rollback files not deleted using spark
BruceKellan opened a new issue #4892: URL: https://github.com/apache/hudi/issues/4892

**To Reproduce**

Steps to reproduce the behavior:

1. Start a Spark Structured Streaming application using Hudi.
2. Restart the application manually.
3. Rollback files are not deleted in the `.hoodie` directory.

(screenshot: https://user-images.githubusercontent.com/13477122/155453783-8a24bb80-5b6b-49d1-b563-38d6e171b42e.png)

**Expected behavior**

The rollback file is deleted when the application starts.

**Environment Description**

* Hudi version: 0.10.0
* Spark version: 3.2.0
* Storage (HDFS/S3/GCS..): AliyunOSS
* Running on Docker? (yes/no): no
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049462749

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6256)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049461602

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 UNKNOWN
[jira] [Closed] (HUDI-3492) Not table to get execution plan
[ https://issues.apache.org/jira/browse/HUDI-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Forward Xu closed HUDI-3492.
Resolution: Duplicate

> Not table to get execution plan
> -------------------------------
>
> Key: HUDI-3492
> URL: https://issues.apache.org/jira/browse/HUDI-3492
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: Forward Xu
> Assignee: Forward Xu
> Priority: Major
>
> link to this question https://github.com/apache/hudi/issues/4859

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3493) Not table to get execution plan
Forward Xu created HUDI-3493:

Summary: Not table to get execution plan
Key: HUDI-3493
URL: https://issues.apache.org/jira/browse/HUDI-3493
Project: Apache Hudi
Issue Type: Bug
Components: spark-sql
Reporter: Forward Xu
Assignee: Forward Xu

link to this question https://github.com/apache/hudi/issues/4859
[jira] [Created] (HUDI-3492) Not table to get execution plan
Forward Xu created HUDI-3492:

Summary: Not table to get execution plan
Key: HUDI-3492
URL: https://issues.apache.org/jira/browse/HUDI-3492
Project: Apache Hudi
Issue Type: Bug
Components: spark-sql
Reporter: Forward Xu
Assignee: Forward Xu

link to this question https://github.com/apache/hudi/issues/4859
[GitHub] [hudi] hudi-bot commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1049461602

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
* eea6fd0986803e32c09be1b39b8e0281e62f3b99 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
hudi-bot removed a comment on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1048583870

## CI report:

* 3d7f2d4f3e4ce5c195be0ea9b9fec4edb191525d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6227)
[GitHub] [hudi] FelixKJose opened a new issue #4891: Clustering not working on large table and partitions
FelixKJose opened a new issue #4891: URL: https://github.com/apache/hudi/issues/4891

I have a large partitioned MOR Hudi table. I first tried async clustering with the Hudi clustering utility, but it failed without any stack trace. I then tried inline clustering, and that job failed with an OOM error. The clustering covered 365 partitions, each holding 518 million records of 3-4 KB (uncompressed). Clustering only 10 partitions worked, but it seems the clustering pulls all the data for those partitions into driver memory after sorting and then repartitions it back to the worker nodes for writing.

1. How do people normally perform inline or async clustering on partitions with a large amount of data? Is the driver memory expected to be larger than the data being clustered?
2. Which configurations should I use to cluster these large tables?
3. In PROD I will have 1.8 billion records (3-4 KB each in memory); is it advisable to cluster frequently (every 10 to 20 commits) or daily?
4. Does a MOR table support async clustering with OCC assurance?
My config:

```
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.precombine.field": "eventDateTime",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.datasource.write.operation": "bulk_insert",
"hoodie.table.name": "flattened_calculations_mor_awstest_clust",
"hoodie.datasource.write.recordkey.field": "identifier",
"hoodie.datasource.hive_sync.table": "flattened_calculations_mor_awstest_clust",
"hoodie.datasource.write.partitionpath.field": "observationEndDate",
"hoodie.datasource.hive_sync.partition_fields": "observationEndDate",
"hoodie.insert.shuffle.parallelism": 7050,
"hoodie.bulkinsert.shuffle.parallelism": 7050,
"hoodie.parquet.small.file.limit": 0,
"hoodie.datasource.clustering.inline.enable": "true",
"hoodie.clustering.inline.max.commits": 1,
"hoodie.clustering.plan.strategy.target.file.max.bytes": 1073741824,
"hoodie.clustering.plan.strategy.small.file.limit": 629145600,
"hoodie.cleaner.commits.retained": 1,
"hoodie.keep.min.commits": 2,
"hoodie.compact.inline": "true",
"hoodie.clustering.plan.strategy.sort.columns": "patientIdentifier_identifier_value",
"hoodie.clustering.plan.strategy.daybased.lookback.partitions": 365
```

**Environment Description**

* Hudi version: 0.9.0
* Spark version: 3.1.0
* AWS EMR: 6.5.0
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): NO

**Additional context**

For async clustering via the Hudi utility:

```
sudo -s spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --props s3://**/aws/config/clusteringjob.properties \
  --mode scheduleAndExecute \
  --base-path s3://**/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/ \
  --table-name flattened_calculations_mor_awstest2_s \
  --spark-memory 12g
```

clusteringjob.properties:

```
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=patientIdentifier_identifier_value
```

I am getting the following error:

**Stacktrace**

```
22/02/08 20:17:52 INFO Javalin: Starting Javalin ...
22/02/08 20:17:52 INFO Javalin: Listening on http://localhost:46705/
22/02/08 20:17:52 INFO Javalin: Javalin started in 192ms 💃
22/02/08 20:17:52 INFO S3NativeFileSystem: Opening 's3://*/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/.hoodie/hoodie.properties' for reading
22/02/08 20:17:52 INFO Javalin: Stopping Javalin ...
22/02/08 20:17:52 INFO Javalin: Javalin has stopped
22/02/08 20:17:52 ERROR HoodieClusteringJob: Clustering with basePath: s3://*/aws/ss2/device_observations/flattened_calculations_mor_awstest2_s/data/, tableName: flattened_calculations_mor_awstest2_s, runningMode: scheduleAndExecute failed
22/02/08 20:17:52 INFO AbstractConnector: Stopped Spark@5d66941f{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
22/02/08 20:17:52 INFO SparkUI: Stopped Spark web UI at http://ip-10-57-102-186.ec2.internal:4040/
22/02/08 20:17:52 INFO YarnClientSchedulerBackend: Interrupting m
```
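One common way to bound the memory footprint of a clustering run is to shrink the plan's scope per run instead of clustering all 365 partitions at once. A sketch of such a properties file follows; the first two keys appear in the thread, while `hoodie.clustering.plan.strategy.max.num.groups` and `hoodie.clustering.plan.strategy.max.bytes.per.group` are assumptions that should be verified against the clustering configuration reference for your Hudi version:

```properties
# Illustrative values only; check key names against the Hudi 0.9.0 docs.
hoodie.clustering.async.enabled=true
# Cluster only the most recent partitions each run, not all 365.
hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
# Cap how many file groups and how many bytes one plan may touch
# (assumed keys, not taken from the thread).
hoodie.clustering.plan.strategy.max.num.groups=30
hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
```

Smaller, more frequent clustering runs trade total throughput for a bounded working set per run, which sidesteps the driver-memory question in point 1 above.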