Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105577307 ## CI report: * 81806555cd6c82297f2ff34b81466e653b483a61 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23850) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-4732] Add support for confluent schema registry with proto [hudi]
hudi-bot commented on PR #11070: URL: https://github.com/apache/hudi/pull/11070#issuecomment-2105577215 ## CI report: * 4e15df959494651d74e92f6a998d3310bfc91247 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23790) * 86265f6be7c6fbacf53ed76a7b60b2b64d484409 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23851)
Re: [PR] [HUDI-4732] Add support for confluent schema registry with proto [hudi]
hudi-bot commented on PR #11070: URL: https://github.com/apache/hudi/pull/11070#issuecomment-2105574435 ## CI report: * 4e15df959494651d74e92f6a998d3310bfc91247 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23790) * 86265f6be7c6fbacf53ed76a7b60b2b64d484409 UNKNOWN
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105571932 ## CI report: * 3cba812f7db9eabdec2472351f74c91e14ee3767 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23848) * 81806555cd6c82297f2ff34b81466e653b483a61 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23850)
Re: [PR] [HUDI-7743] Improve StoragePath usages [hudi]
hudi-bot commented on PR #11189: URL: https://github.com/apache/hudi/pull/11189#issuecomment-2105571920 ## CI report: * 975a7d92617080bb4c32e832796e8d13cd8d9857 UNKNOWN * 76ee9ca6a701a2fcaa70fce9aae46864486c8c45 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23849)
Re: [PR] [HUDI-4732] Add support for confluent schema registry with proto [hudi]
yihua commented on code in PR #11070:
URL: https://github.com/apache/hudi/pull/11070#discussion_r1597357603

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ProtoKafkaSource.java: ##
@@ -63,11 +67,18 @@ public ProtoKafkaSource(TypedProperties props, JavaSparkContext sparkContext, Sp
   public ProtoKafkaSource(TypedProperties properties, JavaSparkContext sparkContext, SparkSession sparkSession, HoodieIngestionMetrics metrics, StreamContext streamContext) {
     super(properties, sparkContext, sparkSession, SourceType.PROTO, metrics, new DefaultStreamContext(UtilHelpers.getSchemaProviderForKafkaSource(streamContext.getSchemaProvider(), properties, sparkContext), streamContext.getSourceProfileSupplier()));
-    checkRequiredConfigProperties(props, Collections.singletonList(
-        ProtoClassBasedSchemaProviderConfig.PROTO_SCHEMA_CLASS_NAME));
-    props.put(NATIVE_KAFKA_KEY_DESERIALIZER_PROP, StringDeserializer.class);
-    props.put(NATIVE_KAFKA_VALUE_DESERIALIZER_PROP, ByteArrayDeserializer.class);
-    className = getStringWithAltKeys(props, ProtoClassBasedSchemaProviderConfig.PROTO_SCHEMA_CLASS_NAME);
+    this.deserializerName = ConfigUtils.getStringWithAltKeys(props, KafkaSourceConfig.KAFKA_PROTO_VALUE_DESERIALIZER_CLASS, true);
+    if (!deserializerName.equals(ByteArrayDeserializer.class.getName()) && !deserializerName.equals(KafkaProtobufDeserializer.class.getName())) {
+      throw new HoodieReadFromSourceException("Only ByteArrayDeserializer and KafkaProtobufDeserializer are supported for ProtoKafkaSource");
+    }
+    if (deserializerName.equals(ByteArrayDeserializer.class.getName())) {
+      checkRequiredConfigProperties(props, Collections.singletonList(ProtoClassBasedSchemaProviderConfig.PROTO_SCHEMA_CLASS_NAME));
+      className = getStringWithAltKeys(props, ProtoClassBasedSchemaProviderConfig.PROTO_SCHEMA_CLASS_NAME);
+    } else {
+      className = null;
Review Comment: Avoid using `null`

## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestProtoKafkaSource.java: ##
@@ -64,21 +69,24 @@ import java.util.stream.IntStream;
 import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
+import static org.apache.hudi.utilities.config.KafkaSourceConfig.KAFKA_PROTO_VALUE_DESERIALIZER_CLASS;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 /**
  * Tests against {@link ProtoKafkaSource}.
  */
 public class TestProtoKafkaSource extends BaseTestKafkaSource {
+  private static final JsonFormat.Printer PRINTER = JsonFormat.printer().omittingInsignificantWhitespace();
   private static final Random RANDOM = new Random();
+  private static final String MOCK_REGISTRY_URL = "mock://127.0.0.1:8081";
   protected TypedProperties createPropsForKafkaSource(String topic, Long maxEventsToReadFromKafkaSource, String resetStrategy) {
     TypedProperties props = new TypedProperties();
-    props.setProperty("hoodie.streamer.source.kafka.topic", topic);
+    props.setProperty("hoodie.deltastreamer.source.kafka.topic", topic);
Review Comment: nit: should not be changed.

## hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestProtoKafkaSource.java: ##
@@ -158,7 +187,7 @@ private static List createSampleMessages(int count) {
         .setPrimitiveFixedSignedLong(RANDOM.nextLong())
         .setPrimitiveBoolean(RANDOM.nextBoolean())
         .setPrimitiveString(UUID.randomUUID().toString())
-        .setPrimitiveBytes(ByteString.copyFrom(getUTF8Bytes(UUID.randomUUID().toString())));
+        .setPrimitiveBytes(ByteString.copyFrom(UUID.randomUUID().toString().getBytes()));
Review Comment: similar here
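For the first review comment (avoid `null` for `className`), one option is to model the value as an optional. The following is a minimal sketch only, assuming `java.util.Optional` as a stand-in for whatever option type the project prefers. The class `ProtoClassNameSketch`, the method `resolveClassName`, and the use of `IllegalArgumentException` are hypothetical and are not taken from the PR.

```java
import java.util.Optional;

// Hypothetical illustration; not the code under review in PR #11070.
final class ProtoClassNameSketch {
  // Fully qualified names of the two deserializers named in the review diff.
  static final String BYTE_ARRAY_DESERIALIZER =
      "org.apache.kafka.common.serialization.ByteArrayDeserializer";
  static final String KAFKA_PROTOBUF_DESERIALIZER =
      "io.confluent.kafka.serializers.protobuf.KafkaProtobufDeserializer";

  // Returns the proto schema class name only when the byte-array deserializer
  // is used; the registry-based deserializer needs no class name, so the
  // result is empty instead of null.
  static Optional<String> resolveClassName(String deserializerName, String configuredClassName) {
    if (BYTE_ARRAY_DESERIALIZER.equals(deserializerName)) {
      if (configuredClassName == null || configuredClassName.isEmpty()) {
        throw new IllegalArgumentException("A proto schema class name is required for ByteArrayDeserializer");
      }
      return Optional.of(configuredClassName);
    }
    if (KAFKA_PROTOBUF_DESERIALIZER.equals(deserializerName)) {
      return Optional.empty();
    }
    throw new IllegalArgumentException(
        "Only ByteArrayDeserializer and KafkaProtobufDeserializer are supported");
  }
}
```

Callers would then branch on `isPresent()` rather than on a null check, which keeps the "no class name" case explicit in the type.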
Re: [PR] [HUDI-4732] Add support for confluent schema registry with proto [hudi]
yihua commented on code in PR #11070:
URL: https://github.com/apache/hudi/pull/11070#discussion_r1597356380

## packaging/hudi-utilities-bundle/pom.xml: ##
@@ -133,6 +133,7 @@
 io.confluent:common-config
 io.confluent:common-utils
 io.confluent:kafka-schema-registry-client
+ io.confluent:kafka-protobuf-serializer
Review Comment: I think if we're not using it in the integ tests, we should not add it.
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105545276 ## CI report: * 5739605de6bb73d0e3982a335e243fdb356a6031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23844) * 3cba812f7db9eabdec2472351f74c91e14ee3767 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23848) * 81806555cd6c82297f2ff34b81466e653b483a61 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23850)
Re: [PR] [HUDI-7743] Improve StoragePath usages [hudi]
hudi-bot commented on PR #11189: URL: https://github.com/apache/hudi/pull/11189#issuecomment-2105545250 ## CI report: * 975a7d92617080bb4c32e832796e8d13cd8d9857 UNKNOWN * 51a199199691df091162a3d8cb71f9ee448b079a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23829) * 76ee9ca6a701a2fcaa70fce9aae46864486c8c45 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23849)
Re: [PR] [HUDI-7745] Move Hadoop-dependent util methods to hudi-hadoop-common module [hudi]
hudi-bot commented on PR #11193: URL: https://github.com/apache/hudi/pull/11193#issuecomment-2105540676 ## CI report: * bbbc714d35283e5e743883ae945cfec50f99b226 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23846)
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105540664 ## CI report: * 5739605de6bb73d0e3982a335e243fdb356a6031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23844) * 3cba812f7db9eabdec2472351f74c91e14ee3767 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23848) * 81806555cd6c82297f2ff34b81466e653b483a61 UNKNOWN
Re: [PR] [HUDI-7743] Improve StoragePath usages [hudi]
hudi-bot commented on PR #11189: URL: https://github.com/apache/hudi/pull/11189#issuecomment-2105540642 ## CI report: * 975a7d92617080bb4c32e832796e8d13cd8d9857 UNKNOWN * 51a199199691df091162a3d8cb71f9ee448b079a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23829) * 76ee9ca6a701a2fcaa70fce9aae46864486c8c45 UNKNOWN
Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]
hudi-bot commented on PR #11152: URL: https://github.com/apache/hudi/pull/11152#issuecomment-2105540571 ## CI report: * a5daf71906886e6d8da62abdf2decae1e20b09ef Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23845)
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105518790 ## CI report: * 5739605de6bb73d0e3982a335e243fdb356a6031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23844) * 3cba812f7db9eabdec2472351f74c91e14ee3767 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23848)
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
jonvex commented on code in PR #11192:
URL: https://github.com/apache/hudi/pull/11192#discussion_r1597345837

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ##
@@ -853,7 +852,7 @@ object HoodieBaseRelation extends SparkAdapterSupport {
     val hoodieConfig = new HoodieConfig()
     hoodieConfig.setValue(USE_NATIVE_HFILE_READER, options.getOrElse(USE_NATIVE_HFILE_READER.key(), USE_NATIVE_HFILE_READER.defaultValue().toString))
-    val reader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
+    val reader = (new HoodieSparkIOFactory).getReaderFactory(HoodieRecordType.AVRO)
Review Comment: Yeah, that would achieve the same thing. Do you think that would be better?
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
jonvex commented on code in PR #11192:
URL: https://github.com/apache/hudi/pull/11192#discussion_r1597345901

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkConfUtils.scala: ##
@@ -48,4 +50,10 @@ object HoodieSparkConfUtils {
       .map(HollowCommitHandling.valueOf)
       .getOrElse(HollowCommitHandling.valueOf(INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT.defaultValue))
   }
+
+  def getSparkReaderConfig(): HoodieConfig = {
Review Comment: no. good catch.
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105512736 ## CI report: * 5739605de6bb73d0e3982a335e243fdb356a6031 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23844) * 3cba812f7db9eabdec2472351f74c91e14ee3767 UNKNOWN
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
jonvex commented on code in PR #11192:
URL: https://github.com/apache/hudi/pull/11192#discussion_r1597344058

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala: ##
@@ -64,6 +65,9 @@ class DefaultSource extends RelationProvider
       // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
       spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")
     }
+    //always use spark io factory
+    spark.sparkContext.hadoopConfiguration.set(HoodieStorageConfig.HOODIE_IO_FACTORY_CLASS.key(),
+      classOf[HoodieSparkIOFactory].getName)
Review Comment: SparkSQLWriter also enters here, so I think maybe we need to add the config in deltastreamer as well
(hudi) branch master updated: [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions (#10872)
This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 49072d1e2e7 [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions (#10872)
49072d1e2e7 is described below

commit 49072d1e2e721f27623dba840ad6ea41a252fd15
Author: Vinish Reddy
AuthorDate: Sat May 11 08:50:59 2024 +0530

    [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions (#10872)

    Co-authored-by: Y Ethan Guo
---
 .../hudi/utilities/sources/JsonKafkaSource.java | 18 --
 .../hudi/utilities/streamer/HoodieStreamerUtils.java | 20
 2 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
index 71f0c4db3f1..a8f70e7c854 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
@@ -21,6 +21,8 @@ package org.apache.hudi.utilities.sources;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
 import org.apache.hudi.utilities.UtilHelpers;
 import org.apache.hudi.utilities.config.JsonKafkaPostProcessorConfig;
 import org.apache.hudi.utilities.exception.HoodieSourcePostProcessException;
@@ -43,8 +45,6 @@ import org.apache.spark.streaming.kafka010.LocationStrategies;
 import org.apache.spark.streaming.kafka010.OffsetRange;
 import java.io.IOException;
-import java.util.LinkedList;
-import java.util.List;
 import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_KEY_COLUMN;
@@ -80,28 +80,26 @@ public class JsonKafkaSource extends KafkaSource> {
     return postProcess(maybeAppendKafkaOffsets(kafkaRDD));
   }
-  protected JavaRDD maybeAppendKafkaOffsets(JavaRDD> kafkaRDD) {
+  protected JavaRDD maybeAppendKafkaOffsets(JavaRDD> kafkaRDD) {
     if (this.shouldAddOffsets) {
       return kafkaRDD.mapPartitions(partitionIterator -> {
-        List stringList = new LinkedList<>();
-        ObjectMapper om = new ObjectMapper();
-        partitionIterator.forEachRemaining(consumerRecord -> {
+        ObjectMapper objectMapper = new ObjectMapper();
+        return new CloseableMappingIterator<>(ClosableIterator.wrap(partitionIterator), consumerRecord -> {
           String recordValue = consumerRecord.value().toString();
           String recordKey = StringUtils.objToString(consumerRecord.key());
           try {
-            ObjectNode jsonNode = (ObjectNode) om.readTree(recordValue);
+            ObjectNode jsonNode = (ObjectNode) objectMapper.readTree(recordValue);
             jsonNode.put(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
             jsonNode.put(KAFKA_SOURCE_PARTITION_COLUMN, consumerRecord.partition());
             jsonNode.put(KAFKA_SOURCE_TIMESTAMP_COLUMN, consumerRecord.timestamp());
             if (recordKey != null) {
               jsonNode.put(KAFKA_SOURCE_KEY_COLUMN, recordKey);
             }
-            stringList.add(om.writeValueAsString(jsonNode));
+            return objectMapper.writeValueAsString(jsonNode);
           } catch (Throwable e) {
-            stringList.add(recordValue);
+            return recordValue;
           }
         });
-        return stringList.iterator();
       });
     }
     return kafkaRDD.map(consumerRecord -> (String) consumerRecord.value());
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java
index 2ecf0b02fb6..3be64fefbb3 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamerUtils.java
@@ -31,6 +31,7 @@ import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.model.HoodieSparkRecord;
 import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.ConfigUtils;
 import org.apache.hudi.common.util.Either;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.ClosableIterator;
@@ -55,10 +56,8 @@ import org.apache.spark.sql.avro.HoodieAvroDeserializer;
 import
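The `JsonKafkaSource` change in this commit replaces a per-partition `LinkedList` with a mapping iterator, so each record is transformed only when the consumer pulls it instead of the whole partition being buffered first. The sketch below shows the same idea with plain `java.util.Iterator`. `LazyMappingSketch` and `mapLazily` are hypothetical names chosen for illustration; this is not Hudi's `CloseableMappingIterator`, and it omits the close handling that class presumably provides.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical illustration of lazy per-record mapping; not Hudi code.
final class LazyMappingSketch {
  // Wraps a source iterator so the mapper runs once per next() call,
  // rather than eagerly over every element of the partition.
  static <I, O> Iterator<O> mapLazily(Iterator<I> source, Function<I, O> mapper) {
    return new Iterator<O>() {
      @Override
      public boolean hasNext() {
        return source.hasNext();
      }

      @Override
      public O next() {
        return mapper.apply(source.next());
      }
    };
  }

  public static void main(String[] args) {
    Iterator<String> mapped = mapLazily(Arrays.asList(1, 2, 3).iterator(), n -> "record-" + n);
    // Elements are produced one at a time as the loop advances.
    while (mapped.hasNext()) {
      System.out.println(mapped.next());
    }
  }
}
```

With this shape, memory use per partition is bounded by one record at a time on the mapping side, which appears to be the motivation stated in the commit title ("Avoid collecting records").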
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
yihua merged PR #10872: URL: https://github.com/apache/hudi/pull/10872
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
jonvex commented on code in PR #11192:
URL: https://github.com/apache/hudi/pull/11192#discussion_r1597341038

## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileWriterFactory.java: ##
@@ -43,39 +40,18 @@ public class HoodieFileWriterFactory {
-  private static HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType) {
Review Comment: Not exactly sure what you want
Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]
boneanxs commented on issue #11188:
URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105500024

> > I actually see hudi could set many spark relate configures in SparkConf, most of them are related to parquet reader/writer.
>
> Are these options configurable?

Yes, these configures could be set by users
Re: [PR] [HUDI-7745] Move Hadoop-dependent util methods to hudi-hadoop-common module [hudi]
hudi-bot commented on PR #11193: URL: https://github.com/apache/hudi/pull/11193#issuecomment-2105497625 ## CI report: * bbbc714d35283e5e743883ae945cfec50f99b226 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23846)
Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]
hudi-bot commented on PR #11152: URL: https://github.com/apache/hudi/pull/11152#issuecomment-2105497588 ## CI report: * 3de4b581b5acf16cc256b7d2cce1a43cbd166b28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23831) * a5daf71906886e6d8da62abdf2decae1e20b09ef Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23845)
Re: [PR] [HUDI-7745] Move Hadoop-dependent util methods to hudi-hadoop-common module [hudi]
hudi-bot commented on PR #11193: URL: https://github.com/apache/hudi/pull/11193#issuecomment-2105473886 ## CI report: * bbbc714d35283e5e743883ae945cfec50f99b226 UNKNOWN
Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]
hudi-bot commented on PR #11152: URL: https://github.com/apache/hudi/pull/11152#issuecomment-2105472543 ## CI report: * 3de4b581b5acf16cc256b7d2cce1a43cbd166b28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23831) * a5daf71906886e6d8da62abdf2decae1e20b09ef UNKNOWN
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105473597 ## CI report: * 68d1d5f75238863f544937e050f0f3015f8d7df8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23837) * 5739605de6bb73d0e3982a335e243fdb356a6031 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23844)
[jira] [Created] (HUDI-7746) HadoopConf loses set values when HoodieStorage.getConf is called
Jonathan Vexler created HUDI-7746:
Summary: HadoopConf loses set values when HoodieStorage.getConf is called
Key: HUDI-7746
URL: https://issues.apache.org/jira/browse/HUDI-7746
Project: Apache Hudi
Issue Type: Improvement
Components: reader-core, writer-core
Reporter: Jonathan Vexler

We use StorageConf to hold hoodie.io.factory.class, which is the IOFactory class that should be used for the file reader and writer factories. For now, we have added reflection into HoodieHadoopIOFactory to get around this, but ideally we should not need to do this.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
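To make the role of `hoodie.io.factory.class` concrete, here is a rough sketch of a configuration key that names a factory class which is then instantiated by reflection. Only the key name comes from the issue text. The `IoFactory` interface, the two implementations, the `load` method, and the use of a plain `Map` as the configuration are hypothetical stand-ins and do not reflect Hudi's actual `StorageConf` or IO factory classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a config-selected factory; not Hudi code.
final class IoFactoryLoadingSketch {
  // Key name as quoted in HUDI-7746.
  static final String IO_FACTORY_CLASS_KEY = "hoodie.io.factory.class";

  // Hypothetical factory contract.
  public interface IoFactory {
    String describe();
  }

  public static final class DefaultIoFactory implements IoFactory {
    @Override
    public String describe() {
      return "default";
    }
  }

  public static final class EngineSpecificIoFactory implements IoFactory {
    @Override
    public String describe() {
      return "engine-specific";
    }
  }

  // Looks up the class name under the key, falling back to the default
  // implementation, and creates an instance via its no-arg constructor.
  static IoFactory load(Map<String, String> conf) {
    String className = conf.getOrDefault(IO_FACTORY_CLASS_KEY, DefaultIoFactory.class.getName());
    try {
      return (IoFactory) Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not instantiate IO factory " + className, e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(IO_FACTORY_CLASS_KEY, EngineSpecificIoFactory.class.getName());
    System.out.println(load(conf).describe());
  }
}
```

In a scheme like this, the selected factory depends entirely on the key surviving in the configuration object, so a configuration that drops set values would fall back to the default.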
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105463029 ## CI report: * 68d1d5f75238863f544937e050f0f3015f8d7df8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23837) * 5739605de6bb73d0e3982a335e243fdb356a6031 UNKNOWN
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2105462852 ## CI report: * acbabdc64da321e77aaabd03bcd9d5f3c322c0ec Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23841) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Status: Patch Available (was: In Progress) > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7745: - Labels: hoodie-storage pull-request-available (was: hoodie-storage) > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7745] Move Hadoop-dependent util methods to hudi-hadoop-common module [hudi]
yihua opened a new pull request, #11193: URL: https://github.com/apache/hudi/pull/11193

### Change Logs

This PR moves Hadoop-dependent util methods in `hudi-common` module to `hudi-hadoop-common` module:
- Util methods in `FSUtils` class are moved to `HadoopFSUtils` class
- Util methods in `FileStatusUtils` class are moved to `HadoopFSUtils` class
- Util methods in `ConfigUtils` class are moved to `HadoopConfigUtils` class

### Impact

Towards making `hudi-common` module Hadoop-independent.

### Risk level

none

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]
hudi-bot commented on PR #10763: URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105440971 ## CI report: * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Story Points: 2 (was: 0.5) > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Status: In Progress (was: Open) > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Sprint: Sprint 2023-04-26 > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7739) Shudown asyncDetectorExecutor in AsyncTimelineServerBasedDetectionStrategy
[ https://issues.apache.org/jira/browse/HUDI-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7739. Fix Version/s: 1.0.0 Assignee: Danny Chen Resolution: Fixed Fixed via master branch: 86f7a6554df17ba558428be7c8db6316160a0c82 > Shudown asyncDetectorExecutor in AsyncTimelineServerBasedDetectionStrategy > -- > > Key: HUDI-7739 > URL: https://issues.apache.org/jira/browse/HUDI-7739 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Xinyu Zou >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7739] Shudown asyncDetectorExecutor in AsyncTimelineServerBasedDetectionStrategy [hudi]
danny0405 merged PR #11182: URL: https://github.com/apache/hudi/pull/11182 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated (c7c636c2d18 -> 86f7a6554df)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from c7c636c2d18 [HUDI-7731] Fix usage of new Configuration() in production code (#11191) add 86f7a6554df [HUDI-7739] Shudown asyncDetectorExecutor in AsyncTimelineServerBasedDetectionStrategy (#11182) No new revisions were added by this update. Summary of changes: .../conflict/detection/TimelineServerBasedDetectionStrategy.java | 2 ++ .../java/org/apache/hudi/timeline/service/RequestHandler.java| 9 +++-- .../org/apache/hudi/timeline/service/handlers/MarkerHandler.java | 3 +++ .../marker/AsyncTimelineServerBasedDetectionStrategy.java| 6 ++ 4 files changed, 18 insertions(+), 2 deletions(-)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Epic Link: HUDI-6243 > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7745: --- Assignee: Ethan Guo > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Fix Version/s: 0.15.0 1.0.0 > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Story Points: 0.5 > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
[ https://issues.apache.org/jira/browse/HUDI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7745: Labels: hoodie-storage (was: ) > Move Hadoop-dependent util methods to hudi-hadoop-common > > > Key: HUDI-7745 > URL: https://issues.apache.org/jira/browse/HUDI-7745 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7686) Add util methods for type cast of configuration instances
[ https://issues.apache.org/jira/browse/HUDI-7686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7686. --- Resolution: Fixed > Add util methods for type cast of configuration instances > - > > Key: HUDI-7686 > URL: https://issues.apache.org/jira/browse/HUDI-7686 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > Original Estimate: 0h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7686) Add util methods for type cast of configuration instances
[ https://issues.apache.org/jira/browse/HUDI-7686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7686: Remaining Estimate: 0h Original Estimate: 0h > Add util methods for type cast of configuration instances > - > > Key: HUDI-7686 > URL: https://issues.apache.org/jira/browse/HUDI-7686 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > Original Estimate: 0h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7592) Remove remaining hadoop usage in hudi-common module
[ https://issues.apache.org/jira/browse/HUDI-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7592: --- Assignee: Ethan Guo (was: Jonathan Vexler) > Remove remaining hadoop usage in hudi-common module > --- > > Key: HUDI-7592 > URL: https://issues.apache.org/jira/browse/HUDI-7592 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7731) Fix usage of new Configuration() in production code
[ https://issues.apache.org/jira/browse/HUDI-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7731. --- Resolution: Fixed > Fix usage of new Configuration() in production code > --- > > Key: HUDI-7731 > URL: https://issues.apache.org/jira/browse/HUDI-7731 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > new Configuration() is used in non-test code in several places: > HoodieParquetDataBlock.java > Metrics.java > -- This message was sent by Atlassian Jira (v8.20.10#820010)
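The problem described in this issue can be sketched in plain Java: a component that builds a fresh, empty configuration cannot see settings the caller applied, whereas one that receives the caller's configuration can. The `MetricsLike` class, the `example.metrics.reporter` key, and the `Map` used as a configuration are all hypothetical stand-ins chosen for this example; they are not Hudi or Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfPropagationSketch {
  /** Hypothetical component that reads one setting when it is constructed. */
  static class MetricsLike {
    final String reporter;

    MetricsLike(Map<String, String> conf) {
      // Hypothetical key; falls back to "console" when the key is absent.
      this.reporter = conf.getOrDefault("example.metrics.reporter", "console");
    }
  }

  /** Anti-pattern: a brand-new configuration ignores everything the caller set. */
  static MetricsLike createIgnoringCallerConf() {
    return new MetricsLike(new HashMap<>());
  }

  /** Preferred: the caller's configuration is passed through explicitly. */
  static MetricsLike createWithCallerConf(Map<String, String> callerConf) {
    return new MetricsLike(callerConf);
  }

  public static void main(String[] args) {
    Map<String, String> callerConf = new HashMap<>();
    callerConf.put("example.metrics.reporter", "graphite");

    System.out.println(createIgnoringCallerConf().reporter);
    System.out.println(createWithCallerConf(callerConf).reporter);
  }
}
```

Passing the configuration as a constructor argument makes the dependency visible at every call site, at the cost of touching each caller.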
[jira] [Updated] (HUDI-7731) Fix usage of new Configuration() in production code
[ https://issues.apache.org/jira/browse/HUDI-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7731: Story Points: 2 > Fix usage of new Configuration() in production code > --- > > Key: HUDI-7731 > URL: https://issues.apache.org/jira/browse/HUDI-7731 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > new Configuration() is used in non-test code in several places: > HoodieParquetDataBlock.java > Metrics.java > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7726) Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils
[ https://issues.apache.org/jira/browse/HUDI-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7726. --- Resolution: Fixed > Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils > -- > > Key: HUDI-7726 > URL: https://issues.apache.org/jira/browse/HUDI-7726 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2105414177 ## CI report: * ac7713c64afa1d2406463c8563a065362c95ecda Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23640) * acbabdc64da321e77aaabd03bcd9d5f3c322c0ec Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23841) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]
hudi-bot commented on PR #10763: URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105414125 ## CI report: * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804) * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23840) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7731] Fix usage of new Configuration() in production code (#11191)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new c7c636c2d18 [HUDI-7731] Fix usage of new Configuration() in production code (#11191) c7c636c2d18 is described below commit c7c636c2d18673a41aa0e656b6c7746808d4a001 Author: Jon Vexler AuthorDate: Fri May 10 20:47:33 2024 -0400 [HUDI-7731] Fix usage of new Configuration() in production code (#11191) Co-authored-by: Jonathan Vexler <=> --- .../main/java/org/apache/hudi/client/BaseHoodieClient.java | 2 +- .../apache/hudi/client/timeline/HoodieTimelineArchiver.java | 2 +- .../apache/hudi/client/transaction/lock/LockManager.java| 2 +- .../client/transaction/lock/metrics/HoodieLockMetrics.java | 5 +++-- .../main/java/org/apache/hudi/metrics/HoodieMetrics.java| 5 +++-- .../table/action/compact/RunCompactionActionExecutor.java | 2 +- .../hudi/table/action/index/RunIndexActionExecutor.java | 2 +- .../org/apache/hudi/metrics/TestHoodieConsoleMetrics.java | 5 - .../org/apache/hudi/metrics/TestHoodieGraphiteMetrics.java | 5 - .../java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java | 5 - .../java/org/apache/hudi/metrics/TestHoodieMetrics.java | 5 - .../hudi/metrics/datadog/TestDatadogMetricsReporter.java| 9 ++--- .../test/java/org/apache/hudi/metrics/m3/TestM3Metrics.java | 10 +++--- .../hudi/metrics/prometheus/TestPrometheusReporter.java | 7 +-- .../hudi/metrics/prometheus/TestPushGateWayReporter.java| 13 - .../hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java | 2 +- .../hudi/metadata/JavaHoodieBackedTableMetadataWriter.java | 2 +- .../apache/hudi/client/TestJavaHoodieBackedMetadata.java| 2 +- .../hudi/client/validator/SparkPreCommitValidator.java | 2 +- .../hudi/metadata/SparkHoodieBackedTableMetadataWriter.java | 2 +- .../hudi/client/functional/TestHoodieBackedMetadata.java| 2 +- 
.../java/org/apache/hudi/io/TestHoodieTimelineArchiver.java | 2 +- .../apache/hudi/common/table/log/HoodieLogFormatWriter.java | 2 +- .../hudi/common/table/log/block/HoodieAvroDataBlock.java| 3 ++- .../hudi/common/table/log/block/HoodieCommandBlock.java | 3 ++- .../hudi/common/table/log/block/HoodieCorruptBlock.java | 3 ++- .../apache/hudi/common/table/log/block/HoodieDataBlock.java | 7 --- .../hudi/common/table/log/block/HoodieDeleteBlock.java | 3 ++- .../hudi/common/table/log/block/HoodieHFileDataBlock.java | 4 ++-- .../apache/hudi/common/table/log/block/HoodieLogBlock.java | 2 +- .../hudi/common/table/log/block/HoodieParquetDataBlock.java | 7 ++- .../java/org/apache/hudi/metadata/BaseTableMetadata.java| 3 ++- .../org/apache/hudi/metadata/HoodieMetadataMetrics.java | 5 +++-- .../src/main/java/org/apache/hudi/metrics/Metrics.java | 12 +++- .../apache/hudi/common/functional/TestHoodieLogFormat.java | 2 +- .../hudi/common/table/log/block/TestHoodieDeleteBlock.java | 3 ++- .../procedures/RepairOverwriteHoodiePropsProcedure.scala| 2 +- .../marker/MarkerBasedEarlyConflictDetectionRunnable.java | 6 ++ .../utilities/deltastreamer/HoodieDeltaStreamerMetrics.java | 9 + .../hudi/utilities/ingestion/HoodieIngestionMetrics.java| 10 +++--- .../hudi/utilities/streamer/HoodieStreamerMetrics.java | 11 ++- .../java/org/apache/hudi/utilities/streamer/StreamSync.java | 8 ++-- 42 files changed, 120 insertions(+), 78 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java index fe964db6862..f982a0e4e22 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java @@ -102,7 +102,7 @@ public abstract class BaseHoodieClient implements Serializable, AutoCloseable { this.heartbeatClient = new 
HoodieHeartbeatClient(storage, this.basePath, clientConfig.getHoodieClientHeartbeatIntervalInMs(), clientConfig.getHoodieClientHeartbeatTolerableMisses()); -this.metrics = new HoodieMetrics(config); +this.metrics = new HoodieMetrics(config, context.getStorageConf()); this.txnManager = new TransactionManager(config, storage); this.timeGenerator = TimeGenerators.getTimeGenerator( config.getTimeGeneratorConfig(), HadoopFSUtils.getStorageConf(hadoopConf)); diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java index 175ac5607f4..f4ab6c76e13
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
yihua merged PR #11191: URL: https://github.com/apache/hudi/pull/11191 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
hudi-bot commented on PR #10872: URL: https://github.com/apache/hudi/pull/10872#issuecomment-2105411384 ## CI report: * ac7713c64afa1d2406463c8563a065362c95ecda Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23640) * acbabdc64da321e77aaabd03bcd9d5f3c322c0ec UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]
hudi-bot commented on PR #10763: URL: https://github.com/apache/hudi/pull/10763#issuecomment-2105411335 ## CI report: * 34ffbbc913fab393871b866160ea2a7e1b38c53f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23804) * 6a4b8370ab41ce9060dcd8c7c4ee80786cc086b0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105408982 ## CI report: * 3a85e2b008420a061db66f6946e37234f67dd7ec Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23838) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7508] Avoid collecting records in HoodieStreamerUtils.createHoodieRecords and JsonKafkaSource mapPartitions [hudi]
yihua commented on code in PR #10872: URL: https://github.com/apache/hudi/pull/10872#discussion_r1597306796 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java: ## @@ -57,6 +57,8 @@ */ public class JsonKafkaSource extends KafkaSource { + private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper(); Review Comment: Good point. I think we should revert this change: with the static object, the `ObjectMapper` is still serialized and deserialized between the Spark driver and the executors, so there is not much gain. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated (badcca2ebe8 -> 0d0e27e2b9b)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from badcca2ebe8 [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module (#11190) add 0d0e27e2b9b [HUDI-7673] Fixing false positive validation failure for RLI with MDT validation tool (#11098) No new revisions were added by this update. Summary of changes: .../utilities/HoodieMetadataTableValidator.java| 118 ++--- .../TestHoodieMetadataTableValidator.java | 117 +++- 2 files changed, 194 insertions(+), 41 deletions(-)
Re: [PR] [HUDI-7673] Fixing false positive validation failure for RLI with MDT validation tool [hudi]
yihua commented on code in PR #11098: URL: https://github.com/apache/hudi/pull/11098#discussion_r1597304231 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java: ## @@ -1034,6 +1018,60 @@ private void validateRecordIndexContent(HoodieSparkEngineContext sparkEngineCont } } + @VisibleForTesting + JavaPairRDD> getRecordLocationsFromFSBasedListing(HoodieSparkEngineContext sparkEngineContext, + String basePath, + String latestCompletedCommit) { +return sparkEngineContext.getSqlContext().read().format("hudi") +.option(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT().key(), latestCompletedCommit) +.load(basePath) +.select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD) +.toJavaRDD() +.mapToPair(row -> new Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)), + Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)), + FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD)) +.cache(); + } + + @VisibleForTesting + JavaPairRDD> getRecordLocationsFromRLI(HoodieSparkEngineContext sparkEngineContext, + String basePath, + String latestCompletedCommit) { +return sparkEngineContext.getSqlContext().read().format("hudi") +.load(getMetadataTableBasePath(basePath)) Review Comment: @nsivabalan one thing we can consider as a follow-up is to use the time-travel query on MDT as well (this might not be supported but would be good to have for the validation). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
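The two helpers in the hunk above each produce a mapping from record key to a (partition path, file ID) pair, one from the file-system-based read and one from the record-level index. A minimal sketch of how two such mappings could be compared is shown below, using plain Java collections in place of Spark RDDs. The class name, method name, and sample data are hypothetical, and this is not the comparison logic of `HoodieMetadataTableValidator`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class RecordLocationDiffSketch {
  /**
   * Returns the record keys whose (partition path, file ID) location differs
   * between the two sources, including keys present in only one of them.
   */
  static List<String> findMismatchedKeys(
      Map<String, Map.Entry<String, String>> fromListing,
      Map<String, Map.Entry<String, String>> fromIndex) {
    // Sorted union of keys so the result order is deterministic.
    Set<String> allKeys = new TreeSet<>(fromListing.keySet());
    allKeys.addAll(fromIndex.keySet());

    List<String> mismatched = new ArrayList<>();
    for (String key : allKeys) {
      // A key missing on one side yields null there and counts as a mismatch.
      if (!Objects.equals(fromListing.get(key), fromIndex.get(key))) {
        mismatched.add(key);
      }
    }
    return mismatched;
  }

  public static void main(String[] args) {
    Map<String, Map.Entry<String, String>> fromListing = Map.of(
        "k1", Map.entry("p1", "f1"),
        "k2", Map.entry("p1", "f2"));
    Map<String, Map.Entry<String, String>> fromIndex = Map.of(
        "k1", Map.entry("p1", "f1"),
        "k2", Map.entry("p1", "f9"),
        "k3", Map.entry("p2", "f3"));

    System.out.println(findMismatchedKeys(fromListing, fromIndex)); // prints [k2, k3]
  }
}
```

If the two sides are read at different points in time, keys written in between would show up as mismatches here, which is why reading both sides as of the same completed commit matters for avoiding false positives.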
Re: [PR] [HUDI-7673] Fixing false positive validation failure for RLI with MDT validation tool [hudi]
yihua merged PR #11098: URL: https://github.com/apache/hudi/pull/11098 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]
danny0405 commented on issue #11188: URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105384268 > I actually see hudi could set many spark relate configures in SparkConf, most of them are related to parquet reader/writer. Are these options configurable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7745) Move Hadoop-dependent util methods to hudi-hadoop-common
Ethan Guo created HUDI-7745: --- Summary: Move Hadoop-dependent util methods to hudi-hadoop-common Key: HUDI-7745 URL: https://issues.apache.org/jira/browse/HUDI-7745 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7742) Move Hadoop-dependent reader util classes to hudi-hadoop-common module
[ https://issues.apache.org/jira/browse/HUDI-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7742. --- Resolution: Fixed > Move Hadoop-dependent reader util classes to hudi-hadoop-common module > -- > > Key: HUDI-7742 > URL: https://issues.apache.org/jira/browse/HUDI-7742 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7742) Move Hadoop-dependent reader util classes to hudi-hadoop-common module
[ https://issues.apache.org/jira/browse/HUDI-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7742: Sprint: Sprint 2023-04-26 > Move Hadoop-dependent reader util classes to hudi-hadoop-common module > -- > > Key: HUDI-7742 > URL: https://issues.apache.org/jira/browse/HUDI-7742 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7744) Create HoodieIOFactory and config to set it
[ https://issues.apache.org/jira/browse/HUDI-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7744: Status: In Progress (was: Open) > Create HoodieIOFactory and config to set it > --- > > Key: HUDI-7744 > URL: https://issues.apache.org/jira/browse/HUDI-7744 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core, writer-core >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Create HoodieIOFactory that will give the appropriate reader and writer > factories based on a config. -- This message was sent by Atlassian Jira (v8.20.10#820010)
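One possible shape for a factory that hands out matching reader and writer factories based on a config value is sketched below. All type names, the `example.io.factory.kind` key, and the two implementations are hypothetical and chosen only for illustration. The sketch selects an implementation from a fixed registry, which is an assumption of this example and not necessarily how the change in this issue resolves the factory.

```java
import java.util.Map;
import java.util.function.Supplier;

public class IoFactorySelectionSketch {
  /** Hypothetical product types; strings stand in for real readers and writers. */
  interface ReaderFactory {
    String open(String path);
  }

  interface WriterFactory {
    String create(String path);
  }

  /** Hypothetical top-level factory that yields a consistent reader/writer pair. */
  interface IoFactory {
    ReaderFactory readerFactory();

    WriterFactory writerFactory();
  }

  static final class HadoopLikeIoFactory implements IoFactory {
    @Override
    public ReaderFactory readerFactory() {
      return path -> "hadoop-reader:" + path;
    }

    @Override
    public WriterFactory writerFactory() {
      return path -> "hadoop-writer:" + path;
    }
  }

  static final class NativeLikeIoFactory implements IoFactory {
    @Override
    public ReaderFactory readerFactory() {
      return path -> "native-reader:" + path;
    }

    @Override
    public WriterFactory writerFactory() {
      return path -> "native-writer:" + path;
    }
  }

  /** Registry of known implementations, keyed by a hypothetical config value. */
  static final Map<String, Supplier<IoFactory>> REGISTRY = Map.of(
      "hadoop", HadoopLikeIoFactory::new,
      "native", NativeLikeIoFactory::new);

  /** Picks the factory named by the (hypothetical) config key, defaulting to "hadoop". */
  static IoFactory select(Map<String, String> conf) {
    String kind = conf.getOrDefault("example.io.factory.kind", "hadoop");
    Supplier<IoFactory> supplier = REGISTRY.get(kind);
    if (supplier == null) {
      throw new IllegalArgumentException("Unknown IO factory kind: " + kind);
    }
    return supplier.get();
  }

  public static void main(String[] args) {
    IoFactory factory = select(Map.of("example.io.factory.kind", "native"));
    System.out.println(factory.readerFactory().open("/tmp/a"));
    System.out.println(factory.writerFactory().create("/tmp/b"));
  }
}
```

Grouping the reader and writer factories behind one object keeps callers from pairing a reader from one implementation with a writer from another.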
[jira] [Updated] (HUDI-7743) Fix simple mistakes with StoragePath in production code.
[ https://issues.apache.org/jira/browse/HUDI-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-7743:
    Sprint: Sprint 2023-04-26
> Fix simple mistakes with StoragePath in production code.
>
> Key: HUDI-7743
> URL: https://issues.apache.org/jira/browse/HUDI-7743
> Project: Apache Hudi
> Issue Type: Task
> Components: code-quality
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Fix many simple mistakes with StoragePath such as doing extra conversions, not using util methods etc.
> Don't fix any mistakes in tests for now.
[jira] [Updated] (HUDI-7742) Move Hadoop-dependent reader util classes to hudi-hadoop-common module
[ https://issues.apache.org/jira/browse/HUDI-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-7742:
    Status: Patch Available (was: In Progress)
> Move Hadoop-dependent reader util classes to hudi-hadoop-common module
>
> Key: HUDI-7742
> URL: https://issues.apache.org/jira/browse/HUDI-7742
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
[jira] [Updated] (HUDI-7744) Create HoodieIOFactory and config to set it
[ https://issues.apache.org/jira/browse/HUDI-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-7744:
    Sprint: Sprint 2023-04-26
> Create HoodieIOFactory and config to set it
>
> Key: HUDI-7744
> URL: https://issues.apache.org/jira/browse/HUDI-7744
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, writer-core
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Create HoodieIOFactory that will give the appropriate reader and writer factories based on a config.
[jira] [Updated] (HUDI-7744) Create HoodieIOFactory and config to set it
[ https://issues.apache.org/jira/browse/HUDI-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-7744:
    Status: Patch Available (was: In Progress)
> Create HoodieIOFactory and config to set it
>
> Key: HUDI-7744
> URL: https://issues.apache.org/jira/browse/HUDI-7744
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, writer-core
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Create HoodieIOFactory that will give the appropriate reader and writer factories based on a config.
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
yihua commented on code in PR #11192: URL: https://github.com/apache/hudi/pull/11192#discussion_r1597287731
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkConfUtils.scala: ##
@@ -48,4 +50,10 @@ object HoodieSparkConfUtils {
       .map(HollowCommitHandling.valueOf)
       .getOrElse(HollowCommitHandling.valueOf(INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT.defaultValue))
   }
+
+  def getSparkReaderConfig(): HoodieConfig = {
Review Comment: Is this still needed?
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ##
@@ -853,7 +852,7 @@ object HoodieBaseRelation extends SparkAdapterSupport {
     val hoodieConfig = new HoodieConfig()
     hoodieConfig.setValue(USE_NATIVE_HFILE_READER, options.getOrElse(USE_NATIVE_HFILE_READER.key(), USE_NATIVE_HFILE_READER.defaultValue().toString))
-    val reader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
+    val reader = (new HoodieSparkIOFactory).getReaderFactory(HoodieRecordType.AVRO)
Review Comment: Similar here. If the IO factory class name is already set in the storage config, could we use reflection, i.e., `HoodieIOFactory.getIOFactory(conf)`, to load the `HoodieSparkIOFactory`, which achieves the same behavior?
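The review comment above asks for the factory to be resolved from configuration by reflection rather than instantiated directly. A minimal sketch of that pattern follows, with no Hudi dependency. The names `IoFactory`, `DefaultIoFactory`, `CustomIoFactory`, and the key `io.factory.class` are hypothetical stand-ins chosen for illustration; they are not Hudi classes or config keys, and the real `HoodieIOFactory.getIOFactory` may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

public class IoFactoryDemo {

  /** Hypothetical stand-in for an abstract IO factory; not a Hudi class. */
  public abstract static class IoFactory {
    public abstract String describe();
  }

  /** Used when the configuration does not name a factory class. */
  public static class DefaultIoFactory extends IoFactory {
    public DefaultIoFactory() {}

    @Override
    public String describe() {
      return "default";
    }
  }

  /** An alternative implementation selected purely through configuration. */
  public static class CustomIoFactory extends IoFactory {
    public CustomIoFactory() {}

    @Override
    public String describe() {
      return "custom";
    }
  }

  /** Hypothetical config key, assumed for this sketch only. */
  public static final String FACTORY_CLASS_KEY = "io.factory.class";

  /** Reads the class name from the config (or falls back to the default) and instantiates it. */
  public static IoFactory load(Map<String, String> conf) {
    String className = conf.getOrDefault(FACTORY_CLASS_KEY, DefaultIoFactory.class.getName());
    try {
      Class<?> clazz = Class.forName(className);
      // A public no-arg constructor is required on every implementation.
      return (IoFactory) clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Unable to create " + className, e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(load(conf).describe());
    conf.put(FACTORY_CLASS_KEY, CustomIoFactory.class.getName());
    System.out.println(load(conf).describe());
  }
}
```

With this shape, callers depend only on the abstract type and the config, so a module such as a Spark integration could supply its own factory by setting the class name instead of calling a constructor.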
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
yihua commented on code in PR #11192: URL: https://github.com/apache/hudi/pull/11192#discussion_r1597285676 ## hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/HoodieHadoopIOFactory.java: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.storage; + +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.io.hadoop.HoodieAvroFileReaderFactory; +import org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory; + +public class HoodieHadoopIOFactory extends HoodieIOFactory { Review Comment: Javadocs on what is returned, and what the Avro vs Spark record type means. ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java: ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.storage; + +import org.apache.hudi.common.config.HoodieStorageConfig; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.storage.StorageConfiguration; + +/** + * Base class to get HoodieFileReaderFactory and HoodieFileWriterFactory + */ +public abstract class HoodieIOFactory { + + public static HoodieIOFactory getIOFactory(StorageConfiguration storageConf) { +String ioFactoryClass = storageConf.getString(HoodieStorageConfig.HOODIE_IO_FACTORY_CLASS.key()) +.orElse(HoodieStorageConfig.HOODIE_IO_FACTORY_CLASS.defaultValue()); +return getIOFactory(ioFactoryClass); + } + + private static HoodieIOFactory getIOFactory(String ioFactoryClass) { +try { + Class clazz = + ReflectionUtils.getClass(ioFactoryClass); + return (HoodieIOFactory) clazz.newInstance(); +} catch (IllegalArgumentException | IllegalAccessException | InstantiationException e) { + throw new HoodieException("Unable to create " + ioFactoryClass, e); +} + } + + public HoodieFileReaderFactory getReaderFactory(HoodieRecord.HoodieRecordType recordType) { Review Comment: I'm wondering for different record type, should they be using two different `HoodieIOFactory` implementation instead one implementation class redirecting to different reader/writer factories internally? 
## hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/HoodieHadoopIOFactory.java: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.storage; + +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.io.hadoop.HoodieAvroFileReaderFactory; +import
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
yihua commented on code in PR #11192: URL: https://github.com/apache/hudi/pull/11192#discussion_r1597281157 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileWriterFactory.java: ## @@ -43,39 +40,18 @@ public class HoodieFileWriterFactory { - private static HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType) { Review Comment: Let's create a JIRA ticket to make the `HoodieFileReaderFactory` and `HoodieFileWriterFactory` interface or abstract class so APIs to create new reader and writer should be abstract. ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkIOFactory.java: ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.io.storage; + +import org.apache.hudi.common.model.HoodieRecord; + +public class HoodieSparkIOFactory extends HoodieHadoopIOFactory { + Review Comment: nit: empty line ## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java: ## @@ -235,6 +235,13 @@ public class HoodieStorageConfig extends HoodieConfig { + "and it is loaded at runtime. 
This is only required when trying to " + "override the existing write context when `hoodie.datasource.write.row.writer.enable=true`."); + public static final ConfigProperty HOODIE_IO_FACTORY_CLASS = ConfigProperty + .key("hoodie.io.factory.class") + .defaultValue("org.apache.hudi.io.storage.HoodieHadoopIOFactory") + .markAdvanced() + .sinceVersion("0.15.0") + .withDocumentation("Provided class should implement `org.apache.hudi.io.storage.HoodieIOFactory`"); Review Comment: ```suggestion .withDocumentation("The fully-qualified class name of the factory class to return readers and writers of files used by Hudi. The provided class should implement `org.apache.hudi.io.storage.HoodieIOFactory`"); ``` ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java: ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.io.storage; + +import org.apache.hudi.common.config.HoodieStorageConfig; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.storage.StorageConfiguration; + +/** + * Base class to get HoodieFileReaderFactory and HoodieFileWriterFactory + */ +public abstract class HoodieIOFactory { + + public static HoodieIOFactory getIOFactory(StorageConfiguration storageConf) { +String ioFactoryClass = storageConf.getString(HoodieStorageConfig.HOODIE_IO_FACTORY_CLASS.key()) +.orElse(HoodieStorageConfig.HOODIE_IO_FACTORY_CLASS.defaultValue()); +return getIOFactory(ioFactoryClass); + } + + private static HoodieIOFactory getIOFactory(String ioFactoryClass) { +try { + Class clazz = + ReflectionUtils.getClass(ioFactoryClass); + return (HoodieIOFactory) clazz.newInstance(); +} catch (IllegalArgumentException | IllegalAccessException | InstantiationException e) { + throw new HoodieException("Unable to create " + ioFactoryClass, e); +} Review Comment: Use `ReflectionUtils#loadClass` ##
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105351756 ## CI report: * ab418b95d057737b34fe1314e550bee213e1d2b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23835) * 3a85e2b008420a061db66f6946e37234f67dd7ec Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23838)
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105347053 ## CI report: * 68d1d5f75238863f544937e050f0f3015f8d7df8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23837)
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105347026 ## CI report: * ab418b95d057737b34fe1314e550bee213e1d2b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23835) * 3a85e2b008420a061db66f6946e37234f67dd7ec UNKNOWN
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105310136 ## CI report: * 68d1d5f75238863f544937e050f0f3015f8d7df8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23837)
Re: [PR] [HUDI-7743] Fix Simple Mistakes with StoragePath [hudi]
yihua commented on code in PR #11189: URL: https://github.com/apache/hudi/pull/11189#discussion_r1597244371
## hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java: ##
@@ -123,7 +121,7 @@ public String addPartitionMeta(
         client.getActiveTimeline().getCommitTimeline().lastInstant().get().getTimestamp();
     List partitionPaths = FSUtils.getAllPartitionFoldersThreeLevelsDown(HoodieCLI.storage, client.getBasePath());
-    StoragePath basePath = new StoragePath(client.getBasePath());
+    StoragePath basePath = client.getBasePathV2();
Review Comment: Could you create a JIRA to remove `getBasePathV2()` and return `StoragePath` from `getBasePath()` as a follow-up?
## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java: ##
@@ -294,11 +294,20 @@ public HoodieTableType getTableType() {
   /**
    * @return Meta path
+   * @deprecated please use {@link #getMetaPathV2()}
    */
+  @Deprecated
Review Comment: Can we directly change this method to return `StoragePath` and `metaPath`?
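The change discussed above replaces a string round trip (`new StoragePath(client.getBasePath())`) with a typed accessor, and deprecates the string-returning method. A minimal sketch of that deprecate-and-delegate pattern is below. It uses `java.nio.file.Path` as a stand-in for `StoragePath`, and `MetaClientSketch` is a hypothetical class invented for illustration; neither reflects the actual `HoodieTableMetaClient` implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class MetaClientSketch {

  // The typed path is the single source of truth.
  private final Path basePath;

  public MetaClientSketch(String basePath) {
    this.basePath = Paths.get(basePath);
  }

  /**
   * @return base path as a string
   * @deprecated use {@link #getBasePathV2()} to avoid converting to a string and back
   */
  @Deprecated
  public String getBasePath() {
    return basePath.toString();
  }

  /** Returns the typed path directly, with no conversion. */
  public Path getBasePathV2() {
    return basePath;
  }

  public static void main(String[] args) {
    MetaClientSketch client = new MetaClientSketch("/tmp/hudi_table");
    // Before: rebuild the typed path from its string form.
    Path roundTripped = Paths.get(client.getBasePath());
    // After: take the typed path as-is.
    Path direct = client.getBasePathV2();
    System.out.println(roundTripped.equals(direct));
  }
}
```

The follow-up the reviewer requests would be the final step of such a migration: once callers have moved over, the original method name can return the typed path and the `V2` variant can be dropped.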
Re: [PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
hudi-bot commented on PR #11192: URL: https://github.com/apache/hudi/pull/11192#issuecomment-2105303945 ## CI report: * 68d1d5f75238863f544937e050f0f3015f8d7df8 UNKNOWN
Re: [PR] [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module [hudi]
yihua merged PR #11190: URL: https://github.com/apache/hudi/pull/11190
Re: [PR] [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module [hudi]
yihua commented on code in PR #11190: URL: https://github.com/apache/hudi/pull/11190#discussion_r1597239746
## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ##
@@ -121,7 +122,7 @@ protected byte[] serializeRecords(List records) throws IOException
     HFileContext context = new HFileContextBuilder()
         .withBlockSize(DEFAULT_BLOCK_SIZE)
         .withCompression(compressionAlgorithm.get())
-        .withCellComparator(new HoodieHBaseKVComparator())
+        .withCellComparator(ReflectionUtils.loadClass(KV_COMPARATOR_CLASS_NAME))
Review Comment: Yes.
(hudi) branch master updated: [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module (#11190)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new badcca2ebe8 [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module (#11190) badcca2ebe8 is described below commit badcca2ebe8c30efa3fc13cad4c3f0114101874a Author: Y Ethan Guo AuthorDate: Fri May 10 14:20:00 2024 -0700 [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module (#11190) --- .../action/bootstrap/OrcBootstrapMetadataHandler.java | 2 +- .../common/table/log/block/HoodieHFileDataBlock.java| 5 +++-- .../hudi/common/testutils/HoodieTestDataGenerator.java | 4 .../java/org/apache/hudi/common/util/AvroOrcUtils.java | 0 .../main/java/org/apache/hudi/common/util/OrcUtils.java | 1 + .../org/apache/hudi/io/hadoop/HoodieAvroOrcReader.java | 1 - .../org/apache/hudi/io/hadoop}/OrcReaderIterator.java | 17 ++--- .../apache/hudi/io/storage/HoodieHBaseKVComparator.java | 0 .../parquet/avro/HoodieAvroParquetReaderBuilder.java| 0 .../org/apache/parquet/avro/HoodieAvroReadSupport.java | 0 .../org/apache/hudi/common/util/TestAvroOrcUtils.java | 4 .../apache/hudi/io/hadoop}/TestOrcReaderIterator.java | 17 ++--- .../org/apache/hudi/functional/TestOrcBootstrap.java| 2 +- .../deltastreamer/HoodieDeltaStreamerTestBase.java | 3 ++- .../hudi/utilities/testutils/UtilitiesTestBase.java | 3 ++- 15 files changed, 34 insertions(+), 25 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/OrcBootstrapMetadataHandler.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/OrcBootstrapMetadataHandler.java index 2d4457d575b..86944ae3f5b 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/OrcBootstrapMetadataHandler.java +++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/OrcBootstrapMetadataHandler.java @@ -25,11 +25,11 @@ import org.apache.hudi.common.model.HoodieKey; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType; import org.apache.hudi.common.util.AvroOrcUtils; -import org.apache.hudi.common.util.OrcReaderIterator; import org.apache.hudi.common.util.queue.HoodieExecutor; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.exception.HoodieException; import org.apache.hudi.io.HoodieBootstrapHandle; +import org.apache.hudi.io.hadoop.OrcReaderIterator; import org.apache.hudi.keygen.KeyGeneratorInterface; import org.apache.hudi.storage.StoragePath; import org.apache.hudi.table.HoodieTable; diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java index a379e305d0e..0893637b956 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java @@ -26,6 +26,7 @@ import org.apache.hudi.common.model.HoodieFileFormat; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType; import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; import org.apache.hudi.common.util.collection.ClosableIterator; import org.apache.hudi.common.util.collection.CloseableMappingIterator; import org.apache.hudi.exception.HoodieIOException; @@ -33,7 +34,6 @@ import org.apache.hudi.io.SeekableDataInputStream; import org.apache.hudi.io.storage.HoodieAvroHFileReaderImplBase; import org.apache.hudi.io.storage.HoodieFileReader; import org.apache.hudi.io.storage.HoodieFileReaderFactory; -import 
org.apache.hudi.io.storage.HoodieHBaseKVComparator; import org.apache.hudi.storage.HoodieStorage; import org.apache.hudi.storage.HoodieStorageUtils; import org.apache.hudi.storage.StorageConfiguration; @@ -76,6 +76,7 @@ import static org.apache.hudi.common.util.ValidationUtils.checkState; public class HoodieHFileDataBlock extends HoodieDataBlock { private static final Logger LOG = LoggerFactory.getLogger(HoodieHFileDataBlock.class); private static final int DEFAULT_BLOCK_SIZE = 1024 * 1024; + private static final String KV_COMPARATOR_CLASS_NAME = "org.apache.hudi.io.storage.HoodieHBaseKVComparator"; private final Option compressionAlgorithm; // This path is used for constructing HFile reader context, which should not be @@ -121,7 +122,7 @@ public class HoodieHFileDataBlock extends HoodieDataBlock { HFileContext context = new HFileContextBuilder() .withBlockSize(DEFAULT_BLOCK_SIZE)
(hudi) branch master updated: [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils (#11185)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 23b283acf3e [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils (#11185) 23b283acf3e is described below commit 23b283acf3e4c30e26652edf9c710e17e47951c5 Author: Jon Vexler AuthorDate: Fri May 10 17:19:23 2024 -0400 [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils (#11185) Co-authored-by: Jonathan Vexler <=> Co-authored-by: Y Ethan Guo --- .../hudi/cli/commands/HoodieLogFileCommand.java| 15 +-- .../hudi/io/HoodieKeyLocationFetchHandle.java | 7 +- .../hudi/client/TestJavaHoodieBackedMetadata.java | 12 +- .../testutils/HoodieJavaClientTestHarness.java | 10 +- .../functional/TestHoodieBackedMetadata.java | 12 +- .../functional/TestHoodieBackedTableMetadata.java | 7 +- .../hudi/common/model/HoodiePartitionMetadata.java | 2 +- .../hudi/common/table/TableSchemaResolver.java | 122 +++ .../org/apache/hudi/common/util/BaseFileUtils.java | 11 +- .../hudi/table/catalog/TableOptionProperties.java | 4 +- .../common/table/ParquetTableSchemaResolver.java | 66 +++ .../org/apache/hudi/common/util/HFileUtils.java| 130 + .../hudi/common/table/TestTableSchemaResolver.java | 7 +- .../ShowHoodieLogFileMetadataProcedure.scala | 3 +- .../ShowHoodieLogFileRecordsProcedure.scala| 9 +- .../apache/hudi/sync/common/HoodieSyncClient.java | 6 +- .../utilities/HoodieMetadataTableValidator.java| 8 +- 17 files changed, 259 insertions(+), 172 deletions(-) diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java index 367dc2302ee..d3c30143072 100644 --- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java +++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java @@ -49,8 +49,6 @@ import org.apache.hudi.storage.StoragePath; import com.fasterxml.jackson.databind.ObjectMapper; import org.apache.avro.Schema; import org.apache.avro.generic.IndexedRecord; -import org.apache.parquet.avro.AvroSchemaConverter; -import org.apache.parquet.schema.MessageType; import org.springframework.shell.standard.ShellComponent; import org.springframework.shell.standard.ShellMethod; import org.springframework.shell.standard.ShellOption; @@ -109,9 +107,7 @@ public class HoodieLogFileCommand { } else { fileName = path.getName(); } - MessageType schema = TableSchemaResolver.readSchemaFromLogFile(storage, path); - Schema writerSchema = schema != null - ? new AvroSchemaConverter().convert(Objects.requireNonNull(schema)) : null; + Schema writerSchema = TableSchemaResolver.readSchemaFromLogFile(storage, path); try (Reader reader = HoodieLogFormat.newReader(storage, new HoodieLogFile(path), writerSchema)) { // read the avro blocks @@ -213,14 +209,13 @@ public class HoodieLogFileCommand { checkArgument(logFilePaths.size() > 0, "There is no log file"); // TODO : readerSchema can change across blocks/log files, fix this inside Scanner -AvroSchemaConverter converter = new AvroSchemaConverter(); Schema readerSchema = null; // get schema from last log file for (int i = logFilePaths.size() - 1; i >= 0; i--) { - MessageType schema = TableSchemaResolver.readSchemaFromLogFile( + Schema schema = TableSchemaResolver.readSchemaFromLogFile( storage, new StoragePath(logFilePaths.get(i))); if (schema != null) { -readerSchema = converter.convert(schema); +readerSchema = schema; break; } } @@ -257,10 +252,8 @@ public class HoodieLogFileCommand { } } else { for (String logFile : logFilePaths) { -MessageType schema = TableSchemaResolver.readSchemaFromLogFile( +Schema writerSchema = TableSchemaResolver.readSchemaFromLogFile( client.getStorage(), new StoragePath(logFile)); -Schema writerSchema = 
schema != null -? new AvroSchemaConverter().convert(Objects.requireNonNull(schema)) : null; try (HoodieLogFormat.Reader reader = HoodieLogFormat.newReader(storage, new HoodieLogFile(new StoragePath(logFile)), writerSchema)) { // read the avro blocks diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieKeyLocationFetchHandle.java index 30e2437485e..f05a0af3449 100644 ---
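The `HoodieLogFileCommand` hunk in the commit above picks the reader schema by scanning the log files from last to first and stopping at the first file that yields a schema. A minimal, dependency-free sketch of that scan follows. `LastSchemaSketch` and the `readSchema` function are hypothetical: `readSchema` stands in for a call such as `TableSchemaResolver.readSchemaFromLogFile(storage, path)`, which per the diff now returns an Avro `Schema` directly and may return null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class LastSchemaSketch {

  /**
   * Walks the paths from last to first and returns the first non-null schema.
   * The readSchema function is an assumed stand-in for the real schema reader.
   */
  public static <S> Optional<S> latestSchema(List<String> logFilePaths, Function<String, S> readSchema) {
    for (int i = logFilePaths.size() - 1; i >= 0; i--) {
      S schema = readSchema.apply(logFilePaths.get(i));
      if (schema != null) {
        return Optional.of(schema);
      }
    }
    // No log file carried a schema.
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList("log.1", "log.2", "log.3");
    // Pretend only log.2 carries a schema; log.3 is checked first and skipped.
    Optional<String> schema = latestSchema(paths, p -> p.equals("log.2") ? "schema-v2" : null);
    System.out.println(schema.orElse("none"));
  }
}
```

As the TODO in the diff notes, the reader schema can change across blocks and log files, so taking only the most recent one is a simplification.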
Re: [PR] [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils [hudi]
yihua merged PR #11185: URL: https://github.com/apache/hudi/pull/11185
Re: [PR] [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils [hudi]
yihua commented on code in PR #11185: URL: https://github.com/apache/hudi/pull/11185#discussion_r1597236327
## hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java: ##
@@ -300,21 +273,6 @@ private Option getTableParquetSchemaFromDataFile() {
     }
   }
-  public static MessageType convertAvroSchemaToParquet(Schema schema, Configuration hadoopConf) {
Review Comment: Our functional tests cover a few schema evolution cases that execute the logic in `TableSchemaResolver`. Still, we should do more testing before the release to make sure everything still works.
[jira] [Updated] (HUDI-7744) Create HoodieIOFactory and config to set it
[ https://issues.apache.org/jira/browse/HUDI-7744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7744:
---------------------------------
    Labels: pull-request-available (was: )

> Create HoodieIOFactory and config to set it
> -------------------------------------------
>
> Key: HUDI-7744
> URL: https://issues.apache.org/jira/browse/HUDI-7744
> Project: Apache Hudi
> Issue Type: Improvement
> Components: reader-core, writer-core
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Create HoodieIOFactory that will give the appropriate reader and writer
> factories based on a config.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7744] Introduce IOFactory and a config to set the factory [hudi]
jonvex opened a new pull request, #11192:
URL: https://github.com/apache/hudi/pull/11192

### Change Logs

Remove the base static methods in the reader and writer factories that create them. Reader and writer factories will instead be created by HoodieIOFactory.getIOFactory(). In the Spark modules, we will use the Spark IO factory directly.

### Impact

An IO factory for different file systems is now possible.

### Risk level (write none, low medium or high below)

low

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
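To illustrate the idea in the description above (a single IO factory, selected by a config, that hands out the reader and writer factories), here is a minimal sketch in plain Java. It is not the code from the PR: the class names `IoFactory`, `DefaultIoFactory`, and `EngineIoFactory`, the config key `example.io.factory.class`, and the string return values are all hypothetical stand-ins. Only the general technique (reading a class name from a config and instantiating it reflectively) is shown.

```java
import java.util.Properties;

public class IoFactorySketch {

  /** Hypothetical stand-in for a factory that hands out reader and writer factories. */
  public abstract static class IoFactory {
    public abstract String readerFactoryName();

    public abstract String writerFactoryName();
  }

  /** Hypothetical default implementation, used when no class is configured. */
  public static class DefaultIoFactory extends IoFactory {
    @Override
    public String readerFactoryName() {
      return "default-reader-factory";
    }

    @Override
    public String writerFactoryName() {
      return "default-writer-factory";
    }
  }

  /** Hypothetical engine-specific implementation, selected through the config. */
  public static class EngineIoFactory extends IoFactory {
    @Override
    public String readerFactoryName() {
      return "engine-reader-factory";
    }

    @Override
    public String writerFactoryName() {
      return "engine-writer-factory";
    }
  }

  // Hypothetical config key; the real key name in Hudi is not given in this thread.
  public static final String IO_FACTORY_CLASS_KEY = "example.io.factory.class";

  /** Instantiates the configured factory class, falling back to the default one. */
  public static IoFactory getIoFactory(Properties props) {
    String className = props.getProperty(IO_FACTORY_CLASS_KEY, DefaultIoFactory.class.getName());
    try {
      // Requires the configured class to have an accessible no-argument constructor.
      return (IoFactory) Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate IO factory " + className, e);
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(getIoFactory(props).readerFactoryName());

    props.setProperty(IO_FACTORY_CLASS_KEY, EngineIoFactory.class.getName());
    System.out.println(getIoFactory(props).readerFactoryName());
  }
}
```

Callers depend only on the abstract factory, so an engine module can supply its own implementation by setting the config value, without callers going through static creation methods.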
[jira] [Created] (HUDI-7744) Create HoodieIOFactory and config to set it
Jonathan Vexler created HUDI-7744:
----------------------------------

Summary: Create HoodieIOFactory and config to set it
Key: HUDI-7744
URL: https://issues.apache.org/jira/browse/HUDI-7744
Project: Apache Hudi
Issue Type: Improvement
Components: reader-core, writer-core
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler
Fix For: 0.15.0, 1.0.0

Create HoodieIOFactory that will give the appropriate reader and writer factories based on a config.
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105237449 ## CI report: * ab418b95d057737b34fe1314e550bee213e1d2b0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23835)
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105174009 ## CI report: * b8b11fa3ea8a47d89621b3c130e1bbb8066d7c4c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23834) * ab418b95d057737b34fe1314e550bee213e1d2b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23835)
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105162966 ## CI report: * b8b11fa3ea8a47d89621b3c130e1bbb8066d7c4c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23834) * ab418b95d057737b34fe1314e550bee213e1d2b0 UNKNOWN
Re: [PR] [HUDI-7742] Move Hadoop-dependent reader util classes to hudi-hadoop-common module [hudi]
hudi-bot commented on PR #11190: URL: https://github.com/apache/hudi/pull/11190#issuecomment-2105162884 ## CI report: * 2132f3e951ec684176e7ce6aefa3c8c467849dab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23833)
Re: [PR] [HUDI-7726] Restructure TableSchemaResolver to separate Hadoop logic and use BaseFileUtils [hudi]
hudi-bot commented on PR #11185: URL: https://github.com/apache/hudi/pull/11185#issuecomment-2105162785 ## CI report: * 76dc076a65432684c5217f12c264edb7cd50d9e9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23832)
Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]
hudi-bot commented on PR #11152: URL: https://github.com/apache/hudi/pull/11152#issuecomment-2105162615 ## CI report: * 3de4b581b5acf16cc256b7d2cce1a43cbd166b28 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23831)
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105105997 ## CI report: * b8b11fa3ea8a47d89621b3c130e1bbb8066d7c4c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23834)
Re: [PR] [HUDI-7731] Fix usage of new Configuration() in production code [hudi]
hudi-bot commented on PR #11191: URL: https://github.com/apache/hudi/pull/11191#issuecomment-2105096635 ## CI report: * b8b11fa3ea8a47d89621b3c130e1bbb8066d7c4c UNKNOWN
Re: [PR] [HUDI-7743] Fix Simple Mistakes with StoragePath [hudi]
hudi-bot commented on PR #11189: URL: https://github.com/apache/hudi/pull/11189#issuecomment-2105087264 ## CI report: * 975a7d92617080bb4c32e832796e8d13cd8d9857 UNKNOWN * 51a199199691df091162a3d8cb71f9ee448b079a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23829)
[jira] [Updated] (HUDI-7731) Fix usage of new Configuration() in production code
[ https://issues.apache.org/jira/browse/HUDI-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Vexler updated HUDI-7731:
----------------------------------
    Status: Patch Available (was: In Progress)

> Fix usage of new Configuration() in production code
> ---------------------------------------------------
>
> Key: HUDI-7731
> URL: https://issues.apache.org/jira/browse/HUDI-7731
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> new Configuration() is used in non-test code in several places:
> HoodieParquetDataBlock.java
> Metrics.java
[jira] [Updated] (HUDI-7731) Fix usage of new Configuration() in production code
[ https://issues.apache.org/jira/browse/HUDI-7731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7731:
---------------------------------
    Labels: pull-request-available (was: )

> Fix usage of new Configuration() in production code
> ---------------------------------------------------
>
> Key: HUDI-7731
> URL: https://issues.apache.org/jira/browse/HUDI-7731
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> new Configuration() is used in non-test code in several places:
> HoodieParquetDataBlock.java
> Metrics.java