[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219064378

## CI report:

* 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797)
* 76094582239330262bac8c06a78b59d8abf2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10801)

## Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219061939

## CI report:

* 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797)
* 76094582239330262bac8c06a78b59d8abf2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
hudi-bot commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219059120

## CI report:

* 79ae2b395759dcc104cb5c95834f8494026ff04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10788)
* 513b7d8c4d6209f62bd0075de4cf1a07113b0fb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10800)
[GitHub] [hudi] YannByron closed pull request #6264: [HUDI-4503] support for parsing identifier with catalog
YannByron closed pull request #6264: [HUDI-4503] support for parsing identifier with catalog URL: https://github.com/apache/hudi/pull/6264
[GitHub] [hudi] hudi-bot commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
hudi-bot commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219056857

## CI report:

* 79ae2b395759dcc104cb5c95834f8494026ff04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10788)
* 513b7d8c4d6209f62bd0075de4cf1a07113b0fb9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219054346

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10798)
[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table
[ https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4586:
---------------------------------
    Labels: pull-request-available  (was: )

> Address S3 timeouts in Bloom Index with metadata table
> ------------------------------------------------------
>
>                 Key: HUDI-4586
>                 URL: https://issues.apache.org/jira/browse/HUDI-4586
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For partitioned tables, a significant number of S3 requests time out, causing upserts to fail when using Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file
>     at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at
[GitHub] [hudi] yihua opened a new pull request, #6432: [WIP][HUDI-4586] Improve metadata fetching in bloom index
yihua opened a new pull request, #6432: URL: https://github.com/apache/hudi/pull/6432

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

**Risk level: none | low | medium | high**

_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] alexeykudinkin commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
alexeykudinkin commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219045573

CI test failures are unrelated (Flink ITs failing).

https://user-images.githubusercontent.com/428277/185299067-e45900fd-3bf7-40ad-9c86-f0b6dd221c11.png
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
alexeykudinkin commented on code in PR #6386: URL: https://github.com/apache/hudi/pull/6386#discussion_r948653875

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/PulsarSource.java:

## @@ -0,0 +1,292 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.HoodieConversionUtils;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.util.Lazy;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.pulsar.client.api.Consumer;
+import org.apache.pulsar.client.api.MessageId;
+import org.apache.pulsar.client.api.PulsarClient;
+import org.apache.pulsar.client.api.PulsarClientException;
+import org.apache.pulsar.client.api.SubscriptionInitialPosition;
+import org.apache.pulsar.client.api.SubscriptionType;
+import org.apache.pulsar.client.impl.PulsarClientImpl;
+import org.apache.pulsar.common.naming.TopicName;
+import org.apache.pulsar.shade.io.netty.channel.EventLoopGroup;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.pulsar.JsonUtils;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+
+import static org.apache.hudi.common.util.ThreadUtils.collectActiveThreads;
+
+/**
+ * Source fetching data from Pulsar topics
+ */
+public class PulsarSource extends RowSource implements Closeable {
+
+  private static final Logger LOG = LogManager.getLogger(PulsarSource.class);
+
+  private static final String HUDI_PULSAR_CONSUMER_ID_FORMAT = "hudi-pulsar-consumer-%d";
+  private static final String[] PULSAR_META_FIELDS = new String[]{
+      "__key",
+      "__topic",
+      "__messageId",
+      "__publishTime",
+      "__eventTime",
+      "__messageProperties"
+  };
+
+  private final String topicName;
+
+  private final String serviceEndpointURL;
+  private final String adminEndpointURL;
+
+  // NOTE: We're keeping the client so that we can shut it down properly
+  private final Lazy<PulsarClient> pulsarClient;
+  private final Lazy<Consumer<byte[]>> pulsarConsumer;
+
+  public PulsarSource(TypedProperties props,
+                      JavaSparkContext sparkContext,
+                      SparkSession sparkSession,
+                      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+
+    DataSourceUtils.checkRequiredProperties(props,
+        Arrays.asList(
+            Config.PULSAR_SOURCE_TOPIC_NAME.key(),
+            Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key()));
+
+    // Converting to a descriptor allows us to canonicalize the topic's name properly
+    this.topicName = TopicName.get(props.getString(Config.PULSAR_SOURCE_TOPIC_NAME.key())).toString();
+
+    // TODO validate endpoints provided in the appropriate format
+    this.serviceEndpointURL = props.getString(Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key());
+    this.adminEndpointURL = props.getString(Config.PULSAR_SOURCE_ADMIN_ENDPOINT_URL.key());
+
+    this.pulsarClient = Lazy.lazily(this::initPulsarClient);
+    this.pulsarConsumer = Lazy.lazily(this::subscribeToTopic);
+  }
+
+  @Override
+  protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCheckpointStr, long sourceLimit) {
+    Pair<MessageId, MessageId> startingEndingOffsetsPair = computeOffsets(lastCheckpointStr, sourceLimit);
+
+    MessageId startingOffset = startingEndingOffsetsPair.getLeft();
+    MessageId endingOffset = startingEndingOffsetsPair.getRight();
+
+    String startingOffsetStr = convertToOffsetString(topicName, startingOffset);
+    String endingOffsetStr = convertToOffsetString(topicName, endingOffset);
+
+    Dataset<Row> sourceRows = sparkSession.read()
+        .format("pulsar")
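The `fetchNextBatch` method quoted above resolves an optional checkpoint string into a (starting, ending) offset pair before reading, and returns the new checkpoint alongside the batch. A minimal, self-contained sketch of that checkpoint-to-offset-range pattern, deliberately free of the Hudi/Pulsar types (plain `long` offsets instead of `MessageId`; all names below are illustrative, not Hudi's API):

```java
import java.util.Optional;

// Simplified stand-in for the checkpoint handling in a pull-based source:
// resolve the starting offset from an optional checkpoint string, cap the
// read at a source limit, and emit the next checkpoint from the end offset.
class OffsetRange {
  final long start; // inclusive
  final long end;   // exclusive

  private OffsetRange(long start, long end) {
    this.start = start;
    this.end = end;
  }

  // No checkpoint means "start from the beginning" in this sketch.
  static OffsetRange compute(Optional<String> lastCheckpoint, long available, long sourceLimit) {
    long start = lastCheckpoint.map(Long::parseLong).orElse(0L);
    long end = Math.min(available, start + sourceLimit);
    return new OffsetRange(start, end);
  }

  // The string persisted as the next run's checkpoint.
  String checkpoint() {
    return Long.toString(end);
  }
}
```

In the real source, the same round-trip happens through `convertToOffsetString` over Pulsar `MessageId`s rather than numeric offsets.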
[jira] [Assigned] (HUDI-4549) hive sync bundle causes class loader issue
[ https://issues.apache.org/jira/browse/HUDI-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-4549:
---------------------------------
    Assignee: Sagar Sumit

> hive sync bundle causes class loader issue
> ------------------------------------------
>
>                 Key: HUDI-4549
>                 URL: https://issues.apache.org/jira/browse/HUDI-4549
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: dependencies
>            Reporter: Raymond Xu
>            Assignee: Sagar Sumit
>            Priority: Blocker
>             Fix For: 0.12.1
>
>
> A weird classpath issue I found: when testing deltastreamer using hudi-utilities-slim-bundle, if I put `--jars hudi-hive-sync-bundle.jar,hudi-spark-bundle.jar`, then I get this error when writing:
> {code:java}
> Caused by: java.lang.NoSuchMethodError: org.apache.hudi.avro.MercifulJsonConverter.convert(Ljava/lang/String;Lorg/apache/avro/Schema;)Lorg/apache/avro/generic/GenericRecord;
>     at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJson(AvroConvertor.java:86)
>     at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
> {code}
> If I put the spark bundle before the hive sync bundle, there is no issue. Without hive-sync-bundle, also no issue. So hive-sync-bundle somehow messes up the classpath? Not sure why it reports a hudi-common API not found; perhaps caused by shading avro?
> The same behavior I observed with aws-bundle, which makes sense, as it's a superset of hive-sync-bundle.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
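The workaround described in the report above is purely about jar order on the classpath. As a hedged illustration (a config fragment, not a verified command: the jar paths and the final argument list are hypothetical placeholders), the working ordering might look like:

```shell
# Hypothetical jar paths; adjust to your build output.
# Listing hudi-spark-bundle BEFORE hudi-hive-sync-bundle lets its classes
# win on the classpath, which is the ordering the reporter found to work.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /opt/hudi/hudi-spark-bundle.jar,/opt/hudi/hudi-hive-sync-bundle.jar \
  /opt/hudi/hudi-utilities-slim-bundle.jar \
  --table-type COPY_ON_WRITE
```

Per the report, reversing the two `--jars` entries reproduces the `NoSuchMethodError`.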
[jira] [Assigned] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range
[ https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-4528:
---------------------------------
    Assignee: Sagar Sumit

> Diff tool to compare metadata across snapshots in a given time range
> --------------------------------------------------------------------
>
>                 Key: HUDI-4528
>                 URL: https://issues.apache.org/jira/browse/HUDI-4528
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>
>
> A tool that diffs two snapshots at table and partition level and can give info about what new file ids got created, deleted, updated and track other changes that are captured in write stats.
[jira] [Updated] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range
[ https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-4528:
------------------------------
    Sprint: 2022/08/22
[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails
[ https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-4626:
------------------------------
    Sprint: 2022/08/22

> Partitioning table by `_hoodie_partition_path` fails
> ----------------------------------------------------
>
>                 Key: HUDI-4626
>                 URL: https://issues.apache.org/jira/browse/HUDI-4626
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>
>
> Currently, creating a table partitioned by "_hoodie_partition_path" using Glue catalog fails with the following exception:
> {code:java}
> AnalysisException: Found duplicate column(s) in the data schema and the partition schema: _hoodie_partition_path
> {code}
> Using the following DDL:
> {code:java}
> CREATE EXTERNAL TABLE `active_storage_attachments`(
>   `_hoodie_commit_time` string COMMENT '',
>   `_hoodie_commit_seqno` string COMMENT '',
>   `_hoodie_record_key` string COMMENT '',
>   `_hoodie_file_name` string COMMENT '',
>   `_change_operation_type` string COMMENT '',
>   `_upstream_event_processed_ts_ms` bigint COMMENT '',
>   `db_shard_source_partition` string COMMENT '',
>   `_event_origin_ts_ms` bigint COMMENT '',
>   `_event_tx_id` bigint COMMENT '',
>   `_event_lsn` bigint COMMENT '',
>   `_event_xmin` bigint COMMENT '',
>   `id` bigint COMMENT '',
>   `name` string COMMENT '',
>   `record_type` string COMMENT '',
>   `record_id` bigint COMMENT '',
>   `blob_id` bigint COMMENT '',
>   `created_at` timestamp COMMENT '')
> PARTITIONED BY (`_hoodie_partition_path` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'hoodie.query.as.ro.table'='false',
>   'path'='...')
> STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '...'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='hudi')
> {code}
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219027206

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10798)
[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes
hudi-bot commented on PR #6431: URL: https://github.com/apache/hudi/pull/6431#issuecomment-1219027215

## CI report:

* 7d755ad923370aa4ac6b46e75725883ee1434d23 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10799)
[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes
hudi-bot commented on PR #6431: URL: https://github.com/apache/hudi/pull/6431#issuecomment-1219025349

## CI report:

* 7d755ad923370aa4ac6b46e75725883ee1434d23 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219025328

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 UNKNOWN
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635820

## website/releases/release-0.12.0.md:

## @@ -0,0 +1,143 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write

Review Comment:
    Sets the context very nicely. Sounds much better. Thanks.
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635421

## website/releases/release-0.12.0.md:

## @@ -0,0 +1,143 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write
+configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding
+one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for
+old data. In previous versions, one would have to retain every commit and let archival stop at the first commit.
+
+:::note
+However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release
+and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500).
+:::
+
+### Deltastreamer Termination Strategy
+
+Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance,
+users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface
+for the termination strategy.
+```java
+/**
+ * Post write termination strategy for deltastreamer in continuous mode.
+ */
+public interface PostWriteTerminationStrategy {
+
+  /**
+   * Returns whether deltastreamer needs to be shutdown.
+   * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses.
+   * @return true if deltastreamer has to be shutdown. false otherwise.
+   */
+  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);
+
+}
+```
+
+Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large
+cluster for a large input of data, one could start deltastreamer on continuous mode and add a shutdown strategy to
+terminate, once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster
+to bootstrap data. We have one concrete implementation out of the box, [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java).
+Users can feel free to implement their own strategy as they see fit.
+
+### Spark 3.3 Support
+
+Spark 3.3 support is added; users who are on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle`. Spark 3.2,
+Spark 3.1 and Spark 2.4 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).
+
+### Spark SQL Support Improvements
+
+- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through `Call Procedure` command.
+- Support for `analyze table`.
+- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.
+
+### Flink 1.15 Support
+
+Flink 1.15.x is integrated with Hudi, use profile param `-Pflink1.15` when compiling the codes to adapt the version.
+Alternatively, use `hudi-flink1.15-bundle`. Flink 1.14 and Flink 1.13 will continue to be supported. Please check the
+migration guide for [bundle updates](#bundle-updates).
+
+### Flink Integration Improvements
+
+- **Data skipping** is supported for batch mode read, set up SQL option `metadata.enabled`, `hoodie.metadata.index.column.stats.enable` and `read.data.skipping.enabled` as true to enable it.
+- A **HMS-based Flink catalog** is added with catalog identifier as `hudi`. You can instantiate the catalog through API directly or use the `CREATE CATALOG` syntax to create it. Specifies catalog option `'mode' = 'hms'` to switch to the HMS catalog. By default, the catalog is in `dfs` mode.
+- **Async clustering** is supported for Flink `INSERT` operation, set up SQL option `clustering.schedule.enabled` and `clustering.async.enabled` as true to enable it. When enabling this feature, a clustering sub-pipeline is scheduled asynchronously continuously to merge the small files continuously into larger ones.
+
+### Performance Improvements
+
+This version brings more improvements to make Hudi the most performant lake storage format. Some notable improvements are:
+- Closed the
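The `PostWriteTerminationStrategy` interface quoted in the release notes above leaves the shutdown policy entirely to implementations. A minimal, Hudi-free sketch of the "no new data for N consecutive rounds" idea those notes describe (the class and method shapes below are illustrative assumptions, not Hudi's actual `NoNewDataTerminationStrategy`):

```java
// Simplified "shut down after N consecutive empty rounds" policy. It mirrors
// the counting logic a termination strategy could use, but takes a plain
// boolean per sync round instead of Hudi's Option/Pair/JavaRDD types.
class NoNewDataShutdownPolicy {
  private final int maxEmptyRounds;
  private int consecutiveEmptyRounds = 0;

  NoNewDataShutdownPolicy(int maxEmptyRounds) {
    this.maxEmptyRounds = maxEmptyRounds;
  }

  // Call once per sync round; any round with new data resets the counter.
  boolean shouldShutdown(boolean hasNewData) {
    consecutiveEmptyRounds = hasNewData ? 0 : consecutiveEmptyRounds + 1;
    return consecutiveEmptyRounds >= maxEmptyRounds;
  }
}
```

With `maxEmptyRounds = 5` this matches the "no new data from source for 5 consecutive times" example in the release notes.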
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635344 ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write +configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding +one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for +old data. In previous versions, one would have to retain every commit and let archival stop at the first commit. + +:::note +However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release +and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500). +::: + +### Deltastreamer Termination Strategy + +Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance, +users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface +for the termination strategy. +```java +/** + * Post write termination strategy for deltastreamer in continuous mode. 
 */
public interface PostWriteTerminationStrategy {

  /**
   * Returns whether deltastreamer needs to be shutdown.
   * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses.
   * @return true if deltastreamer has to be shutdown, false otherwise.
   */
  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);

}
```

Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large cluster for a large input of data, one could start deltastreamer in continuous mode and add a shutdown strategy to terminate once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster to bootstrap the data. One concrete implementation is available out of the box: [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java). Users are free to implement their own strategy as they see fit.

### Spark 3.3 Support

Spark 3.3 support is added; users who are on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle`. Spark 3.2, Spark 3.1 and Spark 2.4 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).

### Spark SQL Support Improvements

- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through the `Call Procedure` command.
- Support for `analyze table`.
- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.

### Flink 1.15 Support

Flink 1.15.x is integrated with Hudi; use the profile param `-Pflink1.15` when compiling the code to adapt to the version. Alternatively, use `hudi-flink1.15-bundle`. Flink 1.14 and Flink 1.13 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).
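As an illustration of the `PostWriteTerminationStrategy` idea quoted above, here is a minimal, self-contained sketch of a shutdown policy that triggers after a fixed number of consecutive rounds without new data. It deliberately uses plain JDK types (`java.util.Optional` and a record count) instead of Hudi's `Option`/`JavaRDD` signature, so the class name and parameters here are illustrative, not Hudi's actual API.

```java
import java.util.Optional;

// Illustrative stand-in for a termination strategy: shut down once
// maxEmptyRounds consecutive write rounds produced no records.
class NoNewDataShutdownSketch {
    private final int maxEmptyRounds;
    private int emptyRounds = 0;

    NoNewDataShutdownSketch(int maxEmptyRounds) {
        this.maxEmptyRounds = maxEmptyRounds;
    }

    // writtenRecords is empty when the round wrote nothing at all.
    boolean shouldShutdown(Optional<Long> writtenRecords) {
        if (writtenRecords.isPresent() && writtenRecords.get() > 0) {
            emptyRounds = 0;   // fresh data arrived; reset the streak
        } else {
            emptyRounds++;     // another round with no new data
        }
        return emptyRounds >= maxEmptyRounds;
    }
}
```

With `maxEmptyRounds = 5`, this mirrors the "no new data from the source for five consecutive rounds" example from the release note.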
### Flink Integration Improvements

- **Data skipping** is supported for batch-mode reads; set the SQL options `metadata.enabled`, `hoodie.metadata.index.column.stats.enable` and `read.data.skipping.enabled` to true to enable it.
- A **HMS-based Flink catalog** is added with the catalog identifier `hudi`. You can instantiate the catalog through the API directly or use the `CREATE CATALOG` syntax to create it. Specify the catalog option `'mode' = 'hms'` to switch to the HMS catalog; by default, the catalog is in `dfs` mode.
- **Async clustering** is supported for the Flink `INSERT` operation; set the SQL options `clustering.schedule.enabled` and `clustering.async.enabled` to true to enable it. When this feature is enabled, a clustering sub-pipeline is scheduled asynchronously to continuously merge small files into larger ones.

### Performance Improvements

This version brings more improvements to make Hudi the most performant lake storage format. Some notable improvements are:
- Closed the
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219023427 ## CI report: * 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] junyuc25 opened a new pull request, #6431: Shutdown CloudWatch reporter when query completes
junyuc25 opened a new pull request, #6431: URL: https://github.com/apache/hudi/pull/6431 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] QuChunhe opened a new issue, #6430: [SUPPORT] Flink SQL can't read complex-type data written by the Java client
QuChunhe opened a new issue, #6430: URL: https://github.com/apache/hudi/issues/6430

1. The Java client writes complex-type data, such as nested `ARRAY` types:

```java
private String baseFileFormat = "parquet";

Path path = new Path(tablePath);
FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
String schema = Util.getStringFromResource("/" + databasePrefix + "." + tableName + ".schema");
if (!fs.exists(path)) {
  HoodieTableMetaClient.withPropertyBuilder()
      .setBaseFileFormat(baseFileFormat)
      .setPartitionFields(partitionFields)
      .setHiveStylePartitioningEnable(false)
      .setTableCreateSchema(schema)
      .setTableType(HoodieTableType.MERGE_ON_READ)
      .setRecordKeyFields(recordKeyFields)
      .setPayloadClassName(DefaultHoodieRecordPayload.class.getName())
      .setTableName(tableName)
      .initTable(hadoopConf, tablePath);
}

// Create the write client to write some records in
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(tablePath)
    .withAutoCommit(true)
    .withEmbeddedTimelineServerEnabled(false)
    .withRollbackUsingMarkers(false)
    .withBulkInsertParallelism(parallelism)
    .withSchema(schema)
    //.withSchemaEvolutionEnable(true)
    .withParallelism(parallelism, parallelism)
    .withDeleteParallelism(1)
    //.withEngineType(EngineType.SPARK)
    .forTable(tableName)
    .withMetadataConfig(
        HoodieMetadataConfig.newBuilder()
            .enable(true)
            .build())
    .withConsistencyGuardConfig(
        ConsistencyGuardConfig.newBuilder()
            .withEnableOptimisticConsistencyGuard(false)
            .build())
    .withIndexConfig(
        HoodieIndexConfig.newBuilder()
            .withIndexType(IndexType.BLOOM)
            .build())
    .withCompactionConfig(
        HoodieCompactionConfig.newBuilder()
            .withAutoArchive(true)
            .withAutoClean(true)
            .withCompactionLazyBlockReadEnabled(true)
            .withAsyncClean(true)
            .build())
    .build();
```

2. Flink SQL cannot read the data and throws the following errors:

```
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
  at java.util.ArrayList.rangeCheck(ArrayList.java:659) ~[?:1.8.0_332]
  at java.util.ArrayList.get(ArrayList.java:435) ~[?:1.8.0_332]
  at org.apache.parquet.schema.GroupType.getType(GroupType.java:216) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:514) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:480) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:406) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.createWritableVectors(ParquetColumnarRowSplitReader.java:216) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:156) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:147) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.getReader(MergeOnReadInputFormat.java:306) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:177) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:81) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:104) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:60) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:269) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
```

2022-08-18 11:43:18,280 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing
[jira] [Updated] (HUDI-4636) Output preCombine fields of delete records when changelog disabled
[ https://issues.apache.org/jira/browse/HUDI-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4636: - Labels: pull-request-available (was: ) > Output preCombine fields of delete records when changelog disabled > > > Key: HUDI-4636 > URL: https://issues.apache.org/jira/browse/HUDI-4636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: yonghua jian >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] flashJd opened a new pull request, #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
flashJd opened a new pull request, #6429: URL: https://github.com/apache/hudi/pull/6429 ### Change Logs When changelog is disabled, Flink will output the pk columns only: https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java We also need the preCombine and partition fields, hence this pull request. ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
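As a self-contained illustration of the projection discussed in this PR (class and field names here are invented, not Hudi's actual API), the difference is which fields survive on an emitted delete record: if only the pk columns are kept, a downstream consumer cannot order the delete against earlier upserts by the preCombine field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: project a full row down to the fields retained on a
// DELETE record. Keeping only the pk loses the preCombine/partition fields.
class DeleteRecordProjection {
    static Map<String, Object> project(Map<String, Object> row, List<String> keptFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : keptFields) {
            if (row.containsKey(field)) {
                out.put(field, row.get(field));
            }
        }
        return out;
    }
}
```

Before the change the kept-field list would be just the pk (e.g. `List.of("id")`); the PR's intent is to also keep the preCombine and partition fields (e.g. `List.of("id", "ts", "dt")`).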
[jira] [Created] (HUDI-4636) Output preCombine fields of delete records when changelog disabled
yonghua jian created HUDI-4636: -- Summary: Output preCombine fields of delete records when changelog disabled Key: HUDI-4636 URL: https://issues.apache.org/jira/browse/HUDI-4636 Project: Apache Hudi Issue Type: Improvement Reporter: yonghua jian -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on issue #6422: [SUPPORT]: hudi build failing for hudi-flink-client when no maven build option is provided
danny0405 commented on issue #6422: URL: https://github.com/apache/hudi/issues/6422#issuecomment-1219012429 What version did you use? It may be caused by the forced activation of the profile, which I recently removed from the master code: https://github.com/apache/hudi/pull/6415 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219002712 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) * 72188dc38211a6a540256f168412d03e1cb86765 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219000957 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) * 72188dc38211a6a540256f168412d03e1cb86765 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218994371 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948606837

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: @danny0405

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948604910

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment:
> So why we configure a `COMPLEX` key generator while the key is just simple here?

Because Flink SQL's default logic chooses the COMPLEX key generator when `boolean complexHoodieKey = pks.length > 1 || partitions.length > 1;` holds: https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java#L239

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
danny0405 commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948601779

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: So why do we configure a `COMPLEX` key generator while the key is just simple here?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948600424

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: @danny0405 I've explained it; looking forward to your reply.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
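For context on the review thread above, here is a self-contained sketch of the key-parsing behavior being discussed: a complex record key is serialized as `field1:val1,field2:val2`, and the quoted hunk removes the single-field short-circuit so a one-field key like `id:1` is also split into field and value. The placeholder strings and the extra length guard below are assumptions for illustration, not the exact Hudi code.

```java
import java.util.Arrays;

// Sketch of extractRecordKeys-style parsing: split "f1:v1,f2:v2" into values,
// mapping placeholder values back to null / empty string.
class RecordKeySketch {
    static final String NULL_PLACEHOLDER = "__null__";   // assumed placeholder values
    static final String EMPTY_PLACEHOLDER = "__empty__";

    static String[] extractRecordKeys(String recordKey) {
        String[] fieldKV = recordKey.split(",");
        return Arrays.stream(fieldKV).map(kv -> {
            final String[] kvArray = kv.split(":");
            if (kvArray.length == 1) {
                return kvArray[0];                 // no "field:" prefix at all
            } else if (kvArray[1].equals(NULL_PLACEHOLDER)) {
                return null;                       // record key field was null
            } else if (kvArray[1].equals(EMPTY_PLACEHOLDER)) {
                return "";                         // record key field was empty
            } else {
                return kvArray[1];
            }
        }).toArray(String[]::new);
    }
}
```

The thread's point is that with the short-circuit removed, a single-field key produced by the COMPLEX key generator (`id:1`) parses the same way as a multi-field one.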
[GitHub] [hudi] zhilinli123 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
zhilinli123 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218940639 > It should be, we have supported the option `hive_sync.conf.dir` to support custom hive configurations, you can declare your kerb conf in the target dir Thank you! It would be nice to have a reference document. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
TengHuo commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1218935335 @yihua So sorry for the late reply; too many issues recently and I forgot this one. Let me check it today. Thanks for the reminder. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
danny0405 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218927034 > @test-wangxiaoyu @codope Will the new version support this feature It should be. We have supported the option `hive_sync.conf.dir` for custom Hive configurations; you can declare your Kerberos conf in the `hive-site.xml` file in the target dir. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] XuQianJin-Stars commented on pull request #4913: [HUDI-1517] create marker file for every log file
XuQianJin-Stars commented on PR #4913: URL: https://github.com/apache/hudi/pull/4913#issuecomment-1218917104 Hi @guanziyue, can you rebase this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhilinli123 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
zhilinli123 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218909566 @test-wangxiaoyu @codope Will the new version support this feature? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218906064 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10795) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns
rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns URL: https://github.com/apache/hudi/issues/6335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rohit-m-99 commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns
rohit-m-99 commented on issue #6335: URL: https://github.com/apache/hudi/issues/6335#issuecomment-1218892631 Option 2 worked for me! Set hoodie.metadata.enable to false in Deltastreamer and wait for a few commits so that metadata table is deleted completely (no .hoodie/metadata folder), and then re-enable the metadata table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218814176 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791) * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218805623 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791) * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
nsivabalan commented on code in PR #6386: URL: https://github.com/apache/hudi/pull/6386#discussion_r948550727

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/PulsarSource.java: @@ -0,0 +1,292 @@

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.utilities.sources;

import org.apache.hudi.DataSourceUtils;
import org.apache.hudi.HoodieConversionUtils;
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieIOException;
import org.apache.hudi.util.Lazy;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;
import org.apache.pulsar.client.api.SubscriptionType;
import org.apache.pulsar.client.impl.PulsarClientImpl;
import org.apache.pulsar.common.naming.TopicName;
import org.apache.pulsar.shade.io.netty.channel.EventLoopGroup;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.pulsar.JsonUtils;

import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;

import static org.apache.hudi.common.util.ThreadUtils.collectActiveThreads;

/**
 * Source fetching data from Pulsar topics
 */
public class PulsarSource extends RowSource implements Closeable {

  private static final Logger LOG = LogManager.getLogger(PulsarSource.class);

  private static final String HUDI_PULSAR_CONSUMER_ID_FORMAT = "hudi-pulsar-consumer-%d";
  private static final String[] PULSAR_META_FIELDS = new String[]{
      "__key",
      "__topic",
      "__messageId",
      "__publishTime",
      "__eventTime",
      "__messageProperties"
  };

  private final String topicName;

  private final String serviceEndpointURL;
  private final String adminEndpointURL;

  // NOTE: We're keeping the client so that we can shut it down properly
  private final Lazy<PulsarClient> pulsarClient;
  private final Lazy<Consumer<byte[]>> pulsarConsumer;

  public PulsarSource(TypedProperties props,
                      JavaSparkContext sparkContext,
                      SparkSession sparkSession,
                      SchemaProvider schemaProvider) {
    super(props, sparkContext, sparkSession, schemaProvider);

    DataSourceUtils.checkRequiredProperties(props,
        Arrays.asList(
            Config.PULSAR_SOURCE_TOPIC_NAME.key(),
            Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key()));

    // Converting to a descriptor allows us to canonicalize the topic's name properly
    this.topicName = TopicName.get(props.getString(Config.PULSAR_SOURCE_TOPIC_NAME.key())).toString();

    // TODO validate endpoints provided in the appropriate format
    this.serviceEndpointURL = props.getString(Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key());
    this.adminEndpointURL = props.getString(Config.PULSAR_SOURCE_ADMIN_ENDPOINT_URL.key());

    this.pulsarClient = Lazy.lazily(this::initPulsarClient);
    this.pulsarConsumer = Lazy.lazily(this::subscribeToTopic);
  }

  @Override
  protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCheckpointStr, long sourceLimit) {
    Pair<MessageId, MessageId> startingEndingOffsetsPair = computeOffsets(lastCheckpointStr, sourceLimit);

    MessageId startingOffset = startingEndingOffsetsPair.getLeft();
    MessageId endingOffset = startingEndingOffsetsPair.getRight();

    String startingOffsetStr = convertToOffsetString(topicName, startingOffset);
    String endingOffsetStr = convertToOffsetString(topicName, endingOffset);

    Dataset<Row> sourceRows = sparkSession.read()
        .format("pulsar")
[GitHub] [hudi] bithw1 commented on issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 commented on issue #6425: URL: https://github.com/apache/hudi/issues/6425#issuecomment-1218713061 Thanks @nsivabalan for the helpful answer. When I first write to the hudi table, there are only parquet files there(**insert only**), and when I rerun the application(**update only**), log files appear. So, the reason should be the following: ``` if your workload will never have any updates, you may not see log files at all. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bithw1 closed issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 closed issue #6425: [SUPPORT]Writing to MOR table seems not working as expected URL: https://github.com/apache/hudi/issues/6425
[GitHub] [hudi] bithw1 commented on issue #6379: [SUPPORT]What's the reading behavior for MOR table?
bithw1 commented on issue #6379: URL: https://github.com/apache/hudi/issues/6379#issuecomment-1218706302 > yes, you are right. Thanks @nsivabalan for the confirmation!
[GitHub] [hudi] bithw1 closed issue #6379: [SUPPORT]What's the reading behavior for MOR table?
bithw1 closed issue #6379: [SUPPORT]What's the reading behavior for MOR table? URL: https://github.com/apache/hudi/issues/6379
[GitHub] [hudi] dyang108 opened a new issue, #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
dyang108 opened a new issue, #6428: URL: https://github.com/apache/hudi/issues/6428

**Describe the problem you faced**

Deltastreamer with write output to S3 exits unexpectedly when running in continuous mode. It seems

**To Reproduce**

Steps to reproduce the behavior — I ran the following:

```
/etc/spark/bin/spark-submit --conf -Dconfig.file=/service.conf,spark.executor.extraJavaOptions=-Dlog4j.debug=true --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars /etc/spark/work-dir/* /etc/spark/work-dir/hudi-utilities-bundle_2.11-0.12.0.jar --props /mnt/mesos/sandbox/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --target-base-path "s3a://strava.scratch/tmp/derick/hudi" --target-table "aligned_activities" --op "UPSERT" --source-ordering-field "ts" --table-type "COPY_ON_WRITE" --source-limit 100 --continuous
```

/etc/spark/work-dir/ looks like this:

- aws-java-sdk-bundle-1.12.283.jar
- hadoop-aws-2.6.5.jar
- hudi-utilities-bundle_2.11-0.12.0.jar
- scala-library-2.11.12.jar
- spark-streaming-kafka-0-10_2.11-2.4.8.jar

**Expected behavior**

I don't expect there to be issues on compaction here.

**Environment Description**

* Hudi version : 0.12.0 (also tried 0.11.1)
* Spark version : 2.4.8
* Hive version :
* Hadoop version : 2.6.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : Yes, docker on Mesos

I'm reading from an Avro kafka topic.

**Additional context**

Add any other context about the problem here.
Reading Avro record from Kafka:

```
hoodie.datasource.write.recordkey.field=activityId
auto.offset.reset=latest
```

**Stacktrace**

```
22/08/17 23:07:26 ERROR HoodieAsyncService: Service shutdown with error
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220817230714888
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:190)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:187)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220817230714888
    at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
    at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
    at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:588)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:687)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
```
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218684100 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10795) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Created] (HUDI-4635) Update roadmap page based on H2 2022 plan
Ethan Guo created HUDI-4635: --- Summary: Update roadmap page based on H2 2022 plan Key: HUDI-4635 URL: https://issues.apache.org/jira/browse/HUDI-4635 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Priority: Blocker (was: Major) > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.13.0 > >
[jira] [Assigned] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-4635: --- Assignee: Ethan Guo > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Component/s: docs > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Fix Version/s: 0.13.0 > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218676855 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] parisni commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
parisni commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218625262 Yeah, KEEP_LATEST_COMMITS. Since cleaning never finds files to delete, it always falls back into getPartitionPathsForFullCleaning(). But that method looks for paths on disk, while it then looks for file groups to delete in the metadata table. Also, I guess there is a problem with using incremental cleaning together with KEEP_LATEST_COMMITS, which leads to never cleaning some partitions after a first clean, but I will open a separate issue for that one. Incremental cleaning should be used together with KEEP_LATEST_FILE_VERSIONS only.
[GitHub] [hudi] yihua opened a new pull request, #6427: [MINOR] Improve code style of CLI Command classes
yihua opened a new pull request, #6427: URL: https://github.com/apache/hudi/pull/6427 ### Change Logs As above for code style changes only. There is no logic change. ### Impact **Risk level: none** Only code reformatting. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218565541 May I know what cleaning policy you are using? I see that for KEEP_LATEST_FILE_VERSIONS, we call getPartitionPathsForFullCleaning(), within which we use file-system-based listing and not metadata-table-based listing. And if you are using KEEP_LATEST_COMMITS with incremental clean mode enabled, if there is no prior clean ever, we trigger getPartitionPathsForFullCleaning() (within which we use file-system-based listing and not metadata-table-based listing). If not for these, we should be hitting only metadata-based listing. Can you confirm which one among the above is your case?
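The decision logic described in the comment above can be sketched in plain Java. This is a toy model of the rules as stated in the thread, with no Hudi dependency; the method name and boolean flags are illustrative, not actual Hudi internals:

```java
public class CleanListingSketch {

  // Mirrors the rules stated above:
  // - KEEP_LATEST_FILE_VERSIONS always triggers full file-system listing.
  // - KEEP_LATEST_COMMITS falls back to full listing when incremental clean
  //   is disabled or when no clean has ever completed; otherwise it can use
  //   the metadata-table-based (incremental) path.
  static boolean usesFullFileSystemListing(String policy,
                                           boolean incrementalCleanEnabled,
                                           boolean hasCompletedClean) {
    if ("KEEP_LATEST_FILE_VERSIONS".equals(policy)) {
      return true;
    }
    // KEEP_LATEST_COMMITS
    return !incrementalCleanEnabled || !hasCompletedClean;
  }

  public static void main(String[] args) {
    if (!usesFullFileSystemListing("KEEP_LATEST_FILE_VERSIONS", true, true)) {
      throw new AssertionError("file-versions policy should always list fully");
    }
    if (!usesFullFileSystemListing("KEEP_LATEST_COMMITS", true, false)) {
      throw new AssertionError("first-ever clean should list fully");
    }
    if (usesFullFileSystemListing("KEEP_LATEST_COMMITS", true, true)) {
      throw new AssertionError("steady-state incremental clean should not list fully");
    }
    System.out.println("ok");
  }
}
```

Under this model, parisni's report (full listing on every clean) matches the KEEP_LATEST_COMMITS path where the incremental branch is never reached.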
[GitHub] [hudi] parisni commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
parisni commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218559061 > But as you can imagine, this is going to result in huge no of file groups in general and puts lot of pressure on the system Do you mean pressure when cleaning, pressure when reading, or in general? Also, insert produces the same number of file groups, since I am in a case of an append-only table with no new data in the past. Anyway, cleaning is much faster w/o the metadata table, and it would help to be able to configure cleaning to work on disk only.
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218558261 I am looking into the slowness in cleaning. will keep you posted.
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218528854 wrt bulk_insert, I understand cleaning is not going to be of any use, because every new commit goes into new file groups. Hence there won't be any file groups with multiple file slices that might be eligible for cleaning. But as you can imagine, this is going to result in a huge number of file groups in general and puts a lot of pressure on the system.
[GitHub] [hudi] nsivabalan commented on issue #6379: [SUPPORT]What's the reading behavior for MOR table?
nsivabalan commented on issue #6379: URL: https://github.com/apache/hudi/issues/6379#issuecomment-1218526892 yes, you are right.
[GitHub] [hudi] nsivabalan commented on issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
nsivabalan commented on issue #6425: URL: https://github.com/apache/hudi/issues/6425#issuecomment-1218526557 Looks like you are setting max commits to trigger compaction to 1. And so, after every delta commit, compaction is triggered, converting parquet + log files into new parquet files. Another reason could be: if your workload never has any updates, you may not see log files at all.
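The NUM_COMMITS-style trigger described above can be sketched in a few lines of plain Java. This is an illustrative model, not Hudi's actual scheduling code; the relevant write config is `hoodie.compact.inline.max.delta.commits`:

```java
public class CompactionTriggerSketch {

  // Schedule compaction once the number of delta commits since the last
  // compaction reaches the configured threshold.
  static boolean shouldScheduleCompaction(int deltaCommitsSinceLastCompaction, int maxDeltaCommits) {
    return deltaCommitsSinceLastCompaction >= maxDeltaCommits;
  }

  public static void main(String[] args) {
    // With the threshold set to 1, every delta commit triggers compaction,
    // so log files are merged into base parquet files almost immediately.
    if (!shouldScheduleCompaction(1, 1)) {
      throw new AssertionError();
    }
    // With a threshold of 5, log files accumulate across delta commits first.
    if (shouldScheduleCompaction(4, 5)) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```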
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218514404 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10794) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218504183 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10794) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218459979 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218448062 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218444100 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6345: [HUDI-4552]: RFC-58: Integrate column stats index with all query engines
alexeykudinkin commented on code in PR #6345: URL: https://github.com/apache/hudi/pull/6345#discussion_r948358504 ## rfc/rfc-58/rfc-58.md: ## @@ -0,0 +1,69 @@ + +# RFC-58: Integrate column stats index with all query engines + + + +## Proposers + +- @pratyakshsharma + +## Approvers Review Comment: @prasannarajaperumal please add me to this forum as well
[GitHub] [hudi] the-other-tim-brown commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
the-other-tim-brown commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218430020 @hudi-bot run azure
[GitHub] [hudi] parisni commented on issue #6342: [SUPPORT] Reconcile schema fails when multiple fields missing
parisni commented on issue #6342: URL: https://github.com/apache/hudi/issues/6342#issuecomment-1218417408 Also, after more testing: the issue is not with multiple columns missing, but when the last column is missing. On August 15, 2022 5:42:26 PM UTC, Sivabalan Narayanan ***@***.***> wrote: >reconcilation of schemas work only if missing columns are in the end. > >For eg: >Commit1: >Schema: col1, col2, col3, col4 > >Commit2: >Schema: col1, col2, col3, col4, col5, col6 > >Commit3: old writer. >Schema: col1, col2, col3, col4 >Commit3 will succeed when you enable reconcile schema config. > >In general, adding new colunmns at the end of existing schema works. it may break if you try to add new columns inbetween. for eg, in commit2, if your schema as col1, col5, col6, col2, col3, col4, your write or read may break.
[GitHub] [hudi] parisni commented on issue #6342: [SUPPORT] Reconcile schema fails when multiple fields missing
parisni commented on issue #6342: URL: https://github.com/apache/hudi/issues/6342#issuecomment-1218406180 > reconcilation of schemas work only if missing columns are in the end @nsivabalan Do you mean https://hudi.apache.org/docs/configurations#hoodiedatasourcewritereconcileschema ? That's not the behavior I see: I'm able to write data with a missing column at any position, until only one is missing. Also, the doc doesn't mention such a limitation, nor does the source code. @xiarixiaoyao as the dev of the reconcile feature, do you have some insight?
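The constraint as stated in the quoted reply (a claim parisni disputes above) amounts to a suffix check: an incoming schema can be reconciled only if the columns it is missing form a suffix of the table schema. A toy sketch of that rule in plain Java, not Hudi's actual reconciliation code:

```java
import java.util.List;

public class ReconcileSchemaSketch {

  // Rule as stated: the incoming schema must match a prefix of the table
  // schema, i.e. any missing columns sit at the end (newly added columns).
  static boolean canReconcile(List<String> tableSchema, List<String> incomingSchema) {
    if (incomingSchema.size() > tableSchema.size()) {
      return false;
    }
    return incomingSchema.equals(tableSchema.subList(0, incomingSchema.size()));
  }

  public static void main(String[] args) {
    List<String> table = List.of("col1", "col2", "col3", "col4");
    // Older writer missing the trailing columns: reconcilable under the rule.
    if (!canReconcile(table, List.of("col1", "col2"))) {
      throw new AssertionError();
    }
    // Writer missing a middle column: not reconcilable under the rule.
    if (canReconcile(table, List.of("col1", "col3", "col4"))) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```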
[GitHub] [hudi] yihua commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
yihua commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1218365693 @TengHuo any update on addressing the comment above? Do you need any help?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
nsivabalan commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948281603 ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write Review Comment: Probably we can rephrase as below. ``` Hudi supports the savepoint and restore feature that users can use for disaster recovery scenarios. More info can be found [here](https://hudi.apache.org/docs/disaster_recovery). Until 0.12, archival for a given table will not make progress beyond the first savepointed commit. But there has been an ask from the community to relax this constraint so that we can retain some coarse-grained commits in the active timeline and execute point-in-time queries. So, with 0.12, users can now let archival proceed beyond savepointed commits by enabling `hoodie.archive.beyond.savepoint`. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding one savepoint per day for older commits (let's say > 30 days). And query the Hudi table using "as.of.instant" with any older savepointed commit. This way, Hudi does not need to retain every commit in the active timeline for older commits. ``` Let me know what you think. 
``` ## website/releases/release-0.12.0.md: Review Comment: do retain the "note" section in the end. ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write +configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding +one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for +old data. 
In previous versions, one would have to retain every commit and let archival stop at the first commit. + +:::note +However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release +and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500). +::: + +### Deltastreamer Termination Strategy + +Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance, +users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface +for the termination strategy. +```java +/** + * Post write termination strategy for deltastreamer in continuous mode. + */ +public interface PostWriteTerminationStrategy { + + /** + * Returns whether deltastreamer needs to be shutdown. + * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses. + * @return true if
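The quoted `PostWriteTerminationStrategy` interface is cut off above; conceptually it is a predicate consulted after each write round. A minimal sketch of the "graceful shutdown after no new data for N consecutive rounds" policy mentioned in the text (illustrative Python, not Hudi's actual Java API):

```python
class NoNewDataTerminationStrategy:
    """Shut down after `max_empty_rounds` consecutive rounds with no new data.

    Illustrative sketch of a post-write termination policy; the real
    implementation is Hudi's Java PostWriteTerminationStrategy.
    """

    def __init__(self, max_empty_rounds=5):
        self.max_empty_rounds = max_empty_rounds
        self.empty_rounds = 0

    def should_shutdown(self, records_written):
        if records_written == 0:
            self.empty_rounds += 1
        else:
            self.empty_rounds = 0  # any progress resets the counter
        return self.empty_rounds >= self.max_empty_rounds


strategy = NoNewDataTerminationStrategy(max_empty_rounds=3)
stop = False
for n in [10, 0, 0, 0]:
    stop = strategy.should_shutdown(n)
print(stop)  # True after three consecutive empty rounds
```

The key design point is that the counter resets on any progress, so a quiet period only triggers shutdown if it is truly uninterrupted.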
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5113: [HUDI-3625] [RFC-56] Optimized storage layout for Cloud Object Stores
alexeykudinkin commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r948277674 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + + +# RFC-56: Federated Storage Layer + +## Proposers +- @umehrot2 + +## Approvers +- @vinoth +- @shivnarayan + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625) + +## Abstract + +As you scale your Apache Hudi workloads over Cloud object stores like Amazon S3, there is potential of hitting request +throttling limits which in-turn impacts performance. In this RFC, we are proposing to support an alternate storage +layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and +significantly reduce throttling. + +In addition, we are proposing an interface that would allow users to implement their own custom strategy to allow them +to distribute the data files across cloud stores, hdfs or on prem based on their specific use-cases. + +## Background + +Apache Hudi follows the traditional Hive storage layout while writing files on storage: +- Partitioned Tables: The files are distributed across multiple physical partition folders, under the table's base path. +- Non Partitioned Tables: The files are stored directly under the table's base path. + +While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when +working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle +requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). +Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits), +but there can be a 30 - 60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the +same table path prefix could result in these request limits being hit for the table prefix, especially as workloads +scale, and there are several thousands of files being written/updated concurrently. This hurts performance due to +re-trying of failed requests affecting throughput, and results in occasional failures if the retries are not able to +succeed either and continue to be throttled. + +The traditional storage layout also tightly couples the partitions as folders under the table path. However, +some users want flexibility to be able to distribute files/partitions under multiple different paths across cloud stores, +hdfs etc. based on their specific needs. For example, customers have use cases to distribute files for each partition under +a separate S3 bucket with its individual encryption key. It is not possible to implement such use-cases with Hudi currently. + +The high level proposal here is to introduce a new storage layout strategy, where all files are distributed evenly across +multiple randomly generated prefixes under the Amazon S3 bucket, instead of being stored under a common table path/prefix. Review Comment: I think we need to compartmentalize this discussion in 2 actually: 1. Federated Storage support: being able to offload "partitions" to different buckets 2. Addressing the "common prefix" problem For the latter, the solution is not necessarily the former, and we already have a solution for it -- we eliminate physical partitioning altogether, store tables as non-partitioned ones and support "logical-partitioning" on top of it (which requires just 2 ingredients, partition-listing and partitioning constraints on what records are stored in files w/in the partition). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
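The layout proposed in the RFC above — spreading files evenly over randomly generated prefixes instead of one common table prefix — boils down to a stable mapping from file to prefix. A minimal sketch (illustrative Python; the prefix naming and count here are assumptions, not the RFC's actual strategy):

```python
import hashlib


def storage_prefix(file_id, prefixes):
    """Map a file id to one of several storage prefixes.

    A stable hash spreads objects evenly across the prefix set, so
    request load is not concentrated under a single table path prefix
    (the throttling problem described in the RFC background).
    """
    digest = hashlib.md5(file_id.encode("utf-8")).hexdigest()
    return prefixes[int(digest, 16) % len(prefixes)]


# Hypothetical prefix set; a real strategy would persist this mapping.
prefixes = [f"s3://bucket/{i:04x}" for i in range(16)]

# The same file id always maps to the same prefix...
assert storage_prefix("file-0001.parquet", prefixes) == storage_prefix("file-0001.parquet", prefixes)

# ...while many file ids spread across the whole prefix set.
used = {storage_prefix(f"file-{i}.parquet", prefixes) for i in range(1000)}
print(len(used))  # close to 16
```

Determinism matters here: since the prefix is derived from the file id alone, readers can locate a file without a central lookup, at the cost of losing the ability to list a whole table under one prefix — which is why the RFC pairs this with metadata-based file listing.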
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5139: [WIP][HUDI-3579] Add timeline commands in hudi-cli
nsivabalan commented on code in PR #5139: URL: https://github.com/apache/hudi/pull/5139#discussion_r948228142 ## hudi-cli/src/main/java/org/apache/hudi/cli/commands/TimelineCommand.java: ## @@ -0,0 +1,410 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.cli.commands; + +import org.apache.hudi.avro.model.HoodieRollbackMetadata; +import org.apache.hudi.avro.model.HoodieRollbackPlan; +import org.apache.hudi.cli.HoodieCLI; +import org.apache.hudi.cli.HoodiePrintHelper; +import org.apache.hudi.cli.HoodieTableHeaderField; +import org.apache.hudi.cli.TableHeader; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.metadata.HoodieTableMetadata; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.springframework.shell.core.CommandMarker; +import org.springframework.shell.core.annotation.CliCommand; +import org.springframework.shell.core.annotation.CliOption; +import org.springframework.stereotype.Component; + +import java.io.IOException; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.Date; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * CLI command to display timeline options. Review Comment: can we add some examples here.
[GitHub] [hudi] yihua commented on issue #4230: [SUPPORT] org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file
yihua commented on issue #4230: URL: https://github.com/apache/hudi/issues/4230#issuecomment-1218290284 @BenjMaq recently we fixed a bug in handling the timeline-server-based marker requests at the timeline server, which should resolve the problem of marker creation failure: #6383. If you get a chance, could you try the latest master with `hoodie.write.markers.type=TIMELINE_SERVER_BASED` (or using the default by not setting such config) and see if the problem goes away?
[GitHub] [hudi] Zouxxyy commented on issue #6397: [SUPPORT] spark history server - sql tab
Zouxxyy commented on issue #6397: URL: https://github.com/apache/hudi/issues/6397#issuecomment-1218248368 > Yes exactly that's what we're looking for. Did we miss some configurations? I didn't set any other parameters for it; I don't know why you can't display it. Maybe it's a web page rendering problem?
[GitHub] [hudi] Zouxxyy opened a new issue, #6426: [SUPPORT] Questions about the rollback mechanism of the MOR table
Zouxxyy opened a new issue, #6426: URL: https://github.com/apache/hudi/issues/6426 I'm a Hudi beginner and have some questions; I hope to get answers ^_^ 1. I would like to know if my understanding of the MOR table rollback mechanism is correct: the rollback logic of the MOR table is to append a new log block that records the rollback, and Hudi will not merge logs marked as rolled back in subsequent compactions. 2. Why not just delete the logs that need to be rolled back? Wouldn't that be more convenient? (Just like the COW table, which deletes the parquet files that need rollback.) Thank you very much!
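On question 1: the MOR rollback path appends a rollback command block to the log rather than rewriting it, and the merge/compaction path then skips data blocks from rolled-back instants. A minimal sketch of that merge rule (illustrative Python; the dict-based block format here is an assumption, not Hudi's actual log format):

```python
def merge_log(blocks):
    """Merge MOR log blocks, honoring rollback command blocks.

    Each block is a dict: {"type": "data", "instant": t, "records": [...]}
    or {"type": "rollback", "target": t}. Rather than rewriting the log,
    a rollback appends a command block; on read/compaction, data blocks
    whose instant was rolled back are simply skipped.
    """
    rolled_back = {b["target"] for b in blocks if b["type"] == "rollback"}
    records = []
    for b in blocks:
        if b["type"] == "data" and b["instant"] not in rolled_back:
            records.extend(b["records"])
    return records


log = [
    {"type": "data", "instant": "t1", "records": ["a", "b"]},
    {"type": "data", "instant": "t2", "records": ["c"]},  # failed write
    {"type": "rollback", "target": "t2"},                 # appended, not deleted
]
print(merge_log(log))  # ['a', 'b']
```

This append-only design also speaks to question 2: appending a small command block is generally cheaper and safer (especially with concurrent readers, and on storage where in-place edits are expensive) than locating and deleting earlier log data.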
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218189301 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218177009 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793)
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218169779 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f UNKNOWN
[GitHub] [hudi] bithw1 opened a new issue, #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 opened a new issue, #6425: URL: https://github.com/apache/hudi/issues/6425 Hi, I am working with Hudi 0.9.0, and I have the following code that writes 10 records to a MOR Hudi table (one record per Spark job). There are 11 commits in total. When I look at the files written to disk, there is `NO` log file, and there are 11 parquet files. It looks like I am writing to a COW table; I am not sure where the problem is.

```
package org.example

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Order(name: String, price: String, creation_date: String)

object Hudi003_Demo {

  val spark = SparkSession.builder.appName(this.getClass.getSimpleName)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .enableHiveSupport()
    .master("local[1]")
    .getOrCreate()

  def write_data(i: Int): Unit = {
    val hudi_table_name = this.getClass.getSimpleName
    val base_path = "/data/hudi_demo/" + hudi_table_name
    import spark.implicits._
    val order = Order(name = s"order_$i", price = s"$i-11.3", creation_date = s"date-0")
    val insertData = spark.createDataset(Seq(order))

    // DataFrame write
    var writer = insertData.write.format("hudi")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "name")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "creation_date")
      .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
      .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
      .option("hoodie.insert.shuffle.parallelism", "1")
      .option("hoodie.upsert.shuffle.parallelism", "1")
      .option(HoodieWriteConfig.TABLE_NAME, hudi_table_name)
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "creation_date")
      // Write to MOR table
      .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
      .option("hoodie.compact.inline", "false")
      .option("hoodie.compact.inline.max.delta.commits", "1")

    writer.mode(SaveMode.Append)
      .save(base_path)
  }

  def test1(): Unit = {
    (0 to 10).foreach { i =>
      write_data(i)
    }
  }

  def main(args: Array[String]): Unit = {
    test1()
  }
}
```
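On the two compaction options in the snippet above: `hoodie.compact.inline` gates whether compaction is scheduled inline at all, and `hoodie.compact.inline.max.delta.commits` is the threshold of accumulated delta commits that triggers it. A minimal sketch of that trigger semantics (illustrative Python, not Hudi's actual scheduler):

```python
def compaction_due(delta_commits_since_last_compaction, max_delta_commits, inline_enabled):
    """Decide whether an inline compaction should be scheduled.

    Sketch of the `hoodie.compact.inline` /
    `hoodie.compact.inline.max.delta.commits` semantics: compaction is
    only considered when inline compaction is enabled, and fires once
    enough delta commits have accumulated since the last compaction.
    """
    return inline_enabled and delta_commits_since_last_compaction >= max_delta_commits


# With the settings in the issue, inline compaction never fires because
# `hoodie.compact.inline` is false, regardless of the threshold of 1:
print(compaction_due(11, 1, inline_enabled=False))  # False
print(compaction_due(1, 1, inline_enabled=True))    # True
```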
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218100166 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791)
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218093641 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791)
[GitHub] [hudi] sufei2009 commented on issue #6411: Hudi Record Key Data Type Must be String
sufei2009 commented on issue #6411: URL: https://github.com/apache/hudi/issues/6411#issuecomment-1218092218 We were not getting any exceptions. However, the partitions were acting inconsistently. Data was only writing to the last partition most of the time. Once we changed the primary key type to string, the partitions worked correctly.
[GitHub] [hudi] dongkelun commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
dongkelun commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218087958 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218086389 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN
[GitHub] [hudi] kylincode closed issue #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
kylincode closed issue #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema URL: https://github.com/apache/hudi/issues/6423
[GitHub] [hudi] xxWSHxx opened a new issue, #6424: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
xxWSHxx opened a new issue, #6424: URL: https://github.com/apache/hudi/issues/6424 After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema.

Steps to reproduce the behavior:

1. Spark SQL create table t1: `create table t1 ( id int, name string, price double, ts long ) using hudi location '/tmp/t1' options ( type = 'mor', primaryKey = 'id', preCombineField = 'ts' );`
2. Insert data: `insert into t1 values(1,'Tom',0.9,1000);`
3. Drop the price column: `alter table t1 drop column price;`
4. Time travel query: `select * from t1 timestamp as of '20220817161104255'`

It is found that when time travel queries historical data, the results show the latest schema (only the id, name, ts columns) instead of the historical schema.

**Expected behavior** When time travel queries historical data, the results should show the schema as of the historical time point.

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.2.2
* Storage (HDFS/S3/GCS..) : local mac os
* Running on Docker? (yes/no) : no
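For context on step 4 above: "timestamp as of T" resolves to the latest commit instant at or before T, and the issue is about which schema that snapshot is rendered with. A minimal sketch of the instant resolution itself (illustrative Python, not Hudi's implementation):

```python
def as_of_commit(commits, timestamp):
    """Resolve a time-travel timestamp to a commit instant.

    commits: sorted list of commit instant times (strings, oldest first;
    Hudi instants are lexicographically ordered timestamps). "timestamp
    as of T" reads the table state produced by the latest commit whose
    instant is <= T. Which *schema* that state is served with is exactly
    the question raised in this issue.
    """
    eligible = [c for c in commits if c <= timestamp]
    if not eligible:
        raise ValueError("no commit at or before " + timestamp)
    return eligible[-1]


# Hypothetical timeline: the insert commit, then the drop-column commit.
commits = ["20220817161104255", "20220817161300000"]
print(as_of_commit(commits, "20220817161104255"))  # 20220817161104255
```

Resolving the instant is the easy part; serving the snapshot with the schema that was active at that instant (rather than the table's latest schema) is what the reporter expected.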
[GitHub] [hudi] kylincode opened a new issue, #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
kylincode opened a new issue, #6423: URL: https://github.com/apache/hudi/issues/6423 After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema.

Steps to reproduce the behavior:

1. Spark SQL create table t1: `create table t1 ( id int, name string, price double, ts long ) using hudi location '/tmp/t1' options ( type = 'mor', primaryKey = 'id', preCombineField = 'ts' );`
2. Insert data: `insert into t1 values(1,'Tom',0.9,1000);`
3. Drop the price column: `alter table t1 drop column price;`
4. Time travel query: `select * from t1 timestamp as of '20220817161104255'`

It is found that when time travel queries historical data, the results show the latest schema (only the id, name, ts columns) instead of the historical schema.

**Expected behavior** When time travel queries historical data, the results should show the schema as of the historical time point.

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.2.2
* Storage (HDFS/S3/GCS..) : local mac os
* Running on Docker? (yes/no) : no
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218003704 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792)
[GitHub] [hudi] ToBeFinder closed issue #6371: [SUPPORT]when flink recovers from savepoint, there will be some data duplication in hudi
ToBeFinder closed issue #6371: [SUPPORT]when flink recovers from savepoint, there will be some data duplication in hudi URL: https://github.com/apache/hudi/issues/6371
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1217998066 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * b2d6b015aa283cba1949665771c8fb6caddb7b1f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10741) * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792)
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1217992427 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * b2d6b015aa283cba1949665771c8fb6caddb7b1f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10741) * ab07f137f867a065b5ae3ab422fc3a498d1e888d UNKNOWN
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
the-other-tim-brown commented on code in PR #6170: URL: https://github.com/apache/hudi/pull/6170#discussion_r947882426 ## docker/demo/config/log4j2.properties: ## @@ -0,0 +1,60 @@ +### +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +### +status = warn +name = HudiConsoleLog + +# Set everything to be logged to the console +appender.console.type = Console +appender.console.name = CONSOLE +appender.console.layout.type = PatternLayout +appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n + +# Root logger level +rootLogger.level = warn +# Root logger referring to console appender +rootLogger.appenderRef.stdout.ref = CONSOLE + +# Set the default spark-shell log level to WARN. When running the spark-shell, the +# log level for this class is used to overwrite the root logger's log level, so that +# the user can have different defaults for the shell and regular Spark apps. 
+logger.apache_spark_repl.name = org.apache.spark.repl.Main +logger.apache_spark_repl.level = warn +# Set logging of integration testsuite to INFO level +logger.hudi_integ.name = org.apache.hudi.integ.testsuite +logger.hudi_integ.level = info +# Settings to quiet third party logs that are too verbose +logger.apache_spark_jetty.name = org.spark_project.jetty +logger.apache_spark_jetty.level = warn +logger.apache_spark_jett_lifecycle.name = org.spark_project.jetty.util.component.AbstractLifeCycle Review Comment: This allows for fine-grained control over the logging for the sake of the demo. Should we leave this one instance in the repo since it's for a Docker demo? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
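The log4j2.properties under review relies on hierarchical logger-name resolution: a logger inherits the level of its most specific configured ancestor, falling back to the root. Python's standard `logging` module resolves names the same way, so the behavior can be sketched with it (logger names taken from the config above; this is an analogy, not Log4j2 itself):

```python
import logging

# Mirror the properties file: WARN on the root, INFO for the integ testsuite.
logging.getLogger().setLevel(logging.WARNING)  # rootLogger.level = warn
logging.getLogger("org.apache.hudi.integ.testsuite").setLevel(logging.INFO)

# A child logger walks up the dotted name until it finds a configured level.
integ = logging.getLogger("org.apache.hudi.integ.testsuite.SomeJob")
other = logging.getLogger("org.apache.spark.repl.Main")

print(integ.isEnabledFor(logging.INFO))  # True: inherits INFO from the testsuite logger
print(other.isEnabledFor(logging.INFO))  # False: no configured ancestor, falls back to WARN root
```

This is why the per-logger entries in the demo config can quiet only the noisy Jetty/Spark packages while leaving the integration testsuite verbose.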
[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #5113: [HUDI-3625] [RFC-56] Optimized storage layout for Cloud Object Stores
prasannarajaperumal commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r947841796 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + +# RFC-56: Federated Storage Layer + +## Proposers +- @umehrot2 + +## Approvers +- @vinoth +- @shivnarayan + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625) + +## Abstract + +As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is a risk of hitting request +throttling limits, which in turn impacts performance. In this RFC, we are proposing to support an alternate storage +layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and +significantly reduce throttling. + +In addition, we are proposing an interface that would allow users to implement their own custom strategy to +distribute the data files across cloud stores, HDFS, or on-prem storage based on their specific use cases. + +## Background + +Apache Hudi follows the traditional Hive storage layout while writing files on storage: +- Partitioned Tables: The files are distributed across multiple physical partition folders, under the table's base path. +- Non Partitioned Tables: The files are stored directly under the table's base path. + +While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when +working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle +requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). +Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits), +but there can be a 30-60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the +same table path prefix could result in these request limits being hit for the table prefix, especially as workloads +scale and several thousand files are being written/updated concurrently. This hurts performance, since retrying +failed requests reduces throughput, and can result in occasional failures if the retries cannot +succeed either and continue to be throttled. + +The traditional storage layout also tightly couples the partitions as folders under the table path. However, +some users want the flexibility to distribute files/partitions under multiple different paths across cloud stores, Review Comment: This is a nice abstraction to think about. HudiFile is a logical file path, and this gets annotated by a PhysicalLocation which contains the cloud provider creds/volume/bucket.
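The throttling problem described in the RFC comes from every object sharing the table's base-path prefix. One common mitigation, which the proposed federated layout generalizes, is to salt object keys with a stable hash so that writes fan out across many prefixes, each rate-limited independently. A minimal illustrative sketch (the function names and path scheme are hypothetical, not Hudi's actual layout):

```python
import hashlib

def hive_style_path(base: str, partition: str, file_name: str) -> str:
    """Traditional layout: every object shares the table base-path prefix."""
    return f"{base}/{partition}/{file_name}"

def hashed_prefix_path(base: str, partition: str, file_name: str) -> str:
    """Illustrative alternative: prepend a short, deterministic hash of the
    logical key, spreading objects across up to 256 distinct prefixes.
    Deterministic hashing keeps the physical path reconstructible from the
    logical (partition, file) pair without extra metadata."""
    digest = hashlib.md5(f"{partition}/{file_name}".encode()).hexdigest()
    return f"{base}/{digest[:2]}/{partition}/{file_name}"

print(hive_style_path("s3://bucket/table", "dt=2022-08-18", "f1.parquet"))
print(hashed_prefix_path("s3://bucket/table", "dt=2022-08-18", "f1.parquet"))
```

Listing a partition under such a layout requires either probing each hash bucket or consulting a metadata table that maps logical files to physical locations, which is one reason the RFC pairs the layout change with a pluggable storage-strategy interface.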