[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW
hudi-bot commented on PR #8666: URL: https://github.com/apache/hudi/pull/8666#issuecomment-1543364782 ## CI report: * 66dd335158e2800359039917a9c1690450fa9c27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16998) * bfa1909d85b174cb466d205ee4ec2af09be1850d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17015) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
hudi-bot commented on PR #8645: URL: https://github.com/apache/hudi/pull/8645#issuecomment-1543364711 ## CI report: * adfb9e2726fb19e05c700ba7e67d080548d51a60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16959) * 9e5f2984e3f00bb24e1749922458072163a0df70 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17014) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW
hudi-bot commented on PR #8666: URL: https://github.com/apache/hudi/pull/8666#issuecomment-1543359810 ## CI report: * 66dd335158e2800359039917a9c1690450fa9c27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16998) * bfa1909d85b174cb466d205ee4ec2af09be1850d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
hudi-bot commented on PR #8645: URL: https://github.com/apache/hudi/pull/8645#issuecomment-1543359740 ## CI report: * adfb9e2726fb19e05c700ba7e67d080548d51a60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16959) * 9e5f2984e3f00bb24e1749922458072163a0df70 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] koldic commented on issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean
koldic commented on issue #7209: URL: https://github.com/apache/hudi/issues/7209#issuecomment-1543356849 Yes, sorry for the late answer. It seems that another of my instances started a second delta streamer with the same save path (the table was at the same path), so one delta streamer wrote its metadata into the table of the other, and that was the problem? I am just curious: is there any way to run multiple delta streamers against the same table, for example one for each Kafka topic?
[jira] [Created] (HUDI-6201) Timeline server sometimes does not send bootstrap base path for a skeleton file
Jonathan Vexler created HUDI-6201:

Summary: Timeline server sometimes does not send bootstrap base path for a skeleton file
Key: HUDI-6201
URL: https://issues.apache.org/jira/browse/HUDI-6201
Project: Apache Hudi
Issue Type: Bug
Components: bootstrap, timeline-server
Reporter: Jonathan Vexler
Attachments: TestBootstrapRead.java

[^TestBootstrapRead.java]

In the attached file, enable the timeline server ('hoodie.embed.timeline.server'). It will occasionally fail in metadata or mixed mode because some records are null in every column except the Hudi metadata columns:

{code:java}
+---+-++-++--++--+---+-++--+--+--+---+---++---++--+--+-++-+--+--+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key |_hoodie_partition_path |_hoodie_file_name |_hoodie_is_deleted|_row_key|begin_lat |begin_lon |city_to_state|current_date|current_ts|distance_in_meters|driver|end_lat |end_lon|fare|height |nation |partition |partition_path|rider|seconds_since_epoch |timestamp|tip_history |weight |
+---+-++-++--++--+---+-++--+--+--+---+---++---++--+--+-++-+--+--+
|01 |01_4_0 |876743b0-f5e7-4289-b13b-1a0404d94380|partition_path=2015-03-16|7e1dea56-e88c-4072-be61-f4ae01feaaa3_1-138-381_20230510125841762.parquet|null |null|null |null |null |null|null |null |null |null |null |null|null |null|null |null |null |null|null |null |null |
|01 |01_4_1 |00923d1a-58fc-4d42-8953-4a47b47d738f|partition_path=2015-03-16|7e1dea56-e88c-4072-be61-f4ae01feaaa3_1-138-381_20230510125841762.parquet|null |null|null |null |null |null|null |null |null |null |null |null|null |null|null |null |null |null|null |null |null |
|20230510125841762 |20230510125841762_1_2|b318c482-8e43-4614-bdab-80946d5a9f53|partition_path=2015-03-16|7e1dea56-e88c-4072-be61-f4ae01feaaa3_1-138-381_20230510125841762.parquet|false |b318c482-8e43-4614-bdab-80946d5a9f53|0.5285807377766387|0.12835359814395741|[CA] |12 |1047178778|521899450 |driver-001|0.41394620067559684|0.08532822423986208|[42.25978252084417, USD]|[0, 0, 7, -91, -36]|[Canada]|2015-03-16|2015-03-16 |rider-001|-2845295541651788027|0|[[32.10533813167099, USD]]|0.59076524|
|01 |01_4_3 |4dcc72d7-0878-41c2-a85d-3e6374b88bb8|partition_path=2015-03-16|7e1dea56-e88c-4072-be61-f4ae01feaaa3_1-138-381_20230510125841762.parquet|null |null|null |null |null |null|null |null |null |null |null |null|null |null|null |null |null |null|null |null |null |
|01 |01_4_4 |cfa79530-fc9f-42de-a181-34c06e79d9c5|partition_path=2015-03-16|7e1dea56-e88c-4072-be61-f4ae01feaaa3_1-138-381_20230510125841762.parquet|null |null|null |null |null |null|null |null |null |null |null |null|null |null|null |null
[GitHub] [hudi] eyjian opened a new issue, #8686: [SUPPORT] is not a Parquet file (length is too low: 0)
eyjian opened a new issue, #8686: URL: https://github.com/apache/hudi/issues/8686 **Environment:** Hudi MOR+ Spark **Hudi version :** 0.12.1 **Spark version :** spark3 **Hadoop version:** 2.8.5 **Problem:** java.lang.RuntimeException: hdfs://test/warehouse/test.db/t_test/20220811/b4a54eb9-0e0c-406a-a2bd-c2f07b021277-0_8-10-991_20230421093519585.parquet is not a Parquet file (length is too low: 0) **SQL:** `insert into table_a_rt select * from table_b_rt;`
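For context on the error text in this report: a Parquet file begins and ends with the 4-byte magic `PAR1`, with a 4-byte footer length just before the trailing magic, so any file shorter than 12 bytes (including a zero-byte file) cannot be valid. The following is a minimal sketch of that sanity check, not code from Hudi or parquet-mr. The class and method names are hypothetical, it reads from the local filesystem rather than HDFS, and it ignores encrypted-footer files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical helper for illustration; checks only length and magic bytes. */
final class ParquetSanityCheckSketch {

  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
  // Leading magic + 4-byte footer length + trailing magic = 12 bytes.
  private static final int MIN_LENGTH = MAGIC.length + 4 + MAGIC.length;

  /** Returns true if the file is long enough and carries the magic at both ends. */
  static boolean looksLikeParquet(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long length = ch.size();
      if (length < MIN_LENGTH) {
        return false; // covers the zero-byte file from the report
      }
      return Arrays.equals(readAt(ch, 0), MAGIC)
          && Arrays.equals(readAt(ch, length - MAGIC.length), MAGIC);
    }
  }

  /** Reads MAGIC.length bytes starting at the given absolute position. */
  private static byte[] readAt(FileChannel ch, long position) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(MAGIC.length);
    while (buf.hasRemaining()) {
      if (ch.read(buf, position + buf.position()) < 0) {
        break; // unexpected end of file
      }
    }
    return buf.array();
  }
}
```

A check like this only identifies empty or truncated files; it says nothing about why the writer left a zero-length file behind.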
[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
danny0405 commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190632083

## rfc/rfc-69/rfc-69.md: ##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+* Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF.
+Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components.
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and
+novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format.
+Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes.
+Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout.
+Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes"
+to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up.
+
+* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions
+around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
+Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
+* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
+keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
+relational data model for
[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
danny0405 commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190631429 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@
[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190631130

## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ *
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ *
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * $HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar \
+ * --base-path \
+ * --num-days
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+      (new DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", "TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
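The statistics listed in the quoted TableSizeStats Javadoc can be computed from a plain list of file sizes. What follows is a minimal sketch, not the PR's implementation: the quoted imports suggest the PR relies on the Dropwizard `Histogram` and `Snapshot` classes, whereas this sketch swaps in a nearest-rank percentile over a sorted array. The class name `FileSizeStatsSketch`, the sample sizes, and the 1024-based unit scaling are assumptions made for illustration. Note that the median and p50 in the list are the same value.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the PR under review. */
final class FileSizeStatsSketch {

  // Same unit labels as FILE_SIZE_UNITS in the quoted class.
  private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

  /** Nearest-rank percentile over an ascending-sorted, non-empty array; p in (0, 100]. */
  static long percentile(long[] sorted, double p) {
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Formats a byte count with 1024-based units (an assumption; the PR may scale differently). */
  static String humanReadable(double bytes) {
    int unit = 0;
    while (bytes >= 1024 && unit < UNITS.length - 1) {
      bytes /= 1024;
      unit++;
    }
    return String.format(Locale.ROOT, "%.1f %s", bytes, UNITS[unit]);
  }

  public static void main(String[] args) {
    long[] sorted = {512, 2048, 1_048_576, 4096, 131_072}; // made-up file sizes in bytes
    Arrays.sort(sorted);
    long total = Arrays.stream(sorted).sum();

    System.out.println("files   = " + sorted.length);
    System.out.println("total   = " + humanReadable(total));
    System.out.println("min     = " + humanReadable(sorted[0]));
    System.out.println("max     = " + humanReadable(sorted[sorted.length - 1]));
    System.out.println("average = " + humanReadable((double) total / sorted.length));
    for (int p : new int[] {50, 90, 95, 99}) {
      System.out.println("p" + p + "     = " + humanReadable(percentile(sorted, p)));
    }
  }
}
```

Sorting once and indexing by rank keeps the sketch exact for small inputs; a reservoir-based histogram trades that exactness for bounded memory on very large tables.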
[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
amrishlal commented on code in PR #8645: URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630713 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ## @@ -0,0 +1,411 @@
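The date filtering described in the TableSizeStats Javadoc in this review thread (an optional `column name=` prefix, two accepted date layouts, and a half-open interval that includes the start date and excludes the end date) can be sketched as below. This is an illustration under stated assumptions, not the PR's code. The patterns `yyyy/M/d` and `yyyy-M-d` are assumed from the comment's examples (2023/5/5 and 2023-5-5), since the patterns as quoted show no year field. The class and method names are hypothetical, and leaving unparseable partition names unfiltered is one reading of the Javadoc.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the PR under review. */
final class PartitionDateFilterSketch {

  // Assumed patterns with an explicit year; each optional section is tried in turn.
  private static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
      .appendOptional(DateTimeFormatter.ofPattern("yyyy/M/d"))
      .appendOptional(DateTimeFormatter.ofPattern("yyyy-M-d"))
      .toFormatter();

  /** Strips an optional "column=" prefix and parses the rest as a date, if possible. */
  static Optional<LocalDate> parsePartitionDate(String partition) {
    String value = partition.substring(partition.indexOf('=') + 1);
    try {
      return Optional.of(LocalDate.parse(value, FORMATTER));
    } catch (DateTimeParseException e) {
      return Optional.empty();
    }
  }

  /** Half-open check: start <= date < endExclusive. */
  static boolean inRange(String partition, LocalDate start, LocalDate endExclusive) {
    return parsePartitionDate(partition)
        .map(d -> !d.isBefore(start) && d.isBefore(endExclusive))
        // Assumption: partitions without a recognizable date are not date-filtered.
        .orElse(true);
  }
}
```

With a half-open interval, consecutive ranges such as [2015-03-16, 2015-03-18) and [2015-03-18, 2015-03-20) never count the same partition twice.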
[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
amrishlal commented on code in PR #8645: URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630254 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ## @@ -0,0 +1,411 @@
[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
amrishlal commented on code in PR #8645: URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630112 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ## @@ -0,0 +1,411 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.utilities; + +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.view.FileSystemViewManager; +import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.hudi.metadata.HoodieTableMetadata; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import com.codahale.metrics.Histogram; +import com.codahale.metrics.Snapshot; +import com.codahale.metrics.UniformReservoir; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaSparkContext; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.Serializable; +import java.time.LocalDate; +import java.time.format.DateTimeFormatter; +import java.time.format.DateTimeFormatterBuilder; +import java.time.format.DateTimeParseException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; + +/** + * Calculate and output file size stats of data files that were modified in the half-open 
interval [start date (--start-date parameter), + * end date (--end-date parameter)). --num-days parameter can be used to select data files over last --num-days. If --start-date is + * specified, --num-days will be ignored. If none of the date parameters are set, stats will be computed over all data files of all + * partitions in the table. Note that date filtering is carried out only if the partition name has the format '[column name=]-M-d', + * '[column name=]/M/d'. By default, only table level file size stats are printed. If --partition-status option is used, partition + * level file size stats also get printed. + * + * The following stats are calculated: + * Number of files. + * Total table size. + * Minimum file size + * Maximum file size + * Average file size + * Median file size + * p50 file size + * p90 file size + * p95 file size + * p99 file size + * + * Sample spark-submit command: + * ./bin/spark-submit \ + * --class org.apache.hudi.utilities.TableSizeStats \ + * $HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar \ + * --base-path \ + * --num-days + */ +public class TableSizeStats implements Serializable { + + private static final Logger LOG = LoggerFactory.getLogger(TableSizeStats.class); + + // Date formatter for parsing partition dates (example: 2023/5/5/ or 2023-5-5). + private static final DateTimeFormatter DATE_FORMATTER = + (new DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter(); + + // File size stats will be displayed in the units specified below. + private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", "TB"}; + + // Spark context + private transient JavaSparkContext jsc; + // config + private Config cfg; + // Properties with source, hoodie client, key generator etc. + private TypedProperties props;
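The stats listed in the Javadoc (minimum, maximum, average and percentile file sizes, printed in B/KB/MB/GB/TB) can be illustrated with a small sketch. The imports in the diff suggest the PR relies on the Codahale `Histogram` and `Snapshot` classes; the version below deliberately swaps in a plain nearest-rank percentile over a sorted array so that it runs with the JDK alone. The 1024-based unit conversion and all names here are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch, not the PR's code: file size stats with a nearest-rank percentile.
class FileSizeStatsSketch {

  // Same unit labels as the FILE_SIZE_UNITS constant in the quoted diff.
  private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

  // Nearest-rank percentile; the input must be sorted in ascending order and non-empty.
  static long percentile(long[] sortedSizes, double p) {
    if (sortedSizes.length == 0) {
      throw new IllegalArgumentException("no file sizes given");
    }
    int rank = (int) Math.ceil(p / 100.0 * sortedSizes.length);
    return sortedSizes[Math.max(rank, 1) - 1];
  }

  // Assumption: units step by a factor of 1024.
  static String humanReadable(double bytes) {
    int unit = 0;
    while (bytes >= 1024 && unit < UNITS.length - 1) {
      bytes /= 1024;
      unit++;
    }
    return String.format(Locale.ROOT, "%.2f %s", bytes, UNITS[unit]);
  }

  public static void main(String[] args) {
    // Made-up file sizes in bytes, for demonstration only.
    long[] sizes = {4_096, 1_048_576, 52_428_800, 104_857_600, 125_829_120};
    Arrays.sort(sizes);
    long total = Arrays.stream(sizes).sum();
    System.out.println("Number of files: " + sizes.length);
    System.out.println("Total size: " + humanReadable(total));
    System.out.println("Minimum file size: " + humanReadable(sizes[0]));
    System.out.println("Maximum file size: " + humanReadable(sizes[sizes.length - 1]));
    System.out.println("Average file size: " + humanReadable((double) total / sizes.length));
    for (double p : new double[] {50, 90, 95, 99}) {
      System.out.println("p" + (int) p + " file size: " + humanReadable(percentile(sizes, p)));
    }
  }
}
```

A reservoir-based histogram trades exactness for bounded memory on very large tables, whereas this sketch sorts every size and is exact but holds all values in memory.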
[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters
amrishlal commented on code in PR #8645: URL: https://github.com/apache/hudi/pull/8645#discussion_r1190629741 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ## @@ -0,0 +1,411 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.utilities; + +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.view.FileSystemViewManager; +import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.hudi.metadata.HoodieTableMetadata; + +import com.beust.jcommander.JCommander; +import com.beust.jcommander.Parameter; +import com.codahale.metrics.Histogram; +import com.codahale.metrics.Snapshot; +import com.codahale.metrics.UniformReservoir; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaSparkContext; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.Serializable; +import java.time.LocalDate; +import java.time.format.DateTimeFormatter; +import java.time.format.DateTimeFormatterBuilder; +import java.time.format.DateTimeParseException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; + +/** + * Calculate and output file size stats of data files that were modified in the half-open 
interval [start date (--start-date parameter), Review Comment: Fixed. ## hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java: ## @@ -0,0 +1,411 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.utilities; + +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.view.FileSystemViewManager; +import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.exception.TableNotFoundException; +import
[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
danny0405 commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190629034 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatibl
BruceKellan commented on issue #8685: URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543321785

That's my Flink DDL:

```sql
create table sink_table (
  uniqueKey string,
  key string,
  `offset` bigint,
  `time` bigint,
  sid int,
  plat string,
  pid string,
  gid string,
  account string,
  playerid string,
  kafka_ts bigint,
  consume_ts bigint,
  prop map,
  `day` string,
  `type` string
) partitioned by (`day`, `type`)
with (
  'connector' = 'hudi',
  'path' = 'x',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert',
  'write.tasks' = '12',
  'write.task.max.size' = '1024',
  'hive_sync.partition_fields' = 'day,type',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
  'hadoop.io.compression.codec.zstd.level' = '3',
  'hadoop.parquet.compression.codec.zstd.level' = '3',
  'hoodie.datasource.write.recordkey.field' = 'uniqueKey',
  'hoodie.datasource.write.partitionpath.field' = 'day,type',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'hoodie.datasource.write.keygenerator.type' = 'COMPLEX',
  'hoodie.parquet.compression.codec' = 'zstd',
  'hoodie.parquet.dictionary.enabled' = 'true',
  'write.insert.cluster' = 'false',
  'clean.retain_commits' = '30',
  'clustering.schedule.enabled' = 'true',
  'clustering.delta_commits' = '8',
  'clustering.async.enabled' = 'true',
  'clustering.plan.strategy.target.file.max.bytes' = '104857600',
  'clustering.plan.strategy.small.file.limit' = '30',
  'clustering.plan.strategy.max.num.groups' = '1500',
  'hoodie.clustering.plan.strategy.single.group.clustering.enabled' = 'false',
  'hoodie.archive.merge.enable' = 'true',
  'hoodie.archive.merge.small.file.limit.bytes' = '536870912',
  'metadata.enabled' = 'false'
)
```
[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatibl
BruceKellan commented on issue #8685: URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543319977 BTW, I have merged #8587 on our branch.
[GitHub] [hudi] hudi-bot commented on pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle
hudi-bot commented on PR #8675: URL: https://github.com/apache/hudi/pull/8675#issuecomment-1543320126 ## CI report: * eb510e9ae89ac152e29375c44becfadd02506674 UNKNOWN * f1376bc500e0b86fe595dbfa207d5bd77b8ac5ab UNKNOWN * fd1a22a9c708bf20c05bc4455121a6ba108b5869 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16997) * 875f77fd40a5b93dd7fcfa628e981f575372d6e7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17013)
[GitHub] [hudi] danny0405 commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible
danny0405 commented on issue #8685: URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543316591 Did you create the table using Spark? We recently made a compatibility fix for Spark and Flink schemas: https://github.com/apache/hudi/pull/8587. Another speculation is the primary key: it is not nullable in the Flink definition but is nullable in Spark (Spark does not have that constraint).
[GitHub] [hudi] hudi-bot commented on pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle
hudi-bot commented on PR #8675: URL: https://github.com/apache/hudi/pull/8675#issuecomment-1543315375 ## CI report: * eb510e9ae89ac152e29375c44becfadd02506674 UNKNOWN * f1376bc500e0b86fe595dbfa207d5bd77b8ac5ab UNKNOWN * fd1a22a9c708bf20c05bc4455121a6ba108b5869 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16997) * 875f77fd40a5b93dd7fcfa628e981f575372d6e7 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap
danny0405 commented on code in PR #8640: URL: https://github.com/apache/hudi/pull/8640#discussion_r1190620668 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/SparkBootstrapCommitActionExecutor.java: ## @@ -326,6 +326,8 @@ private Map>>> listAndPr if (!(selector instanceof FullRecordBootstrapModeSelector)) { FullRecordBootstrapModeSelector fullRecordBootstrapModeSelector = new FullRecordBootstrapModeSelector(config); result.putAll(fullRecordBootstrapModeSelector.select(folders)); + } else { +result.putAll(selector.select(folders)); } Review Comment: Can you elaborate a little on what we are fixing here?
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
SteNicholas commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190620224 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" Review Comment: Generally speaking, the current implementation of Hudi is worked as micro-batch data lake, not a really streaming lakehouse. 
Will we propose to build a truly streaming lakehouse with Hudi?
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
SteNicholas commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190618910 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] hudi-bot commented on pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap
hudi-bot commented on PR #8640: URL: https://github.com/apache/hudi/pull/8640#issuecomment-1543309415 ## CI report: * f1367ab0be48cb01834d3ada76bf2198c620d38c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16848) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17012)
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
SteNicholas commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190617763 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
SteNicholas commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190616885 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] xushiyan commented on a diff in pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle
xushiyan commented on code in PR #8675: URL: https://github.com/apache/hudi/pull/8675#discussion_r1190611852 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieMergedReadHandle.java: ## @@ -38,113 +41,139 @@ import org.apache.avro.generic.GenericRecord; import org.apache.avro.generic.IndexedRecord; +import org.apache.spark.api.java.JavaRDD; import org.junit.jupiter.params.ParameterizedTest; -import org.junit.jupiter.params.provider.EnumSource; +import org.junit.jupiter.params.provider.Arguments; +import org.junit.jupiter.params.provider.MethodSource; import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; +import java.util.Comparator; import java.util.List; -import java.util.Properties; import java.util.stream.Collectors; - -import static org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields; -import static org.apache.hudi.avro.HoodieAvroUtils.createHoodieRecordFromAvro; -import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.AVRO_SCHEMA; -import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH; +import java.util.stream.Stream; + +import static org.apache.hudi.common.model.HoodieTableType.COPY_ON_WRITE; +import static org.apache.hudi.common.model.HoodieTableType.MERGE_ON_READ; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA_STR; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA_WITH_METAFIELDS; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getDeletes; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getInserts; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getKeyGenProps; +import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getPayloadProps; +import static org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getUpdates; import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC; +import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors; import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertTrue; public class TestHoodieMergedReadHandle extends SparkClientFunctionalTestHarness { - private HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(); + private static Stream<Arguments> avroPayloadClasses() { +return Stream.of( +Arguments.of(COPY_ON_WRITE, OverwriteWithLatestAvroPayload.class), +Arguments.of(COPY_ON_WRITE, OverwriteNonDefaultsWithLatestAvroPayload.class), +Arguments.of(COPY_ON_WRITE, PartialUpdateAvroPayload.class), +Arguments.of(COPY_ON_WRITE, DefaultHoodieRecordPayload.class), +Arguments.of(MERGE_ON_READ, OverwriteWithLatestAvroPayload.class), +Arguments.of(MERGE_ON_READ, OverwriteNonDefaultsWithLatestAvroPayload.class), +Arguments.of(MERGE_ON_READ, PartialUpdateAvroPayload.class), +Arguments.of(MERGE_ON_READ, DefaultHoodieRecordPayload.class) Review Comment: Purposely left out the AWSDmsPayload and Debezium payload cases (https://github.com/apache/hudi/pull/8675/commits/875f77fd40a5b93dd7fcfa628e981f575372d6e7) to land this patch. When doing a proper fix for payloads, those classes can be added here to cover the custom delete marker scenario.
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190608619 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606838 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606600 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. Review Comment: Well, ingest is completely incremental now - across industry. Once upon a time, it was unthinkable. :)
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606252 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] danny0405 closed issue #8499: [SUPPORT] Support partial insert in merge into command
danny0405 closed issue #8499: [SUPPORT] Support partial insert in merge into command URL: https://github.com/apache/hudi/issues/8499
[hudi] branch master updated (65172d3d66a -> 9963b50ee17)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 65172d3d66a [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 (#8082) add 9963b50ee17 [HUDI-6105] Support partial insert in MERGE INTO command (#8597) No new revisions were added by this update. Summary of changes: .../hudi/command/MergeIntoHoodieTableCommand.scala | 38 +++-- .../apache/spark/sql/hudi/TestMergeIntoTable.scala | 39 ++ 2 files changed, 66 insertions(+), 11 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command
danny0405 merged PR #8597: URL: https://github.com/apache/hudi/pull/8597
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190605292 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] danny0405 commented on pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command
danny0405 commented on PR #8597: URL: https://github.com/apache/hudi/pull/8597#issuecomment-1543289530 The failed test: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=16733=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7=10927 `testArchivalWithMultiWriters` is flaky and cannot be reproduced locally, so I will merge it soon~
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604359 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. Review Comment: +1.
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604259 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
vinothchandar commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604067 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model
[GitHub] [hudi] weimingdiit commented on pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap
weimingdiit commented on PR #8640: URL: https://github.com/apache/hudi/pull/8640#issuecomment-1543284008 @hudi-bot run azure
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle
nsivabalan commented on code in PR #8675: URL: https://github.com/apache/hudi/pull/8675#discussion_r1190600637 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieAdaptablePayloadDataGenerator.java: ## @@ -0,0 +1,228 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.testutils; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.model.AWSDmsAvroPayload; +import org.apache.hudi.common.model.DefaultHoodieRecordPayload; +import org.apache.hudi.common.model.HoodieAvroIndexedRecord; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.MetadataValues; +import org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload; +import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload; +import org.apache.hudi.common.model.PartialUpdateAvroPayload; +import org.apache.hudi.common.model.debezium.DebeziumConstants; +import org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload; +import org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.util.Option; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Properties; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.model.HoodieRecord.HOODIE_IS_DELETED_FIELD; +import static org.apache.hudi.common.util.ValidationUtils.checkArgument; + +public class HoodieAdaptablePayloadDataGenerator { + + public static final Schema SCHEMA = SchemaTestUtil.getSchemaFromResource(HoodieAdaptablePayloadDataGenerator.class, "/adaptable-payload.avsc"); + public static final Schema SCHEMA_WITH_METAFIELDS = HoodieAvroUtils.addMetadataFields(SCHEMA, false); + public static final String SCHEMA_STR = SCHEMA.toString(); + + public static Properties getKeyGenProps(Class payloadClass) { +String orderingField = new RecordGen(payloadClass).getOrderingField(); +Properties props = new Properties(); 
+props.put("hoodie.datasource.write.recordkey.field", "id"); +props.put("hoodie.datasource.write.partitionpath.field", "pt"); +props.put("hoodie.datasource.write.precombine.field", orderingField); +props.put(HoodieTableConfig.RECORDKEY_FIELDS.key(), "id"); +props.put(HoodieTableConfig.PARTITION_FIELDS.key(), "pt"); +props.put(HoodieTableConfig.PRECOMBINE_FIELD.key(), orderingField); +return props; + } + + public static Properties getPayloadProps(Class payloadClass) { +String orderingField = new RecordGen(payloadClass).getOrderingField(); +Properties props = new Properties(); +props.put("hoodie.compaction.payload.class", payloadClass.getName()); +props.put("hoodie.payload.event.time.field", orderingField); +props.put("hoodie.payload.ordering.field", orderingField); +return props; + } + + public static List getInserts(int n, String partition, long ts, Class payloadClass) throws IOException { +return getInserts(n, new String[] {partition}, ts, payloadClass); + } + + public static List getInserts(int n, String[] partitions, long ts, Class payloadClass) throws IOException { +List inserts = new ArrayList<>(); +RecordGen recordGen = new RecordGen(payloadClass); +for (GenericRecord r : getInserts(n, partitions, ts, recordGen)) { + inserts.add(getHoodieRecord(r, recordGen.getPayloadClass())); +} +return inserts; + } + + private static List getInserts(int n, String[] partitions, long ts, RecordGen recordGen) { +return IntStream.range(0, n).mapToObj(id -> { + String pt = partitions.length == 0 ? "" : partitions[id % partitions.length]; + return getInsert(id, pt, ts, recordGen); +}).collect(Collectors.toList()); + } + + private static GenericRecord getInsert(int id, String pt, long ts, RecordGen recordGen) { +GenericRecord r = new GenericData.Record(SCHEMA); +
[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] Flink append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible types: req
BruceKellan commented on issue #8685: URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543272627 @danny0405 can you take a look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] BruceKellan opened a new issue, #8685: [SUPPORT] Flink append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible types: re
BruceKellan opened a new issue, #8685: URL: https://github.com/apache/hudi/issues/8685 **To Reproduce** I am using flink 1.13.6 and hudi 0.13.0. When aysnc clustering job scheduled, will throw exception: ```java 2023-05-11 11:06:35,604 ERROR org.apache.hudi.sink.clustering.ClusteringOperator [] - Executor executes action [Execute clustering for instant 2023050411858 from task 1] error org.apache.hudi.exception.HoodieException: unable to read next record from parquet file at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.hudi.common.util.MappingIterator.hasNext(MappingIterator.java:35) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.hudi.common.util.MappingIterator.hasNext(MappingIterator.java:35) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811) ~[?:1.8.0_332] at java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:295) ~[?:1.8.0_332] at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:207) ~[?:1.8.0_332] at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:162) ~[?:1.8.0_332] at java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:301) ~[?:1.8.0_332] at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681) ~[?:1.8.0_332] at org.apache.hudi.client.utils.ConcatenatingIterator.hasNext(ConcatenatingIterator.java:45) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.hudi.sink.clustering.ClusteringOperator.doClustering(ClusteringOperator.java:261) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.hudi.sink.clustering.ClusteringOperator.lambda$processElement$0(ClusteringOperator.java:194) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_332] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_332] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332] Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file oss://xxx/hudi/datalog/today/db/table/day=2023-05-11/type=aa/7b1a5921-1d37-435e-b74a-0c0a5356b7bc-20_5-8-0_20230511105641808.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] ... 15 more Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. 
incompatible types: required binary key (STRING) != optional binary key (STRING) at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:69) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.schema.GroupType.accept(GroupType.java:256) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:69) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.schema.GroupType.accept(GroupType.java:256) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57) ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0] at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-1543269796
> Hardcoding Murmur is likely a good idea
Not hardcoding; I mean making it configurable, so that users choose the algorithm they want to use.
> it would allow to support both spark 2 and all spark 3 releases.
We can dig into that further, but for now I would rather keep it simple.
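To make the "configurable rather than hardcoded" idea concrete, here is a rough sketch of a bucket-id function whose hash algorithm is selected by a setting. Everything in it is an assumption for illustration: the `Algorithm` enum, the method names and the `(hash & Integer.MAX_VALUE) % numBuckets` convention are not taken from the PR. The Murmur variant is the standard MurmurHash3 x86 32-bit function; engines may use their own variants and seeds, so this would not necessarily reproduce their bucket ids.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: the enum, method names and bucket convention are
// assumptions made for this example, not Hudi, Hive or Spark APIs.
class ConfigurableBucketHash {

  enum Algorithm { JAVA_HASHCODE, MURMUR3_32 }

  // Hashes the key with whichever algorithm the caller configured.
  static int hash(String key, Algorithm algorithm) {
    switch (algorithm) {
      case JAVA_HASHCODE:
        return key.hashCode();
      case MURMUR3_32:
        return murmur3_32(key.getBytes(StandardCharsets.UTF_8), 0);
      default:
        throw new IllegalArgumentException("Unknown algorithm: " + algorithm);
    }
  }

  // Maps a key to a bucket in [0, numBuckets); the sign bit is cleared first.
  static int bucketFor(String key, int numBuckets, Algorithm algorithm) {
    return (hash(key, algorithm) & Integer.MAX_VALUE) % numBuckets;
  }

  private static int mixK(int k) {
    k *= 0xcc9e2d51;
    k = Integer.rotateLeft(k, 15);
    k *= 0x1b873593;
    return k;
  }

  // Standard MurmurHash3 x86 32-bit over a byte array (little-endian blocks).
  static int murmur3_32(byte[] data, int seed) {
    int h = seed;
    int len = data.length;
    int i = 0;
    while (i + 4 <= len) {
      int k = (data[i] & 0xff)
          | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16)
          | ((data[i + 3] & 0xff) << 24);
      h ^= mixK(k);
      h = Integer.rotateLeft(h, 13);
      h = h * 5 + 0xe6546b64;
      i += 4;
    }
    // Remaining 1 to 3 tail bytes, if any.
    int k = 0;
    int tail = len & 3;
    if (tail == 3) {
      k ^= (data[i + 2] & 0xff) << 16;
    }
    if (tail >= 2) {
      k ^= (data[i + 1] & 0xff) << 8;
    }
    if (tail >= 1) {
      k ^= (data[i] & 0xff);
      h ^= mixK(k);
    }
    // Finalization mix.
    h ^= len;
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }
}
```

The point of the design is that writer and reader must agree on the algorithm, so the chosen value would have to be persisted with the table rather than picked per job.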
[GitHub] [hudi] danny0405 commented on a diff in pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
danny0405 commented on code in PR #8682: URL: https://github.com/apache/hudi/pull/8682#discussion_r1190589930

## .github/workflows/bot.yml: ##

@@ -27,20 +27,8 @@ jobs:
     strategy:
       matrix:
         include:
-          - scalaProfile: "scala-2.11"
-            sparkProfile: "spark2.4"
-
-          - scalaProfile: "scala-2.12"
-            sparkProfile: "spark2.4"
-
-          - scalaProfile: "scala-2.12"
-            sparkProfile: "spark3.1"
-
-          - scalaProfile: "scala-2.12"
-            sparkProfile: "spark3.2"
-
           - scalaProfile: "scala-2.12"
-            sparkProfile: "spark3.3"
+            sparkProfile: "spark3.4"

Review Comment: Why do we have so many changes?
[jira] [Closed] (HUDI-5868) Upgrade Spark to 3.3.2
[ https://issues.apache.org/jira/browse/HUDI-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-5868.
Resolution: Fixed

Fixed via master branch: 65172d3d66ad299f31b0f21ba004024236006634

> Upgrade Spark to 3.3.2
> --
>
> Key: HUDI-5868
> URL: https://issues.apache.org/jira/browse/HUDI-5868
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Rahil Chertara
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5868) Upgrade Spark to 3.3.2
[ https://issues.apache.org/jira/browse/HUDI-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-5868:
Fix Version/s: 0.13.1, 0.14.0

> Upgrade Spark to 3.3.2
> --
>
> Key: HUDI-5868
> URL: https://issues.apache.org/jira/browse/HUDI-5868
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Rahil Chertara
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
[hudi] branch master updated (5bab66498c8 -> 65172d3d66a)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 5bab66498c8 [HUDI-6122] Unify call procedure options (#8537)
add 65172d3d66a [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 (#8082)

No new revisions were added by this update.

Summary of changes:
.../org/apache/hudi/io/storage/HoodieSparkFileReaderFactory.java | 2 ++
.../src/main/scala/org/apache/hudi/HoodieSparkUtils.scala | 1 +
.../src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala | 5 -
.../datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala | 6 +-
pom.xml | 2 ++
5 files changed, 14 insertions(+), 2 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #8082: [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2
danny0405 merged PR #8082: URL: https://github.com/apache/hudi/pull/8082
[GitHub] [hudi] houhang1005 closed pull request #8665: [HUDI-6190] Fix the default value of RECORD_KEY_FIELD.
houhang1005 closed pull request #8665: [HUDI-6190] Fix the default value of RECORD_KEY_FIELD. URL: https://github.com/apache/hudi/pull/8665
[GitHub] [hudi] danny0405 commented on a diff in pull request #8659: [HUDI-6155] Fix cleaner based on hours for earliest commit to retain
danny0405 commented on code in PR #8659: URL: https://github.com/apache/hudi/pull/8659#discussion_r1190585368

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java: ##

@@ -144,4 +144,10 @@ public static boolean isValidInstantTime(String instantTime) {
       return false;
     }
   }
+
+  private static ZoneId getZoneId() {
+    return commitTimeZone.equals(HoodieTimelineTimeZone.LOCAL)
+        ? ZoneId.systemDefault()

Review Comment: See the discussion we had in https://github.com/apache/hudi/pull/8631
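The quoted hunk is cut off after the `LOCAL` branch, so the other branch is not visible here. As a hedged sketch of the idea, the following assumes the non-local case resolves to UTC and that instant times use the `yyyyMMddHHmmssSSS` pattern; the enum and method names are stand-ins invented for this example, not the PR's code.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative sketch: TimelineTimeZone stands in for HoodieTimelineTimeZone,
// and the UTC fallback plus the timestamp pattern are assumptions.
class TimelineZoneSketch {

  enum TimelineTimeZone { LOCAL, UTC }

  private static final DateTimeFormatter INSTANT_FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

  // Resolves the zone used to render instant times.
  static ZoneId zoneIdFor(TimelineTimeZone tz) {
    return tz == TimelineTimeZone.LOCAL ? ZoneId.systemDefault() : ZoneOffset.UTC;
  }

  // Computes the earliest instant time to retain for an hours-based cleaning policy.
  static String earliestTimeToRetain(Instant now, int hoursRetained, TimelineTimeZone tz) {
    ZonedDateTime cutoff = now.atZone(zoneIdFor(tz)).minusHours(hoursRetained);
    return cutoff.format(INSTANT_FORMAT);
  }

  public static void main(String[] args) {
    // prints 20230511070000000
    System.out.println(
        earliestTimeToRetain(Instant.parse("2023-05-11T12:00:00Z"), 5, TimelineTimeZone.UTC));
  }
}
```

The reason the zone matters: commit times are compared as formatted strings, so a cutoff rendered in a different zone than the one used to write the timeline would shift the retention window by the zone offset.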
[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.
jenu9417 commented on issue #7991: URL: https://github.com/apache/hudi/issues/7991#issuecomment-1543243868 @nsivabalan / @HEPBO3AH I will check whether this issue is fixed in the new version. I also want to understand how the various types of API calls (specifically LIST and HEAD) correlate with each write to a single partition: for each write to one partition, how many GET, HEAD, PUT, and LIST operations happen? This will help us estimate costs for the project effectively. Can you please provide some insights here, or point to any corresponding documentation?
[GitHub] [hudi] beyond1920 commented on a diff in pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command
beyond1920 commented on code in PR #8597: URL: https://github.com/apache/hudi/pull/8597#discussion_r1190574858

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala: ##

@@ -467,11 +471,19 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie
           case None =>
             // In case partial assignments are allowed and there's no corresponding conditional assignment,
             // create a self-assignment for the target table's attribute
-            if (allowPartialAssignments) {
-              Assignment(attr, attr)
-            } else {
-              throw new AnalysisException(s"Assignment expressions have to assign every attribute of target table " +
-                s"(provided: `${assignments.map(_.sql).mkString(",")}`")
+            partialAssigmentMode match {
+              case Some(mode) =>
+                mode match {
+                  case PartialAssignmentMode.NULL_VALUE =>
+                    Assignment(attr, Literal(null))
+                  case PartialAssignmentMode.ORIGINAL_VALUE =>
+                    Assignment(attr, attr)
+                  case PartialAssignmentMode.DEFAULT_VALUE =>
+                    Assignment(attr, Literal.default(attr.dataType))
+                }
+              case _ =>
+                throw new AnalysisException(s"Assignment expressions have to assign every attribute of target table " +

Review Comment: `Delete` would not hit this branch.
![image](https://github.com/apache/hudi/assets/1525333/8917cb76-fc4d-4077-9a17-39f7e6d89a8a) ![image](https://github.com/apache/hudi/assets/1525333/b642f4ef-fc5a-4dd1-a87a-dc27553f886c)
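The three modes in the hunk above decide what an unassigned target column receives. The following is a simplified model of that decision in plain Java, operating on maps instead of Spark expressions; the class, the `resolve` method and the per-column `defaults` map are hypothetical and only mirror the branch structure shown in the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified model, not Spark or Hudi code: rows are maps from column name to value.
class PartialAssignmentSketch {

  enum Mode { NULL_VALUE, ORIGINAL_VALUE, DEFAULT_VALUE }

  // Builds the output row: explicitly assigned columns win, the rest follow the mode.
  static Map<String, Object> resolve(
      Map<String, Object> original,
      Map<String, Object> assigned,
      Map<String, Object> defaults,
      Optional<Mode> mode) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String column : original.keySet()) {
      if (assigned.containsKey(column)) {
        out.put(column, assigned.get(column));
        continue;
      }
      if (!mode.isPresent()) {
        // No partial-assignment mode configured: every column must be assigned.
        throw new IllegalArgumentException(
            "Assignment expressions have to assign every attribute of target table (missing: "
                + column + ")");
      }
      switch (mode.get()) {
        case NULL_VALUE:
          out.put(column, null);
          break;
        case ORIGINAL_VALUE:
          out.put(column, original.get(column));
          break;
        case DEFAULT_VALUE:
          out.put(column, defaults.get(column));
          break;
      }
    }
    return out;
  }
}
```

For a target row `{id=1, name=a, age=30}` with only `name` assigned to `b`, `ORIGINAL_VALUE` keeps `age=30`, `NULL_VALUE` sets `age` to null, and `DEFAULT_VALUE` uses whatever default the column type provides.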
[hudi] branch master updated: [HUDI-6122] Unify call procedure options (#8537)
This is an automated email from the ASF dual-hosted git repository. biyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 5bab66498c8 [HUDI-6122] Unify call procedure options (#8537) 5bab66498c8 is described below commit 5bab66498c857c32e042488db72a018b83ea926a Author: Zouxxyy AuthorDate: Thu May 11 09:55:32 2023 +0800 [HUDI-6122] Unify call procedure options (#8537) --- .../scala/org/apache/hudi/HoodieCLIUtils.scala | 8 + .../procedures/ArchiveCommitsProcedure.scala | 4 +- .../hudi/command/procedures/BaseProcedure.scala| 48 ++-- .../procedures/CommitsCompareProcedure.scala | 4 +- .../command/procedures/CopyToTableProcedure.scala | 4 +- .../hudi/command/procedures/CopyToTempView.scala | 4 +- .../procedures/CreateMetadataTableProcedure.scala | 2 +- .../procedures/CreateSavepointProcedure.scala | 6 +- .../command/procedures/DeleteMarkerProcedure.scala | 4 +- .../procedures/DeleteMetadataTableProcedure.scala | 2 +- .../procedures/DeleteSavepointProcedure.scala | 6 +- .../procedures/ExportInstantsProcedure.scala | 4 +- .../procedures/HdfsParquetImportProcedure.scala| 14 +- .../hudi/command/procedures/HelpProcedure.scala| 4 +- .../command/procedures/HiveSyncProcedure.scala | 2 +- .../procedures/InitMetadataTableProcedure.scala| 2 +- .../command/procedures/ProcedureParameter.scala| 7 +- .../RepairAddpartitionmetaProcedure.scala | 2 +- .../RepairCorruptedCleanFilesProcedure.scala | 2 +- .../procedures/RepairDeduplicateProcedure.scala| 6 +- .../RepairMigratePartitionMetaProcedure.scala | 2 +- .../RepairOverwriteHoodiePropsProcedure.scala | 4 +- .../RollbackToInstantTimeProcedure.scala | 4 +- .../procedures/RollbackToSavepointProcedure.scala | 6 +- .../command/procedures/RunBootstrapProcedure.scala | 10 +- .../command/procedures/RunCleanProcedure.scala | 48 ++-- .../procedures/RunClusteringProcedure.scala| 34 ++- .../procedures/RunCompactionProcedure.scala| 
15 +- .../procedures/ShowArchivedCommitsProcedure.scala | 2 +- .../procedures/ShowBootstrapMappingProcedure.scala | 2 +- .../ShowBootstrapPartitionsProcedure.scala | 2 +- .../procedures/ShowClusteringProcedure.scala | 4 +- .../ShowCommitExtraMetadataProcedure.scala | 6 +- .../procedures/ShowCommitFilesProcedure.scala | 4 +- .../procedures/ShowCommitPartitionsProcedure.scala | 4 +- .../procedures/ShowCommitWriteStatsProcedure.scala | 4 +- .../command/procedures/ShowCommitsProcedure.scala | 2 +- .../procedures/ShowCompactionProcedure.scala | 4 +- .../procedures/ShowFileSystemViewProcedure.scala | 6 +- .../procedures/ShowFsPathDetailProcedure.scala | 2 +- .../ShowHoodieLogFileMetadataProcedure.scala | 4 +- .../ShowHoodieLogFileRecordsProcedure.scala| 4 +- .../procedures/ShowInvalidParquetProcedure.scala | 2 +- .../ShowMetadataTableFilesProcedure.scala | 2 +- .../ShowMetadataTablePartitionsProcedure.scala | 2 +- .../ShowMetadataTableStatsProcedure.scala | 2 +- .../procedures/ShowRollbacksProcedure.scala| 6 +- .../procedures/ShowSavepointsProcedure.scala | 4 +- .../procedures/ShowTablePropertiesProcedure.scala | 4 +- .../procedures/StatsFileSizeProcedure.scala| 2 +- .../StatsWriteAmplificationProcedure.scala | 2 +- .../procedures/UpgradeOrDowngradeProcedure.scala | 4 +- .../procedures/ValidateHoodieSyncProcedure.scala | 10 +- .../ValidateMetadataTableFilesProcedure.scala | 2 +- .../sql/hudi/procedure/TestCleanProcedure.scala| 269 ++--- .../hudi/procedure/TestCompactionProcedure.scala | 45 56 files changed, 403 insertions(+), 261 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala index 5f0cba6fd7c..c9f5a8a1215 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala @@ 
-22,6 +22,7 @@ package org.apache.hudi import org.apache.hudi.avro.model.HoodieClusteringGroup import org.apache.hudi.client.SparkRDDWriteClient import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.util.StringUtils import org.apache.spark.SparkException import org.apache.spark.api.java.JavaSparkContext import org.apache.spark.sql.SparkSession @@ -90,4 +91,11 @@ object HoodieCLIUtils { throw new SparkException(s"Unsupported identifier $table") } } + + def
[GitHub] [hudi] YannByron merged pull request #8537: [HUDI-6122] Unify options in call procedure
YannByron merged PR #8537: URL: https://github.com/apache/hudi/pull/8537
[GitHub] [hudi] BruceKellan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
BruceKellan commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190542924 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments
hudi-bot commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1543040480 ## CI report: * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17006) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542977709 ## CI report: * e2785f4675ddf74582ff34590608a5d71c5e9a2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17008) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542969346 ## CI report: * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17007) * e2785f4675ddf74582ff34590608a5d71c5e9a2d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] samserpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException
samserpoosh commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1542967885 @the-other-tim-brown There's a good chance that this is caused by the **input Kafka topic's events** and how they're serialized/deserialized (i.e. the way Debezium Connector is currently shaping and publishing the change-log messages to Kafka). I leveraged the `kafka-avro-console-consumer` that comes with Confluent's Schema Registry, and here's how my dummy/test table's change-log events look like: ```json { "before": null, "after": { "..samser_customers.Value": { "id": 1, "name": "Bob", "age": 40, "created_at": { "long": 1683661733071814 }, "event_ts": { "long": 168198480 } } }, "source": { "version": "2.1.2.Final", "connector": "postgresql", "name": "pg_dev8", "ts_ms": 1683734195621, "snapshot": { "string": "first_in_data_collection" }, "db": "", "sequence": { "string": "[null,\"1213462492184\"]" }, "schema": "public", "table": "samser_customers", "txId": { "long": 806227 }, "lsn": { "long": 1213462492184 }, "xmin": null }, "op": "r", "ts_ms": { "long": 1683734196050 }, "transaction": null } ``` And here's the corresponding schema which was established in the Schema Registry: ```json { "type": "record", "name": "Envelope", "namespace": "..samser_customers", "fields": [ { "name": "before", "type": [ "null", { "type": "record", "name": "Value", "fields": [ { "name": "id", "type": { "type": "int", "connect.default": 0 }, "default": 0 }, { "name": "name", "type": "string" }, { "name": "age", "type": "int" }, { "name": "created_at", "type": [ "null", { "type": "long", "connect.version": 1, "connect.name": "io.debezium.time.MicroTimestamp" } ], "default": null }, { "name": "event_ts", "type": [ "null", "long" ], "default": null } ], "connect.name": "..samser_customers.Value" } ], "default": null }, { "name": "after", "type": [ "null", "Value" ], "default": null }, { "name": "source", "type": { "type": "record", "name": "Source", "namespace": 
"io.debezium.connector.postgresql", "fields": [ { "name": "version", "type": "string" }, { "name": "connector", "type": "string" }, { "name": "name", "type": "string" }, { "name": "ts_ms", "type": "long" }, { "name": "snapshot", "type": [ { "type": "string", "connect.version": 1, "connect.parameters": { "allowed": "true,last,false,incremental" }, "connect.default": "false", "connect.name": "io.debezium.data.Enum" }, "null" ], "default": "false" }, { "name": "db", "type": "string" }, { "name": "sequence", "type": [ "null", "string" ], "default": null }, { "name": "schema", "type": "string" }, { "name": "table", "type": "string" }, { "name": "txId", "type": [ "null", "long" ], "default": null }, { "name": "lsn", "type": [ "null", "long"
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542965347 ## CI report: * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN * d3756a68d846716a0ebfc6ae546249fe362e7d6f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17005) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
hudi-bot commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542964544 ## CI report: * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542935795 ## CI report: * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004) * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN * d3756a68d846716a0ebfc6ae546249fe362e7d6f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542935837 ## CI report: * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17007) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments
hudi-bot commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1542935815 ## CI report: * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17006) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542931783 ## CI report: * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments
hudi-bot commented on PR #8683: URL: https://github.com/apache/hudi/pull/8683#issuecomment-1542931756 ## CI report: * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542931724 ## CI report: * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004) * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs
hudi-bot commented on PR #8643: URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542927620 ## CI report: * d415e503be584d30b784eade8cd8a63e13f81457 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17001) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6200) Enhancements to the MDT for improving performance of larger indexes
[ https://issues.apache.org/jira/browse/HUDI-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6200: - Labels: pull-request-available (was: ) > Enhancements to the MDT for improving performance of larger indexes > --- > > Key: HUDI-6200 > URL: https://issues.apache.org/jira/browse/HUDI-6200 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] prashantwason opened a new pull request, #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
prashantwason opened a new pull request, #8684: URL: https://github.com/apache/hudi/pull/8684 [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes. ### Change Logs TBD ### Impact TBD ### Risk level (write none, low medium or high below) TBD ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6200) Enhancements to the MDT for improving performance of larger indexes
Prashant Wason created HUDI-6200: Summary: Enhancements to the MDT for improving performance of larger indexes Key: HUDI-6200 URL: https://issues.apache.org/jira/browse/HUDI-6200 Project: Apache Hudi Issue Type: Improvement Reporter: Prashant Wason Assignee: Prashant Wason -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] dineshbganesan commented on issue #8667: Table exception after enabling inline clustering
dineshbganesan commented on issue #8667: URL: https://github.com/apache/hudi/issues/8667#issuecomment-1542902319

@ad1happy2go I'm using an AWS Glue job to process the data, and Glue ships with Hudi version 0.12.1 by default. I am not sure how to use a patch with a Glue job. Can you clarify?

The logs show 3 different exceptions:
- org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :109
- org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
- org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://datalake-curatedtestng/ccbmod/cisadm/ci_ft/table_name=CI_FT/3a6a1bb9-0ba2-485c-8c13-a4020259ee5d-1_4-552-3668_20230506151911182.parquet

Can you help me understand the root cause of these exceptions?

Regards, Dinesh

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-5533) Table comments not showing up on spark-sql describe
[ https://issues.apache.org/jira/browse/HUDI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-5533: - Labels: pull-request-available (was: ) > Table comments not showing up on spark-sql describe > --- > > Key: HUDI-5533 > URL: https://issues.apache.org/jira/browse/HUDI-5533 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Jonathan Vexler >Priority: Minor > Labels: pull-request-available > > If you add a comment to the schema and write to a hudi table, the comment > will show as null when using spark-sql describe on the table. > > User reported issue [https://github.com/apache/hudi/issues/7531] with a very > good reproducible example. The issue presented when I tried the example. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] parisni opened a new pull request, #8683: [HUDI-5533] Support spark columns comments
parisni opened a new pull request, #8683: URL: https://github.com/apache/hudi/pull/8683 ### Change Logs Fixes #7531, i.e. shows comments within Spark schemas ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) None ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542885439 ## CI report: * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
the-other-tim-brown commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1190425359

## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java: ##
@@ -45,4 +47,9 @@ public interface Transformer {
    */
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {

Review Comment:
I don't think it makes sense for this to return an Option. All rows will have a schema of some sort, so this option would never be empty in practice.

## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java: ##
@@ -45,4 +47,9 @@ public interface Transformer {
    */
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
+return Option.empty();

Review Comment:
In my opinion, the default here should create an empty dataset with the `incomingSchema`, apply the transformer, and call `.schema()` on the resulting dataset to get the struct type, then convert that back to Avro. Another note: since transforms deal with Rows, does it make more sense to track the schema as a StructType?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
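Editor's sketch of the default suggested in the second review comment above: derive the output schema by running the transform on an empty input that carries the incoming schema, then read the schema off the result. This is only an illustration under stated assumptions. `Table`, `Transformer`, and `AddColumn` below are hypothetical stand-ins written for this note, not the Spark `Dataset` or the Hudi `Transformer` API, and the schema is modelled as a plain list of column names rather than an Avro schema or a StructType.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TransformedSchemaSketch {

    /** Hypothetical stand-in for a dataset: column names plus rows of values. */
    static final class Table {
        final List<String> columns;
        final List<List<Object>> rows;

        Table(List<String> columns, List<List<Object>> rows) {
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
            this.rows = Collections.unmodifiableList(new ArrayList<>(rows));
        }

        /** An input that has the given columns and no rows. */
        static Table empty(List<String> columns) {
            return new Table(columns, Collections.<List<Object>>emptyList());
        }
    }

    /** Hypothetical stand-in for the transformer interface under review. */
    interface Transformer {
        Table apply(Table input);

        /**
         * Default along the lines of the review suggestion: apply the
         * transform to an empty input and return the resulting columns.
         */
        default List<String> transformedSchema(List<String> incomingColumns) {
            return apply(Table.empty(incomingColumns)).columns;
        }
    }

    /** Example transformer (assumption for illustration): appends one constant column. */
    static final class AddColumn implements Transformer {
        private final String name;
        private final Object value;

        AddColumn(String name, Object value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public Table apply(Table input) {
            List<String> columns = new ArrayList<>(input.columns);
            columns.add(name);
            List<List<Object>> rows = new ArrayList<>();
            for (List<Object> row : input.rows) {
                List<Object> widened = new ArrayList<>(row);
                widened.add(value);
                rows.add(widened);
            }
            return new Table(columns, rows);
        }
    }

    public static void main(String[] args) {
        Transformer transformer = new AddColumn("source", "kafka");
        // No rows are processed; only the shape of the output is inspected.
        System.out.println(transformer.transformedSchema(Arrays.asList("id", "name")));
    }
}
```

With a default like this the method always yields a schema, which is the point of the first review comment about `Option`. The trade-off is that the transform has to behave sensibly, and cheaply, on an empty input.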
[GitHub] [hudi] Sam-Serpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException
Sam-Serpoosh commented on issue #8519: URL: https://github.com/apache/hudi/issues/8519#issuecomment-1542839452

> for the first issue regarding the schema, this is because we are fetching that schema as a string. If that class is not defined in the string, we won't know how it is defined. Maybe there is some arg to pass to the api to get the schemas that this schema relies on as well?

@the-other-tim-brown That makes perfect sense, and I ended up resolving that issue by simply using **Confluent Schema Registry** instead of the `Apicurio` registry I was previously using. Confluent's includes everything correctly in **one place**, so Hudi/DeltaStreamer can fetch it properly in one swoop.

> For the second, it is hard to tell without looking at your data. If you pull the data locally and step through, you may have a better shot of understanding. The main thing I have seen trip people up is the requirements for the delete records in the topic. You can also try out the same patch Sydney posted above for filtering out the tombstones in kafka.

I **highly** doubt that in my case it's caused by a tombstone record or the like, because I'm testing this data flow on a dummy/test Postgres table to which I've **only** applied `INSERT` operations so far. And BTW, I could **successfully** get a **vanilla Kafka ingestion** running end-to-end and populate a partitioned Hudi table as expected. So the issue is definitely specific to switching to `PostgresDebeziumSource` and `PostgresDebeziumAvroPayload`.

Thank you very much for your input. I'll try to find the best way to debug this and figure out what's causing the exception I shared above when it comes to DeltaStreamer <> Debezium ...

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542840841 ## CI report: * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
hudi-bot commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542839511 ## CI report: * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002) * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade
hudi-bot commented on PR #8682: URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542833998 ## CI report: * 5369ed017405d0484e5913d184a96fcd958a2a17 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
hudi-bot commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542832656 ## CI report: * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002) * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN * a95196cb0c749c1e1e8fb245a2a58d429159d519 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
hudi-bot commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542824483 ## CI report: * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002) * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519 It was 0.12.2 not 0.13, sorry for the confusion. The issue does not occur with 0.13 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table URL: https://github.com/apache/hudi/issues/8259 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
nsivabalan commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190389085 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
nsivabalan commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190108469 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. Review Comment: you started to call "batch" as old school ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. 
We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental
[GitHub] [hudi] kazdy commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
kazdy commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542814478 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6198) Support Spark 3.4.0
[ https://issues.apache.org/jira/browse/HUDI-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-6198:
Labels: pull-request-available (was: )
> Support Spark 3.4.0
> Key: HUDI-6198
> URL: https://issues.apache.org/jira/browse/HUDI-6198
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Shawn Chang
> Priority: Major
> Labels: pull-request-available
>
> Support Spark 3.4.0
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] mansipp opened a new pull request, #8682: [HUDI-6198] Spark 3.4.0 Upgrade
mansipp opened a new pull request, #8682: URL: https://github.com/apache/hudi/pull/8682
### Change Logs
Changes to support Spark 3.4.0
### Impact
Upgrade Spark to 3.4.0
### Risk level (write none, low medium or high below)
_If medium or high, explain what verification was done to mitigate the risks._
### Documentation Update
Need doc update
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-5820) Improve Azure and GH CI's maven build with cache (3.9+)
[ https://issues.apache.org/jira/browse/HUDI-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-5820:
Labels: pull-request-available (was: )
> Improve Azure and GH CI's maven build with cache (3.9+)
> Key: HUDI-5820
> URL: https://issues.apache.org/jira/browse/HUDI-5820
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Raymond Xu
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Refer to PR https://github.com/apache/hudi/pull/7935
> For Azure, we can try downloading and installing maven 3.9 and use the custom maven in the maven@4 task.
> For GH actions CI, more investigation needed
[GitHub] [hudi] kazdy closed pull request #7935: [HUDI-5820] Add maven-build-cache-extension
kazdy closed pull request #7935: [HUDI-5820] Add maven-build-cache-extension URL: https://github.com/apache/hudi/pull/7935
[GitHub] [hudi] kazdy commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X
kazdy commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1190332409 ## rfc/rfc-69/rfc-69.md: ## @@ -0,0 +1,159 @@ + +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for Hudi
[jira] [Updated] (HUDI-6199) CDC payload with op field for deletes do not work
[ https://issues.apache.org/jira/browse/HUDI-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-6199:
Description:
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom payload implementations such as the AWS DMS payload and the Debezium payload are not properly migrated to the new APIs introduced by RFC-46, causing the delete operation to fail. Our tests did not catch this.
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; however, custom CDC payloads use the op field to mark deletes.
Impact:
OverwriteWithLatest payload (also OverwriteNonDefaultsWithLatestAvroPayload): not affected.
Any other custom payloads (AWSDMSAvropayload, all Debezium payloads): deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no issues with custom payloads.
COW: deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.
MOR: any deletes will break if not using the "_hoodie_is_deleted" way.
Writer: all writers (Spark, Flink) except spark-sql.
DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.
was:
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom payload implementations such as the AWS DMS payload and the Debezium payload are not properly migrated to the new APIs introduced by RFC-46, causing the delete operation to fail. Our tests did not catch this.
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; however, custom CDC payloads use the op field to mark deletes.
Impact:
OverwriteWithLatest payload (also OverwriteNonDefaultsWithLatestAvroPayload): no issues.
Any other custom payloads (AWSDMSAvropayload, all Debezium payloads): deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no issues with custom payloads.
COW: deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.
MOR: any deletes will break if not using the "_hoodie_is_deleted" way.
Writer: all writers (Spark, Flink) except spark-sql.
DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.
> CDC payload with op field for deletes do not work
> Key: HUDI-6199
> URL: https://issues.apache.org/jira/browse/HUDI-6199
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.1
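The mismatch described in HUDI-6199 can be illustrated with a minimal Python sketch. This is not Hudi's payload code: the function names `marker_only_is_delete` and `is_delete`, the `op` field name, and the `"d"` delete value are hypothetical and chosen only for illustration; only the `_hoodie_is_deleted` marker comes from the issue text. The sketch assumes a record is a plain mapping and shows why a check that looks only at the marker field misses a CDC record that signals deletion through its op field.

```python
from typing import Any, Mapping, Optional

# Marker field named in the issue description.
HOODIE_IS_DELETED = "_hoodie_is_deleted"


def marker_only_is_delete(record: Mapping[str, Any]) -> bool:
    """Hypothetical check that recognises only the boolean marker field."""
    return record.get(HOODIE_IS_DELETED) is True


def is_delete(record: Mapping[str, Any],
              op_field: Optional[str] = None,
              delete_op: Optional[str] = None) -> bool:
    """Hypothetical check that also honours a caller-supplied op field.

    op_field and delete_op are assumptions for illustration; each CDC
    source defines its own field name and delete value.
    """
    if record.get(HOODIE_IS_DELETED) is True:
        return True
    if op_field is not None and delete_op is not None:
        return record.get(op_field) == delete_op
    return False


# Illustrative records, not real Hudi or CDC data.
cdc_delete = {"id": 1, "op": "d"}
flagged_delete = {"id": 2, HOODIE_IS_DELETED: True}
cdc_update = {"id": 3, "op": "u"}
```

With these definitions, `marker_only_is_delete(cdc_delete)` is false even though the record is a delete, while `is_delete(cdc_delete, op_field="op", delete_op="d")` is true; `flagged_delete` is recognised by both checks.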
[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false
hudi-bot commented on PR #7998: URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542777240 ## CI report: * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764) * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs
hudi-bot commented on PR #8643: URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542770867 ## CI report: * dc7c3bf6c199ff40a02058d1cd58a6853153b7eb Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17000) * d415e503be584d30b784eade8cd8a63e13f81457 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17001) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs
hudi-bot commented on PR #8643: URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542762628 ## CI report: * dc7c3bf6c199ff40a02058d1cd58a6853153b7eb Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17000) * d415e503be584d30b784eade8cd8a63e13f81457 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on issue #7057: [SUPPORT] [OCC] HoodieException: Error getting all file groups in pending clustering
nsivabalan commented on issue #7057: URL: https://github.com/apache/hudi/issues/7057#issuecomment-1542762155 Hey @KnightChess: sorry, we did not get to triage this. Are you still facing issues?
[jira] [Updated] (HUDI-6199) CDC payload with op field for deletes do not work
[ https://issues.apache.org/jira/browse/HUDI-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-6199:
Description:
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom payload implementations such as the AWS DMS payload and the Debezium payload are not properly migrated to the new APIs introduced by RFC-46, causing the delete operation to fail. Our tests did not catch this.
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; however, custom CDC payloads use the op field to mark deletes.
Impact:
OverwriteWithLatest payload (also OverwriteNonDefaultsWithLatestAvroPayload): no issues.
Any other custom payloads (AWSDMSAvropayload, all Debezium payloads): deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no issues with custom payloads.
COW: deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.
MOR: any deletes will break if not using the "_hoodie_is_deleted" way.
Writer: all writers (Spark, Flink) except spark-sql.
DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.
> CDC payload with op field for deletes do not work
> Key: HUDI-6199
> URL: https://issues.apache.org/jira/browse/HUDI-6199
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.1