[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8666:
URL: https://github.com/apache/hudi/pull/8666#issuecomment-1543364782

   
   ## CI report:
   
   * 66dd335158e2800359039917a9c1690450fa9c27 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16998)
 
   * bfa1909d85b174cb466d205ee4ec2af09be1850d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17015)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1543364711

   
   ## CI report:
   
   * adfb9e2726fb19e05c700ba7e67d080548d51a60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16959)
 
   * 9e5f2984e3f00bb24e1749922458072163a0df70 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17014)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8666:
URL: https://github.com/apache/hudi/pull/8666#issuecomment-1543359810

   
   ## CI report:
   
   * 66dd335158e2800359039917a9c1690450fa9c27 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16998)
 
   * bfa1909d85b174cb466d205ee4ec2af09be1850d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1543359740

   
   ## CI report:
   
   * adfb9e2726fb19e05c700ba7e67d080548d51a60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16959)
 
   * 9e5f2984e3f00bb24e1749922458072163a0df70 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] koldic commented on issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean

2023-05-10 Thread via GitHub


koldic commented on issue #7209:
URL: https://github.com/apache/hudi/issues/7209#issuecomment-1543356849

   Yes, sorry for the late answer. It seems that another DeltaStreamer was 
started by another of my instances with the same save path (the table was at 
the same path), so it wrote metadata from one DeltaStreamer into the other's 
table, and that was the problem.
   
   I am just curious: is there any way to run multiple DeltaStreamers against 
the same table? For example, if I want to run one for each Kafka topic?
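
   For what it's worth, Hudi documents a multi-writer mode based on optimistic 
concurrency control that targets exactly this scenario (several writers, e.g. 
one DeltaStreamer per Kafka topic, sharing one table path). Below is a minimal 
sketch of the extra write properties involved, assuming a ZooKeeper-based lock 
provider; the ZooKeeper host, port, lock key, and base path are placeholders, 
not values from this thread:
   
   ```java
   import org.apache.hudi.common.config.TypedProperties;
   
   public class MultiWriterLockProps {
     // Sketch of the extra properties each concurrent writer would need; these
     // would typically be passed to HoodieDeltaStreamer via --props or --hoodie-conf.
     public static TypedProperties multiWriterProps() {
       TypedProperties props = new TypedProperties();
       // Enable optimistic concurrency control so multiple writers can share one table.
       props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
       // Multi-writer mode requires lazy cleaning of failed writes.
       props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
       // A distributed lock provider is required; ZooKeeper is one supported option.
       props.setProperty("hoodie.write.lock.provider",
           "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
       props.setProperty("hoodie.write.lock.zookeeper.url", "zk-host");           // placeholder
       props.setProperty("hoodie.write.lock.zookeeper.port", "2181");             // placeholder
       props.setProperty("hoodie.write.lock.zookeeper.lock_key", "my_table");     // placeholder
       props.setProperty("hoodie.write.lock.zookeeper.base_path", "/hudi/locks"); // placeholder
       return props;
     }
   }
   ```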


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6201) Timeline server sometimes does not send bootstrap base path for a skeleton file

2023-05-10 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6201:
-

 Summary: Timeline server sometimes does not send bootstrap base 
path for a skeleton file
 Key: HUDI-6201
 URL: https://issues.apache.org/jira/browse/HUDI-6201
 Project: Apache Hudi
  Issue Type: Bug
  Components: bootstrap, timeline-server
Reporter: Jonathan Vexler
 Attachments: TestBootstrapRead.java

[^TestBootstrapRead.java] In the attached file, enable the timeline server 
('hoodie.embed.timeline.server'). The test will occasionally fail in metadata 
or mixed mode because, for some records, every column besides the metadata 
columns is null:
{code:java}
// Truncated df.show() output, summarized here for readability: for the
// affected rows, only the Hudi metadata columns (_hoodie_commit_time,
// _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path,
// _hoodie_file_name) are populated and every data column (_row_key, begin_lat,
// begin_lon, city_to_state, driver, fare, rider, tip_history, weight, ...) is
// null; one row, written at commit 20230510125841762 with record key
// b318c482-8e43-4614-bdab-80946d5a9f53, has all columns populated.
{code}
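
For context, a minimal sketch (not from the attached test) of how the cited 
config key can be toggled on a Spark datasource write while reproducing; the 
DataFrame and base path are placeholders, and the other required Hudi write 
options are omitted:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class TimelineServerToggle {
  // df and basePath are placeholders; the usual required Hudi write options
  // (record key, precombine field, table name, ...) are omitted for brevity.
  static void writeWithEmbeddedTimelineServer(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        // Config key cited above; flip to "false" to rule the timeline server out.
        .option("hoodie.embed.timeline.server", "true")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}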

[GitHub] [hudi] eyjian opened a new issue, #8686: [SUPPORT] is not a Parquet file (length is too low: 0)

2023-05-10 Thread via GitHub


eyjian opened a new issue, #8686:
URL: https://github.com/apache/hudi/issues/8686

   **Environment:**
   
   Hudi MOR + Spark
   
   **Hudi version :**
   
   0.12.1
   
   **Spark version :**
   
   spark3
   
   **Hadoop version:**
   
   2.8.5
   
   **Problem:**
   
   java.lang.RuntimeException: 
hdfs://test/warehouse/test.db/t_test/20220811/b4a54eb9-0e0c-406a-a2bd-c2f07b021277-0_8-10-991_20230421093519585.parquet
 is not a Parquet file (length is too low: 0)
   
   **SQL:**
   
   `insert into table_a_rt select * from table_b_rt;`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190632083


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190631429


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190631130


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630713


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630254


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630112


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-10 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190629741


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),

Review Comment:
   Fixed.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import 

[GitHub] [hudi] danny0405 commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190629034


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatibl

2023-05-10 Thread via GitHub


BruceKellan commented on issue #8685:
URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543321785

   That's my Flink DDL:
   
   ```sql
   create table sink_table (
 uniqueKey string,
 key string,
 `offset` bigint,
 `time` bigint,
 sid int,
 plat string,
 pid string,
 gid string,
 account string,
 playerid string,
 kafka_ts bigint,
 consume_ts bigint,
 prop map,
 `day` string,
 `type` string
   ) partitioned by (`day`, `type`) with (
 'connector' = 'hudi',
 'path' = 'x',
 'table.type' = 'COPY_ON_WRITE',
 'write.operation' = 'insert',
 'write.tasks' = '12',
 'write.task.max.size' = '1024',
 'hive_sync.partition_fields' = 'day,type',
 'hive_sync.partition_extractor_class' = 
'org.apache.hudi.hive.MultiPartKeysValueExtractor',
 'hadoop.io.compression.codec.zstd.level' = '3',
 'hadoop.parquet.compression.codec.zstd.level' = '3',
 'hoodie.datasource.write.recordkey.field' = 'uniqueKey',
 'hoodie.datasource.write.partitionpath.field' = 'day,type',
 'hoodie.datasource.write.hive_style_partitioning' = 'true',
 'hoodie.datasource.write.keygenerator.type' = 'COMPLEX',
 'hoodie.parquet.compression.codec' = 'zstd',
 'hoodie.parquet.dictionary.enabled' = 'true',
 'write.insert.cluster' = 'false',
 'clean.retain_commits' = '30',
 'clustering.schedule.enabled' = 'true',
 'clustering.delta_commits' = '8',
 'clustering.async.enabled' = 'true',
 'clustering.plan.strategy.target.file.max.bytes' = '104857600',
 'clustering.plan.strategy.small.file.limit' = '30',
 'clustering.plan.strategy.max.num.groups' = '1500',
 'hoodie.clustering.plan.strategy.single.group.clustering.enabled' = 
'false',
 'hoodie.archive.merge.enable' = 'true',
 'hoodie.archive.merge.small.file.limit.bytes' = '536870912',
 'metadata.enabled' = 'false'
   )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatibl

2023-05-10 Thread via GitHub


BruceKellan commented on issue #8685:
URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543319977

   BTW, I have merged #8587 into our branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8675:
URL: https://github.com/apache/hudi/pull/8675#issuecomment-1543320126

   
   ## CI report:
   
   * eb510e9ae89ac152e29375c44becfadd02506674 UNKNOWN
   * f1376bc500e0b86fe595dbfa207d5bd77b8ac5ab UNKNOWN
   * fd1a22a9c708bf20c05bc4455121a6ba108b5869 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16997)
 
   * 875f77fd40a5b93dd7fcfa628e981f575372d6e7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17013)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8685: [SUPPORT] flink hudi-0.13.0 append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible

2023-05-10 Thread via GitHub


danny0405 commented on issue #8685:
URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543316591

   Did you create the table using Spark? We did a compatibility fix for the 
Spark and Flink schemas recently: https://github.com/apache/hudi/pull/8587. 
Another speculation is the primary key: the primary key is not nullable in the 
Flink definition but is nullable for Spark (Spark does not have that constraint).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8675:
URL: https://github.com/apache/hudi/pull/8675#issuecomment-1543315375

   
   ## CI report:
   
   * eb510e9ae89ac152e29375c44becfadd02506674 UNKNOWN
   * f1376bc500e0b86fe595dbfa207d5bd77b8ac5ab UNKNOWN
   * fd1a22a9c708bf20c05bc4455121a6ba108b5869 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16997)
 
   * 875f77fd40a5b93dd7fcfa628e981f575372d6e7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8640:
URL: https://github.com/apache/hudi/pull/8640#discussion_r1190620668


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/SparkBootstrapCommitActionExecutor.java:
##
@@ -326,6 +326,8 @@ private Map>>> listAndPr
   if (!(selector instanceof FullRecordBootstrapModeSelector)) {
 FullRecordBootstrapModeSelector fullRecordBootstrapModeSelector = new 
FullRecordBootstrapModeSelector(config);
 result.putAll(fullRecordBootstrapModeSelector.select(folders));
+  } else {
+result.putAll(selector.select(folders));
   }

Review Comment:
   Can you elaborate a little on what we are fixing here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190620224


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 

Review Comment:
   Generally speaking, the current implementation of Hudi works as a micro-batch 
data lake, not a truly streaming lakehouse. Do we propose to build a truly 
streaming lakehouse with Hudi?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190618910


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] hudi-bot commented on pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8640:
URL: https://github.com/apache/hudi/pull/8640#issuecomment-1543309415

   
   ## CI report:
   
   * f1367ab0be48cb01834d3ada76bf2198c620d38c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16848)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17012)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190617763


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190616885


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] xushiyan commented on a diff in pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-10 Thread via GitHub


xushiyan commented on code in PR #8675:
URL: https://github.com/apache/hudi/pull/8675#discussion_r1190611852


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieMergedReadHandle.java:
##
@@ -38,113 +41,139 @@
 
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
+import org.apache.spark.api.java.JavaRDD;
 import org.junit.jupiter.params.ParameterizedTest;
-import org.junit.jupiter.params.provider.EnumSource;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Collections;
+import java.util.Comparator;
 import java.util.List;
-import java.util.Properties;
 import java.util.stream.Collectors;
-
-import static org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields;
-import static org.apache.hudi.avro.HoodieAvroUtils.createHoodieRecordFromAvro;
-import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.AVRO_SCHEMA;
-import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.model.HoodieTableType.COPY_ON_WRITE;
+import static org.apache.hudi.common.model.HoodieTableType.MERGE_ON_READ;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA_STR;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA_WITH_METAFIELDS;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getDeletes;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getInserts;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getKeyGenProps;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getPayloadProps;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.getUpdates;
 import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC;
+import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class TestHoodieMergedReadHandle extends 
SparkClientFunctionalTestHarness {
 
-  private HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator();
+  private static Stream<Arguments> avroPayloadClasses() {
+return Stream.of(
+Arguments.of(COPY_ON_WRITE, OverwriteWithLatestAvroPayload.class),
+Arguments.of(COPY_ON_WRITE, 
OverwriteNonDefaultsWithLatestAvroPayload.class),
+Arguments.of(COPY_ON_WRITE, PartialUpdateAvroPayload.class),
+Arguments.of(COPY_ON_WRITE, DefaultHoodieRecordPayload.class),
+Arguments.of(MERGE_ON_READ, OverwriteWithLatestAvroPayload.class),
+Arguments.of(MERGE_ON_READ, 
OverwriteNonDefaultsWithLatestAvroPayload.class),
+Arguments.of(MERGE_ON_READ, PartialUpdateAvroPayload.class),
+Arguments.of(MERGE_ON_READ, DefaultHoodieRecordPayload.class)

Review Comment:
   Purposely left out the AWSDmsPayload and Debezium payload cases 
https://github.com/apache/hudi/pull/8675/commits/875f77fd40a5b93dd7fcfa628e981f575372d6e7
   in order to land this patch. When doing the proper fix for payloads, those 
classes can be added here to cover the custom delete-marker scenario.
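
   As a hedged sketch only (not part of this patch), the omitted cases could 
later be appended to the same test class with another MethodSource; the payload 
class below is just one example of a payload with a custom delete marker, and 
the corresponding import is assumed:

```java
  // Hypothetical follow-up: widen the parameterization once payload handling is
  // fixed, e.g. with AWSDmsAvroPayload (custom delete marker via the "Op" field).
  private static Stream<Arguments> payloadClassesWithCustomDeleteMarkers() {
    return Stream.of(
        Arguments.of(COPY_ON_WRITE, AWSDmsAvroPayload.class),
        Arguments.of(MERGE_ON_READ, AWSDmsAvroPayload.class));
  }
```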



##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieMergedReadHandle.java:
##
@@ -38,113 +41,139 @@
 
 import org.apache.avro.generic.GenericRecord;
 import org.apache.avro.generic.IndexedRecord;
+import org.apache.spark.api.java.JavaRDD;
 import org.junit.jupiter.params.ParameterizedTest;
-import org.junit.jupiter.params.provider.EnumSource;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Collections;
+import java.util.Comparator;
 import java.util.List;
-import java.util.Properties;
 import java.util.stream.Collectors;
-
-import static org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields;
-import static org.apache.hudi.avro.HoodieAvroUtils.createHoodieRecordFromAvro;
-import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.AVRO_SCHEMA;
-import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.model.HoodieTableType.COPY_ON_WRITE;
+import static org.apache.hudi.common.model.HoodieTableType.MERGE_ON_READ;
+import static 
org.apache.hudi.common.testutils.HoodieAdaptablePayloadDataGenerator.SCHEMA;
+import static 

[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190608619


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606838


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606600


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.

Review Comment:
   Well, ingestion is completely incremental now across the industry. Once upon a 
time, it was unthinkable. :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190606252


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] danny0405 closed issue #8499: [SUPPORT] Support partial insert in merge into command

2023-05-10 Thread via GitHub


danny0405 closed issue #8499: [SUPPORT] Support partial insert in merge into 
command
URL: https://github.com/apache/hudi/issues/8499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (65172d3d66a -> 9963b50ee17)

2023-05-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 65172d3d66a [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 
(#8082)
 add 9963b50ee17 [HUDI-6105] Support partial insert in MERGE INTO command 
(#8597)

No new revisions were added by this update.

Summary of changes:
 .../hudi/command/MergeIntoHoodieTableCommand.scala | 38 +++--
 .../apache/spark/sql/hudi/TestMergeIntoTable.scala | 39 ++
 2 files changed, 66 insertions(+), 11 deletions(-)
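
For context, here is a minimal sketch of the kind of statement this feature 
enables, assuming the documented HoodieSparkSessionExtension setup; the table 
and column names are hypothetical, and the exact semantics for columns omitted 
from the INSERT list follow the PR rather than being verified here:

```java
import org.apache.spark.sql.SparkSession;

public class PartialInsertMergeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-merge-into-partial-insert")
        .master("local[2]")
        // Hudi's Spark session extension is required for MERGE INTO on Hudi tables.
        .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // Hypothetical Hudi target table with columns (id, name, price, ts).
    // The NOT MATCHED branch lists only a subset of the target columns.
    spark.sql(
        "MERGE INTO hudi_target t USING source_view s ON t.id = s.id "
            + "WHEN MATCHED THEN UPDATE SET t.price = s.price, t.ts = s.ts "
            + "WHEN NOT MATCHED THEN INSERT (id, price, ts) VALUES (s.id, s.price, s.ts)");

    spark.stop();
  }
}
```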



[GitHub] [hudi] danny0405 merged pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command

2023-05-10 Thread via GitHub


danny0405 merged PR #8597:
URL: https://github.com/apache/hudi/pull/8597


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190605292


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] danny0405 commented on pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command

2023-05-10 Thread via GitHub


danny0405 commented on PR #8597:
URL: https://github.com/apache/hudi/pull/8597#issuecomment-1543289530

   The failed test: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=16733=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7=10927
 `testArchivalWithMultiWriters` is flaky and cannot be reproduced locally; I 
will merge it soon~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604359


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-to-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.

Review Comment:
   +1 . 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604259


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-to-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. That said, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] vinothchandar commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


vinothchandar commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190604067


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-to-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. That said, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model 

[GitHub] [hudi] weimingdiit commented on pull request #8640: [HUDI-6107] Fix java.lang.IllegalArgumentException for bootstrap

2023-05-10 Thread via GitHub


weimingdiit commented on PR #8640:
URL: https://github.com/apache/hudi/pull/8640#issuecomment-1543284008

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-10 Thread via GitHub


nsivabalan commented on code in PR #8675:
URL: https://github.com/apache/hudi/pull/8675#discussion_r1190600637


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieAdaptablePayloadDataGenerator.java:
##
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.AWSDmsAvroPayload;
+import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.MetadataValues;
+import org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.common.model.PartialUpdateAvroPayload;
+import org.apache.hudi.common.model.debezium.DebeziumConstants;
+import org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload;
+import org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+import static 
org.apache.hudi.common.model.HoodieRecord.HOODIE_IS_DELETED_FIELD;
+import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
+
+public class HoodieAdaptablePayloadDataGenerator {
+
+  public static final Schema SCHEMA = 
SchemaTestUtil.getSchemaFromResource(HoodieAdaptablePayloadDataGenerator.class, 
"/adaptable-payload.avsc");
+  public static final Schema SCHEMA_WITH_METAFIELDS = 
HoodieAvroUtils.addMetadataFields(SCHEMA, false);
+  public static final String SCHEMA_STR = SCHEMA.toString();
+
+  public static Properties getKeyGenProps(Class payloadClass) {
+String orderingField = new RecordGen(payloadClass).getOrderingField();
+Properties props = new Properties();
+props.put("hoodie.datasource.write.recordkey.field", "id");
+props.put("hoodie.datasource.write.partitionpath.field", "pt");
+props.put("hoodie.datasource.write.precombine.field", orderingField);
+props.put(HoodieTableConfig.RECORDKEY_FIELDS.key(), "id");
+props.put(HoodieTableConfig.PARTITION_FIELDS.key(), "pt");
+props.put(HoodieTableConfig.PRECOMBINE_FIELD.key(), orderingField);
+return props;
+  }
+
+  public static Properties getPayloadProps(Class payloadClass) {
+String orderingField = new RecordGen(payloadClass).getOrderingField();
+Properties props = new Properties();
+props.put("hoodie.compaction.payload.class", payloadClass.getName());
+props.put("hoodie.payload.event.time.field", orderingField);
+props.put("hoodie.payload.ordering.field", orderingField);
+return props;
+  }
+
+  public static List getInserts(int n, String partition, long 
ts, Class payloadClass) throws IOException {
+return getInserts(n, new String[] {partition}, ts, payloadClass);
+  }
+
+  public static List getInserts(int n, String[] partitions, long 
ts, Class payloadClass) throws IOException {
+List inserts = new ArrayList<>();
+RecordGen recordGen = new RecordGen(payloadClass);
+for (GenericRecord r : getInserts(n, partitions, ts, recordGen)) {
+  inserts.add(getHoodieRecord(r, recordGen.getPayloadClass()));
+}
+return inserts;
+  }
+
+  private static List getInserts(int n, String[] partitions, 
long ts, RecordGen recordGen) {
+return IntStream.range(0, n).mapToObj(id -> {
+  String pt = partitions.length == 0 ? "" : partitions[id % 
partitions.length];
+  return getInsert(id, pt, ts, recordGen);
+}).collect(Collectors.toList());
+  }
+
+  private static GenericRecord getInsert(int id, String pt, long ts, RecordGen 
recordGen) {
+GenericRecord r = new GenericData.Record(SCHEMA);
+

[GitHub] [hudi] BruceKellan commented on issue #8685: [SUPPORT] Flink append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible types: req

2023-05-10 Thread via GitHub


BruceKellan commented on issue #8685:
URL: https://github.com/apache/hudi/issues/8685#issuecomment-1543272627

   @danny0405  can you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BruceKellan opened a new issue, #8685: [SUPPORT] Flink append+clustering mode, clustering will occur The requested schema is not compatible with the file schema. incompatible types: re

2023-05-10 Thread via GitHub


BruceKellan opened a new issue, #8685:
URL: https://github.com/apache/hudi/issues/8685

   **To Reproduce**
   
   I am using Flink 1.13.6 and Hudi 0.13.0. When the async clustering job is 
scheduled, it throws the following exception:
   
   ```java
   2023-05-11 11:06:35,604 ERROR 
org.apache.hudi.sink.clustering.ClusteringOperator   [] - Executor 
executes action [Execute clustering for instant 2023050411858 from task 1] 
error
   org.apache.hudi.exception.HoodieException: unable to read next record from 
parquet file 
at 
org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:53)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.common.util.MappingIterator.hasNext(MappingIterator.java:35) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.common.util.MappingIterator.hasNext(MappingIterator.java:35) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811) 
~[?:1.8.0_332]
at 
java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:295)
 ~[?:1.8.0_332]
at 
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:207)
 ~[?:1.8.0_332]
at 
java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:162)
 ~[?:1.8.0_332]
at 
java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:301)
 ~[?:1.8.0_332]
at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681) 
~[?:1.8.0_332]
at 
org.apache.hudi.client.utils.ConcatenatingIterator.hasNext(ConcatenatingIterator.java:45)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.sink.clustering.ClusteringOperator.doClustering(ClusteringOperator.java:261)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.sink.clustering.ClusteringOperator.lambda$processElement$0(ClusteringOperator.java:194)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_332]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_332]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
   Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file 
oss://xxx/hudi/datalog/today/db/table/day=2023-05-11/type=aa/7b1a5921-1d37-435e-b74a-0c0a5356b7bc-20_5-8-0_20230511105641808.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:48)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
... 15 more
   Caused by: org.apache.parquet.io.ParquetDecodingException: The requested 
schema is not compatible with the file schema. incompatible types: required 
binary key (STRING) != optional binary key (STRING)
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:69)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at org.apache.parquet.schema.GroupType.accept(GroupType.java:256) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:69)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at org.apache.parquet.schema.GroupType.accept(GroupType.java:256) 
~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:83)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at 
org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
 ~[hudi-flink1.13-bundle-0.13.0.jar:0.13.0]
at org.apache.parquet.schema.MessageType.accept(MessageType.java:55) 

[GitHub] [hudi] danny0405 commented on pull request #8657: [HUDI-6150] Support bucketing for each hive client

2023-05-10 Thread via GitHub


danny0405 commented on PR #8657:
URL: https://github.com/apache/hudi/pull/8657#issuecomment-1543269796

   > Hardcoding Murmur is likely a good idea
   
   Not hardcoding; I mean to make it configurable, so the user can choose the 
algorithm they desire to use.
   
   > it would allow to support both spark 2 and all spark 3 releases.
   
   We can dig into that further, but for now I would rather keep it simple.
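   
   To illustrate, here is a minimal sketch (a hedged example, not Hudi's actual 
API) of what configurable bucket-hash selection could look like; the config 
key, enum values and class names are assumptions made up for illustration:
   
   ```java
import java.util.Properties;
import java.util.function.ToIntFunction;
import java.util.zip.CRC32;

public class BucketHashingSketch {

  // Hypothetical set of selectable algorithms.
  enum BucketHashAlgorithm { JVM_DEFAULT, CRC32_CHECKSUM }

  // Resolve the hashing function from configuration instead of hardcoding it.
  static ToIntFunction<byte[]> resolveHasher(Properties props) {
    BucketHashAlgorithm algo = BucketHashAlgorithm.valueOf(
        props.getProperty("bucket.index.hash.algorithm", "JVM_DEFAULT")); // assumed key
    switch (algo) {
      case CRC32_CHECKSUM:
        return key -> {
          CRC32 crc = new CRC32();
          crc.update(key);
          return (int) crc.getValue();
        };
      case JVM_DEFAULT:
      default:
        return java.util.Arrays::hashCode; // stand-in for e.g. a Murmur implementation
    }
  }

  // Map a record key to one of numBuckets buckets.
  static int bucketFor(byte[] key, int numBuckets, ToIntFunction<byte[]> hasher) {
    return Math.floorMod(hasher.applyAsInt(key), numBuckets);
  }
}
   ```
   
   A caller would then pick the algorithm per table via the assumed config key, 
and the same resolved hasher must be used on every write path so that records 
keep landing in the same buckets.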
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8682:
URL: https://github.com/apache/hudi/pull/8682#discussion_r1190589930


##
.github/workflows/bot.yml:
##
@@ -27,20 +27,8 @@ jobs:
 strategy:
   matrix:
 include:
-  - scalaProfile: "scala-2.11"
-sparkProfile: "spark2.4"
-
-  - scalaProfile: "scala-2.12"
-sparkProfile: "spark2.4"
-
-  - scalaProfile: "scala-2.12"
-sparkProfile: "spark3.1"
-
-  - scalaProfile: "scala-2.12"
-sparkProfile: "spark3.2"
-
   - scalaProfile: "scala-2.12"
-sparkProfile: "spark3.3"
+sparkProfile: "spark3.4"

Review Comment:
   Why do we have so many changes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5868) Upgrade Spark to 3.3.2

2023-05-10 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5868.

Resolution: Fixed

Fixed via master branch: 65172d3d66ad299f31b0f21ba004024236006634

> Upgrade Spark to 3.3.2
> --
>
> Key: HUDI-5868
> URL: https://issues.apache.org/jira/browse/HUDI-5868
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Rahil Chertara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5868) Upgrade Spark to 3.3.2

2023-05-10 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-5868:
-
Fix Version/s: 0.13.1
   0.14.0

> Upgrade Spark to 3.3.2
> --
>
> Key: HUDI-5868
> URL: https://issues.apache.org/jira/browse/HUDI-5868
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Rahil Chertara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (5bab66498c8 -> 65172d3d66a)

2023-05-10 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 5bab66498c8 [HUDI-6122] Unify call procedure options (#8537)
 add 65172d3d66a [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2 
(#8082)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/io/storage/HoodieSparkFileReaderFactory.java| 2 ++
 .../src/main/scala/org/apache/hudi/HoodieSparkUtils.scala   | 1 +
 .../src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala   | 5 -
 .../datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala| 6 +-
 pom.xml | 2 ++
 5 files changed, 14 insertions(+), 2 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #8082: [HUDI-5868] Make hudi-spark compatible against Spark 3.3.2

2023-05-10 Thread via GitHub


danny0405 merged PR #8082:
URL: https://github.com/apache/hudi/pull/8082


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] houhang1005 closed pull request #8665: [HUDI-6190] Fix the default value of RECORD_KEY_FIELD.

2023-05-10 Thread via GitHub


houhang1005 closed pull request #8665: [HUDI-6190] Fix the default value of 
RECORD_KEY_FIELD.
URL: https://github.com/apache/hudi/pull/8665


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8659: [HUDI-6155] Fix cleaner based on hours for earliest commit to retain

2023-05-10 Thread via GitHub


danny0405 commented on code in PR #8659:
URL: https://github.com/apache/hudi/pull/8659#discussion_r1190585368


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -144,4 +144,10 @@ public static boolean isValidInstantTime(String 
instantTime) {
   return false;
 }
   }
+
+  private static ZoneId getZoneId() {
+return commitTimeZone.equals(HoodieTimelineTimeZone.LOCAL)
+? ZoneId.systemDefault()

Review Comment:
   See the discussion in: https://github.com/apache/hudi/pull/8631
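   
   For context, here is a minimal sketch (illustrative only; the class and 
method names are assumptions, not Hudi's actual API) of generating an instant 
time in either the local zone or UTC, mirroring the ternary above:
   
   ```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class InstantTimeSketch {
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS"); // assumed instant-time pattern

  // LOCAL -> the writer's system zone, otherwise UTC.
  static String newInstantTime(boolean useLocalZone) {
    ZoneId zone = useLocalZone ? ZoneId.systemDefault() : ZoneId.of("UTC");
    return LocalDateTime.now(zone).format(FORMAT);
  }
}
   ```
   
   Any retention check based on wall-clock hours generally has to parse these 
strings back with the same zone choice, otherwise the cutoff can drift by the 
zone offset.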



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jenu9417 commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

2023-05-10 Thread via GitHub


jenu9417 commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1543243868

   @nsivabalan / @HEPBO3AH  I will check whether this issue is fixed in the new 
version.
   
   I also want to understand the correlation between the various types of API 
calls (specifically LIST and HEAD) and writes to a single partition: for each 
write to one partition, how many GET, HEAD, PUT and LIST operations happen? 
This will help us estimate costs for the project effectively.
   
   Can you please provide some insights here? Or any corresponding 
documentation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #8597: [HUDI-6105] Support partial insert in MERGE INTO command

2023-05-10 Thread via GitHub


beyond1920 commented on code in PR #8597:
URL: https://github.com/apache/hudi/pull/8597#discussion_r1190574858


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -467,11 +471,19 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
   case None =>
 // In case partial assignments are allowed and there's no 
corresponding conditional assignment,
 // create a self-assignment for the target table's attribute
-if (allowPartialAssignments) {
-  Assignment(attr, attr)
-} else {
-  throw new AnalysisException(s"Assignment expressions have to 
assign every attribute of target table " +
-s"(provided: `${assignments.map(_.sql).mkString(",")}`")
+partialAssigmentMode match {
+  case Some(mode) =>
+mode match {
+  case PartialAssignmentMode.NULL_VALUE =>
+Assignment(attr, Literal(null))
+  case PartialAssignmentMode.ORIGINAL_VALUE =>
+Assignment(attr, attr)
+  case PartialAssignmentMode.DEFAULT_VALUE =>
+Assignment(attr, Literal.default(attr.dataType))
+}
+  case _ =>
+throw new AnalysisException(s"Assignment expressions have to 
assign every attribute of target table " +

Review Comment:
   `Delete` would not hit this branch.
   
![image](https://github.com/apache/hudi/assets/1525333/8917cb76-fc4d-4077-9a17-39f7e6d89a8a)
   
![image](https://github.com/apache/hudi/assets/1525333/b642f4ef-fc5a-4dd1-a87a-dc27553f886c)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6122] Unify call procedure options (#8537)

2023-05-10 Thread biyan
This is an automated email from the ASF dual-hosted git repository.

biyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 5bab66498c8 [HUDI-6122] Unify call procedure options (#8537)
5bab66498c8 is described below

commit 5bab66498c857c32e042488db72a018b83ea926a
Author: Zouxxyy 
AuthorDate: Thu May 11 09:55:32 2023 +0800

[HUDI-6122] Unify call procedure options (#8537)
---
 .../scala/org/apache/hudi/HoodieCLIUtils.scala |   8 +
 .../procedures/ArchiveCommitsProcedure.scala   |   4 +-
 .../hudi/command/procedures/BaseProcedure.scala|  48 ++--
 .../procedures/CommitsCompareProcedure.scala   |   4 +-
 .../command/procedures/CopyToTableProcedure.scala  |   4 +-
 .../hudi/command/procedures/CopyToTempView.scala   |   4 +-
 .../procedures/CreateMetadataTableProcedure.scala  |   2 +-
 .../procedures/CreateSavepointProcedure.scala  |   6 +-
 .../command/procedures/DeleteMarkerProcedure.scala |   4 +-
 .../procedures/DeleteMetadataTableProcedure.scala  |   2 +-
 .../procedures/DeleteSavepointProcedure.scala  |   6 +-
 .../procedures/ExportInstantsProcedure.scala   |   4 +-
 .../procedures/HdfsParquetImportProcedure.scala|  14 +-
 .../hudi/command/procedures/HelpProcedure.scala|   4 +-
 .../command/procedures/HiveSyncProcedure.scala |   2 +-
 .../procedures/InitMetadataTableProcedure.scala|   2 +-
 .../command/procedures/ProcedureParameter.scala|   7 +-
 .../RepairAddpartitionmetaProcedure.scala  |   2 +-
 .../RepairCorruptedCleanFilesProcedure.scala   |   2 +-
 .../procedures/RepairDeduplicateProcedure.scala|   6 +-
 .../RepairMigratePartitionMetaProcedure.scala  |   2 +-
 .../RepairOverwriteHoodiePropsProcedure.scala  |   4 +-
 .../RollbackToInstantTimeProcedure.scala   |   4 +-
 .../procedures/RollbackToSavepointProcedure.scala  |   6 +-
 .../command/procedures/RunBootstrapProcedure.scala |  10 +-
 .../command/procedures/RunCleanProcedure.scala |  48 ++--
 .../procedures/RunClusteringProcedure.scala|  34 ++-
 .../procedures/RunCompactionProcedure.scala|  15 +-
 .../procedures/ShowArchivedCommitsProcedure.scala  |   2 +-
 .../procedures/ShowBootstrapMappingProcedure.scala |   2 +-
 .../ShowBootstrapPartitionsProcedure.scala |   2 +-
 .../procedures/ShowClusteringProcedure.scala   |   4 +-
 .../ShowCommitExtraMetadataProcedure.scala |   6 +-
 .../procedures/ShowCommitFilesProcedure.scala  |   4 +-
 .../procedures/ShowCommitPartitionsProcedure.scala |   4 +-
 .../procedures/ShowCommitWriteStatsProcedure.scala |   4 +-
 .../command/procedures/ShowCommitsProcedure.scala  |   2 +-
 .../procedures/ShowCompactionProcedure.scala   |   4 +-
 .../procedures/ShowFileSystemViewProcedure.scala   |   6 +-
 .../procedures/ShowFsPathDetailProcedure.scala |   2 +-
 .../ShowHoodieLogFileMetadataProcedure.scala   |   4 +-
 .../ShowHoodieLogFileRecordsProcedure.scala|   4 +-
 .../procedures/ShowInvalidParquetProcedure.scala   |   2 +-
 .../ShowMetadataTableFilesProcedure.scala  |   2 +-
 .../ShowMetadataTablePartitionsProcedure.scala |   2 +-
 .../ShowMetadataTableStatsProcedure.scala  |   2 +-
 .../procedures/ShowRollbacksProcedure.scala|   6 +-
 .../procedures/ShowSavepointsProcedure.scala   |   4 +-
 .../procedures/ShowTablePropertiesProcedure.scala  |   4 +-
 .../procedures/StatsFileSizeProcedure.scala|   2 +-
 .../StatsWriteAmplificationProcedure.scala |   2 +-
 .../procedures/UpgradeOrDowngradeProcedure.scala   |   4 +-
 .../procedures/ValidateHoodieSyncProcedure.scala   |  10 +-
 .../ValidateMetadataTableFilesProcedure.scala  |   2 +-
 .../sql/hudi/procedure/TestCleanProcedure.scala| 269 ++---
 .../hudi/procedure/TestCompactionProcedure.scala   |  45 
 56 files changed, 403 insertions(+), 261 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
index 5f0cba6fd7c..c9f5a8a1215 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCLIUtils.scala
@@ -22,6 +22,7 @@ package org.apache.hudi
 import org.apache.hudi.avro.model.HoodieClusteringGroup
 import org.apache.hudi.client.SparkRDDWriteClient
 import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.StringUtils
 import org.apache.spark.SparkException
 import org.apache.spark.api.java.JavaSparkContext
 import org.apache.spark.sql.SparkSession
@@ -90,4 +91,11 @@ object HoodieCLIUtils {
 throw new SparkException(s"Unsupported identifier $table")
 }
   }
+
+  def 

[GitHub] [hudi] YannByron merged pull request #8537: [HUDI-6122] Unify options in call procedure

2023-05-10 Thread via GitHub


YannByron merged PR #8537:
URL: https://github.com/apache/hudi/pull/8537


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BruceKellan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


BruceKellan commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190542924


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-to-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. That said, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8683:
URL: https://github.com/apache/hudi/pull/8683#issuecomment-1543040480

   
   ## CI report:
   
   * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17006)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8684:
URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542977709

   
   ## CI report:
   
   * e2785f4675ddf74582ff34590608a5d71c5e9a2d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17008)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8684:
URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542969346

   
   ## CI report:
   
   * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17007)
 
   * e2785f4675ddf74582ff34590608a5d71c5e9a2d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] samserpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-10 Thread via GitHub


samserpoosh commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1542967885

   @the-other-tim-brown There's a good chance that this is caused by the 
**input Kafka topic's events** and how they're serialized/deserialized (i.e. 
the way Debezium Connector is currently shaping and publishing the change-log 
messages to Kafka).
   
   I leveraged the `kafka-avro-console-consumer` that comes with Confluent's 
Schema Registry; here's what my dummy/test table's change-log events look 
like:
   
   ```json
   {
 "before": null,
 "after": {
   "..samser_customers.Value": {
 "id": 1,
 "name": "Bob",
 "age": 40,
 "created_at": {
   "long": 1683661733071814
 },
 "event_ts": {
   "long": 168198480
 }
   }
 },
 "source": {
   "version": "2.1.2.Final",
   "connector": "postgresql",
   "name": "pg_dev8",
   "ts_ms": 1683734195621,
   "snapshot": {
 "string": "first_in_data_collection"
   },
   "db": "",
   "sequence": {
 "string": "[null,\"1213462492184\"]"
   },
   "schema": "public",
   "table": "samser_customers",
   "txId": {
 "long": 806227
   },
   "lsn": {
 "long": 1213462492184
   },
   "xmin": null
 },
 "op": "r",
 "ts_ms": {
   "long": 1683734196050
 },
 "transaction": null
   }
   ```
   
   And here's the corresponding schema which was established in the Schema 
Registry:
   
   ```json
   {
 "type": "record",
 "name": "Envelope",
 "namespace": "..samser_customers",
 "fields": [
   {
 "name": "before",
 "type": [
   "null",
   {
 "type": "record",
 "name": "Value",
 "fields": [
   {
 "name": "id",
 "type": {
   "type": "int",
   "connect.default": 0
 },
 "default": 0
   },
   {
 "name": "name",
 "type": "string"
   },
   {
 "name": "age",
 "type": "int"
   },
   {
 "name": "created_at",
 "type": [
   "null",
   {
 "type": "long",
 "connect.version": 1,
 "connect.name": "io.debezium.time.MicroTimestamp"
   }
 ],
 "default": null
   },
   {
 "name": "event_ts",
 "type": [
   "null",
   "long"
 ],
 "default": null
   }
 ],
 "connect.name": 
"..samser_customers.Value"
   }
 ],
 "default": null
   },
   {
 "name": "after",
 "type": [
   "null",
   "Value"
 ],
 "default": null
   },
   {
 "name": "source",
 "type": {
   "type": "record",
   "name": "Source",
   "namespace": "io.debezium.connector.postgresql",
   "fields": [
 {
   "name": "version",
   "type": "string"
 },
 {
   "name": "connector",
   "type": "string"
 },
 {
   "name": "name",
   "type": "string"
 },
 {
   "name": "ts_ms",
   "type": "long"
 },
 {
   "name": "snapshot",
   "type": [
 {
   "type": "string",
   "connect.version": 1,
   "connect.parameters": {
 "allowed": "true,last,false,incremental"
   },
   "connect.default": "false",
   "connect.name": "io.debezium.data.Enum"
 },
 "null"
   ],
   "default": "false"
 },
 {
   "name": "db",
   "type": "string"
 },
 {
   "name": "sequence",
   "type": [
 "null",
 "string"
   ],
   "default": null
 },
 {
   "name": "schema",
   "type": "string"
 },
 {
   "name": "table",
   "type": "string"
 },
 {
   "name": "txId",
   "type": [
 "null",
 "long"
   ],
   "default": null
 },
 {
   "name": "lsn",
   "type": [
 "null",
 "long"
  

[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542965347

   
   ## CI report:
   
   * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN
   * d3756a68d846716a0ebfc6ae546249fe362e7d6f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17005)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542964544

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542935795

   
   ## CI report:
   
   * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004)
 
   * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN
   * d3756a68d846716a0ebfc6ae546249fe362e7d6f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8684:
URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542935837

   
   ## CI report:
   
   * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17007)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8683:
URL: https://github.com/apache/hudi/pull/8683#issuecomment-1542935815

   
   ## CI report:
   
   * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17006)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8684:
URL: https://github.com/apache/hudi/pull/8684#issuecomment-1542931783

   
   ## CI report:
   
   * 35148aeb4ba78eb6f3316c75f8a0a7e4c6d6df87 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8683: [HUDI-5533] Support spark columns comments

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8683:
URL: https://github.com/apache/hudi/pull/8683#issuecomment-1542931756

   
   ## CI report:
   
   * 7bdb94998ee2853e15de0b4ce6c20735f43a0f5c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542931724

   
   ## CI report:
   
   * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004)
 
   * c23f6ed02a81dfac0d218cee75d18fee3a9b31df UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8643:
URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542927620

   
   ## CI report:
   
   * d415e503be584d30b784eade8cd8a63e13f81457 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17001)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6200) Enhancements to the MDT for improving performance of larger indexes

2023-05-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6200:
-
Labels: pull-request-available  (was: )

> Enhancements to the MDT for improving performance of larger indexes
> ---
>
> Key: HUDI-6200
> URL: https://issues.apache.org/jira/browse/HUDI-6200
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] prashantwason opened a new pull request, #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-10 Thread via GitHub


prashantwason opened a new pull request, #8684:
URL: https://github.com/apache/hudi/pull/8684

   [HUDI-6200] Enhancements to the MDT for improving performance of larger 
indexes.
   
   ### Change Logs
   
   TBD
   
   ### Impact
   
   TBD
   
   ### Risk level (write none, low medium or high below)
   
   TBD
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6200) Enhancements to the MDT for improving performance of larger indexes

2023-05-10 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-6200:


 Summary: Enhancements to the MDT for improving performance of 
larger indexes
 Key: HUDI-6200
 URL: https://issues.apache.org/jira/browse/HUDI-6200
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] dineshbganesan commented on issue #8667: Table exception after enabling inline clustering

2023-05-10 Thread via GitHub


dineshbganesan commented on issue #8667:
URL: https://github.com/apache/hudi/issues/8667#issuecomment-1542902319

   @ad1happy2go I'm using an AWS Glue job to process the data, which comes with the default version 0.12.1. I am not sure how to use a patch with a Glue job. Can you clarify?
   
   The logs show 3 different exceptions:
   
   - org.apache.hudi.exception.HoodieUpsertException: Error upserting 
bucketType UPDATE for partition :109
   
   - org.apache.hudi.exception.HoodieException: unable to read next record from 
parquet file
   
   - org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
s3://datalake-curatedtestng/ccbmod/cisadm/ci_ft/table_name=CI_FT/3a6a1bb9-0ba2-485c-8c13-a4020259ee5d-1_4-552-3668_20230506151911182.parquet
   
   Can you help me understand the root cause of these exceptions?
   
   Regards,
   Dinesh


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5533) Table comments not showing up on spark-sql describe

2023-05-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5533:
-
Labels: pull-request-available  (was: )

> Table comments not showing up on spark-sql describe
> ---
>
> Key: HUDI-5533
> URL: https://issues.apache.org/jira/browse/HUDI-5533
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Priority: Minor
>  Labels: pull-request-available
>
> If you add a comment to the schema and write to a hudi table, the comment 
> will show as null when using spark-sql describe on the table.
>  
> User reported issue [https://github.com/apache/hudi/issues/7531] with a very 
> good reproducible example. The issue presented when I tried the example.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] parisni opened a new pull request, #8683: [HUDI-5533] Support spark columns comments

2023-05-10 Thread via GitHub


parisni opened a new pull request, #8683:
URL: https://github.com/apache/hudi/pull/8683

   ### Change Logs
   
   Fixes #7531, i.e., show comments within Spark schemas
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542885439

   
   ## CI report:
   
   * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-10 Thread via GitHub


the-other-tim-brown commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1190425359


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {

Review Comment:
   I don't think it makes sense for this to return an Option. All rows will have a schema of some sort, so this Option would never be empty in practice.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
+    return Option.empty();

Review Comment:
   In my opinion, the default here should create an empty dataset with the `incomingSchema`, apply the transformer, call `.schema()` on the resulting dataset to get the struct type, and convert that back to Avro.
   
   Another note: since transformers deal with Rows, does it make more sense to track the schema as a StructType?
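   
   A minimal sketch of that suggested default, assuming the `AvroConversionUtils` Avro/StructType converters available in the Hudi Spark modules (the class location and exact signatures should be verified; the record name/namespace below are arbitrary placeholders):
   
```java
import java.util.Collections;

import org.apache.avro.Schema;
import org.apache.hudi.AvroConversionUtils;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

// Sketch only, not the actual Hudi implementation: derive the transformed schema by
// running the transformer over an empty Dataset<Row> built from the incoming Avro schema.
public class TransformedSchemaSketch {

  public static Option<Schema> transformedSchema(Transformer transformer, JavaSparkContext jsc,
      SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
    // Build an empty frame that carries only the incoming schema.
    StructType incomingStructType = AvroConversionUtils.convertAvroSchemaToStructType(incomingSchema);
    Dataset<Row> emptyInput = sparkSession.createDataFrame(Collections.<Row>emptyList(), incomingStructType);
    // Run the transformer over the empty frame and read back the schema it produces.
    StructType transformedStructType = transformer.apply(jsc, sparkSession, emptyInput, properties).schema();
    // Convert back to Avro; "transformed_record" / "hoodie.sketch" are illustrative names only.
    return Option.of(AvroConversionUtils.convertStructTypeToAvroSchema(
        transformedStructType, "transformed_record", "hoodie.sketch"));
  }
}
```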



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Sam-Serpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-10 Thread via GitHub


Sam-Serpoosh commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1542839452

   > for the first issue regarding the schema, this is because we are fetching 
that schema as a string. If that class is not defined in the string, we won't 
know how it is defined. Maybe there is some arg to pass to the api to get the 
schemas that this schema relies on as well?
   
   @the-other-tim-brown That makes perfect sense, and I ended up resolving that issue by simply using **Confluent Schema Registry** instead of the `Apicurio` registry I was previously using, since Confluent's registry keeps everything in **one place** and Hudi/DeltaStreamer can fetch it in one swoop.
   
   > For the second, it is hard to tell without looking at your data. If you 
pull the data locally and step through, you may have a better shot of 
understanding. The main thing I have seen trip people up is the requirements 
for the delete records in the topic. You can also try out the same patch Sydney 
posted above for filtering out the tombstones in kafka.
   
   I **highly** doubt that in my case it's caused by a tombstone record or the like, because I'm testing this data flow on a dummy/test Postgres table to which I've **only** applied `INSERT` operations so far.
   
   And BTW, I could **successfully** get a **vanilla Kafka ingestion** running end-to-end and populate a partitioned Hudi table as expected, so the issue is definitely specific to switching to `PostgresDebeziumSource` and `PostgresDebeziumAvroPayload`.
   
   Thank you very much for your input. I'll try to figure out the best way to debug this and what's causing the exception I shared above when it comes to DeltaStreamer <> Debezium ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542840841

   
   ## CI report:
   
   * 5369ed017405d0484e5913d184a96fcd958a2a17 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17004)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542839511

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002)
 
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8682:
URL: https://github.com/apache/hudi/pull/8682#issuecomment-1542833998

   
   ## CI report:
   
   * 5369ed017405d0484e5913d184a96fcd958a2a17 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542832656

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002)
 
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * a95196cb0c749c1e1e8fb245a2a58d429159d519 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542824483

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17002)
 
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-05-10 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519

   It was 0.12.2, not 0.13, sorry for the confusion. The issue does not occur with 0.13.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy closed issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-05-10 Thread via GitHub


kazdy closed issue #8259: [SUPPORT] Clustering created files with modified 
schema resulting in corrupted table 
URL: https://github.com/apache/hudi/issues/8259


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


nsivabalan commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190389085


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] nsivabalan commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


nsivabalan commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190108469


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.

Review Comment:
   you started to call "batch" old school



##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental 

[GitHub] [hudi] kazdy commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


kazdy commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542814478

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6198) Support Spark 3.4.0

2023-05-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6198:
-
Labels: pull-request-available  (was: )

> Support Spark 3.4.0
> ---
>
> Key: HUDI-6198
> URL: https://issues.apache.org/jira/browse/HUDI-6198
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
>
> Support Spark 3.4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] mansipp opened a new pull request, #8682: [HUDI-6198] Spark 3.4.0 Upgrade

2023-05-10 Thread via GitHub


mansipp opened a new pull request, #8682:
URL: https://github.com/apache/hudi/pull/8682

   ### Change Logs
   
   Changes to support Spark 3.4.0
   
   ### Impact
   
   Upgrade Spark to 3.4.0
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   Need doc update
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5820) Improve Azure and GH CI's maven build with cache (3.9+)

2023-05-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5820:
-
Labels: pull-request-available  (was: )

> Improve Azure and GH CI's maven build with cache (3.9+)
> ---
>
> Key: HUDI-5820
> URL: https://issues.apache.org/jira/browse/HUDI-5820
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> Refer to PR https://github.com/apache/hudi/pull/7935
> For Azure, we can try downloading and installing maven 3.9 and use the custom 
> maven in the maven@4 task.
> For GH actions CI, more investigation needed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] kazdy closed pull request #7935: [HUDI-5820] Add maven-build-cache-extension

2023-05-10 Thread via GitHub


kazdy closed pull request #7935: [HUDI-5820] Add maven-build-cache-extension
URL: https://github.com/apache/hudi/pull/7935


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-10 Thread via GitHub


kazdy commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1190332409


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for Hudi 

[jira] [Updated] (HUDI-6199) CDC payload with op field for deletes do not work

2023-05-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6199:

Description: 
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom 
payload implementation like AWS DMS payload and Debezium payload are not 
properly migrated to the new APIs introduced by RFC-46, causing the delete 
operation to fail.  Our tests did not catch this.  
 
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; 
however, custom CDC payloads use an op field to mark deletes.
 
Impact:
OverwriteWithLatestAvroPayload (also OverwriteNonDefaultsWithLatestAvroPayload) is
not affected.

For any other custom payloads (AWSDmsAvroPayload, all Debezium payloads),
deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no
issues with custom payloads.

COW:
deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.

MOR:
any deletes will break if not using the "_hoodie_is_deleted" way.

Writer:
all writers (Spark, Flink) except spark-sql.

DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.

  was:
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom 
payload implementation like AWS DMS payload and Debezium payload are not 
properly migrated to the new APIs introduced by RFC-46, causing the delete 
operation to fail.  Our tests did not catch this.  
 
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; 
however, custom CDC payloads use an op field to mark deletes.
 
Impact:
OverwriteWithLatestAvroPayload (also OverwriteNonDefaultsWithLatestAvroPayload):
no issues.

For any other custom payloads (AWSDmsAvroPayload, all Debezium payloads),
deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no
issues with custom payloads.

COW:
deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.

MOR:
any deletes will break if not using the "_hoodie_is_deleted" way.

Writer:
all writers (Spark, Flink) except spark-sql.

DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.


> CDC payload with op field for deletes do not work
> -
>
> Key: HUDI-6199
> URL: https://issues.apache.org/jira/browse/HUDI-6199
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
> release [blocker|https://github.com/apache/hudi/pull/8573], I found that 
> custom payload implementation like AWS DMS payload and Debezium payload are 
> not properly migrated to the new APIs introduced by RFC-46, causing the 
> delete operation to fail.  Our tests did not catch this.  
>  
> It is currently assumed that delete records are marked by 
> "_hoodie_is_deleted"; however, custom CDC payloads use op field to mark 
> deletes.
>  
> Impact:
> OverwriteWithLatest payload(also OverwriteNonDefaultsWithLatestAvroPayload) 
> are not affected.
> for any other custom payloads: (AWSDMSAvropayload, All debezium payloads) 
> deletes are broken. 
> If someone is using "_is_hoodie_deleted" to enforce deletes, there are no 
> issues w/ custome payloads.
> COW: 
> deleting a non-existant will break if not using "_is_hoodie_deleted" way.
> MOR: 
> any deletes will break if not using "_is_hoodie_deleted" way.
> Writer:
> all writers(spark, flink) except spark-sql.
> DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-10 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1542777240

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * c078f0d7a1a0efe7d8a0674d6f3aeff333febd04 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15764)
 
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8643:
URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542770867

   
   ## CI report:
   
   * dc7c3bf6c199ff40a02058d1cd58a6853153b7eb Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17000)
 
   * d415e503be584d30b784eade8cd8a63e13f81457 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17001)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs

2023-05-10 Thread via GitHub


hudi-bot commented on PR #8643:
URL: https://github.com/apache/hudi/pull/8643#issuecomment-1542762628

   
   ## CI report:
   
   * dc7c3bf6c199ff40a02058d1cd58a6853153b7eb Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17000)
 
   * d415e503be584d30b784eade8cd8a63e13f81457 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #7057: [SUPPORT] [OCC] HoodieException: Error getting all file groups in pending clustering

2023-05-10 Thread via GitHub


nsivabalan commented on issue #7057:
URL: https://github.com/apache/hudi/issues/7057#issuecomment-1542762155

   Hey @KnightChess: sorry, we did not get to triage this. Are you still facing issues?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6199) CDC payload with op field for deletes do not work

2023-05-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6199:

Description: 
Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
release [blocker|https://github.com/apache/hudi/pull/8573], I found that custom 
payload implementation like AWS DMS payload and Debezium payload are not 
properly migrated to the new APIs introduced by RFC-46, causing the delete 
operation to fail.  Our tests did not catch this.  
 
It is currently assumed that delete records are marked by "_hoodie_is_deleted"; 
however, custom CDC payloads use an op field to mark deletes.
 
Impact:
OverwriteWithLatestAvroPayload (also OverwriteNonDefaultsWithLatestAvroPayload):
no issues.

For any other custom payloads (AWSDmsAvroPayload, all Debezium payloads),
deletes are broken.
If someone is using "_hoodie_is_deleted" to enforce deletes, there are no
issues with custom payloads.

COW:
deleting a non-existent record will break if not using the "_hoodie_is_deleted" way.

MOR:
any deletes will break if not using the "_hoodie_is_deleted" way.

Writer:
all writers (Spark, Flink) except spark-sql.

DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.

> CDC payload with op field for deletes do not work
> -
>
> Key: HUDI-6199
> URL: https://issues.apache.org/jira/browse/HUDI-6199
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
> release [blocker|https://github.com/apache/hudi/pull/8573], I found that 
> custom payload implementation like AWS DMS payload and Debezium payload are 
> not properly migrated to the new APIs introduced by RFC-46, causing the 
> delete operation to fail.  Our tests did not catch this.  
>  
> It is currently assumed that delete records are marked by 
> "_hoodie_is_deleted"; however, custom CDC payloads use op field to mark 
> deletes.
>  
> Impact:
> OverwriteWithLatest payload(also OverwriteNonDefaultsWithLatestAvroPayload)
> no issues. 
> for any other custom payloads: (AWSDMSAvropayload, All debezium payloads, ) 
> deletes are broken. 
> If someone is using "_is_hoodie_deleted" to enforce deletes, there are no 
> issues w/ custome payloads. 
> COW: 
> deleting a non-existant will break if not using "_is_hoodie_deleted" way. 
> MOR: 
> any deletes will break if not using "_is_hoodie_deleted" way. 
> Writer:
> all writers(spark, flink) except spark-sql.
> DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

