[GitHub] [hudi] danny0405 commented on pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-11 Thread via GitHub


danny0405 commented on PR #8675:
URL: https://github.com/apache/hudi/pull/8675#issuecomment-1545211244

   Approved the PR first; we can address the delete tests in follow-up PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan merged pull request #8694: [MINOR] Migrate azure-pipelines.yml with notes

2023-05-11 Thread via GitHub


xushiyan merged PR #8694:
URL: https://github.com/apache/hudi/pull/8694


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Migrate azure-pipelines.yml with notes (#8694)

2023-05-11 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d0b292022a [MINOR] Migrate azure-pipelines.yml with notes (#8694)
9d0b292022a is described below

commit 9d0b292022aab3265ee988ddbdf9311e1b9548ec
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Fri May 12 13:57:27 2023 +0800

[MINOR] Migrate azure-pipelines.yml with notes (#8694)
---
 azure-pipelines.yml => azure-pipelines-20230430.yml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/azure-pipelines.yml b/azure-pipelines-20230430.yml
similarity index 98%
rename from azure-pipelines.yml
rename to azure-pipelines-20230430.yml
index c6d5aee372c..7d391d4a4c3 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines-20230430.yml
@@ -13,6 +13,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+# NOTE:
+# This config file defines how Azure CI runs tests with Spark 2.4 and Flink 1.17 profiles.
+# PRs will need to keep in sync with master's version to trigger the CI runs.
+
 trigger:
   branches:
     include:



[GitHub] [hudi] hudi-bot commented on pull request #8688: [HUDI-6190] Append description in the HoodieTableFactory.checkRecordKey exception.

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8688:
URL: https://github.com/apache/hudi/pull/8688#issuecomment-1545197028

   
   ## CI report:
   
   * 30af5b04eadd78ca76c4160fda9bdd91b6f98e2d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17019)
   * 3b233a8683b3daa7f7168c29ec6eb901f0581b56 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17033)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan opened a new pull request, #8694: [MINOR] Migrate azure-pipelines.yml with notes

2023-05-11 Thread via GitHub


xushiyan opened a new pull request, #8694:
URL: https://github.com/apache/hudi/pull/8694

   ### Change Logs
   
   Change the Azure pipeline config to `azure-pipelines-20230430.yml` so that publishing JUnit results is disabled for all PRs.
   
   ### Impact
   
   A PR's CI won't be triggered until it is rebased on master.
   
   ### Risk level
   
   Low
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8688: [HUDI-6190] Append description in the HoodieTableFactory.checkRecordKey exception.

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8688:
URL: https://github.com/apache/hudi/pull/8688#issuecomment-1545191798

   
   ## CI report:
   
   * 30af5b04eadd78ca76c4160fda9bdd91b6f98e2d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17019)
   * 3b233a8683b3daa7f7168c29ec6eb901f0581b56 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jpechane commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-11 Thread via GitHub


jpechane commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1545173516

   @Sam-Serpoosh Hi, Debezium does not do any serialization. It just prepares the data structure described by the Kafka Connect schema. The serialization itself is done by the Avro converter provided by Confluent. Debezium is unable to influence the serialization in any way.
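   For reference, a minimal sketch (an assumption, not taken from this thread) of the Kafka Connect converter settings that hand serialization to Confluent's `io.confluent.connect.avro.AvroConverter`; the schema registry URL below is a placeholder:
   
   ```java
   import java.util.Properties;
   
   // Sketch only: standard Kafka Connect converter settings. Debezium produces
   // Connect records; the converter configured here turns them into Avro bytes.
   public class AvroConverterConfigExample {
     public static Properties connectorProps() {
       Properties props = new Properties();
       props.put("key.converter", "io.confluent.connect.avro.AvroConverter");
       props.put("value.converter", "io.confluent.connect.avro.AvroConverter");
       // placeholder registry endpoint; replace with your own schema registry URL
       props.put("key.converter.schema.registry.url", "http://schema-registry:8081");
       props.put("value.converter.schema.registry.url", "http://schema-registry:8081");
       return props;
     }
   }
   ```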


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-11 Thread via GitHub


danny0405 commented on PR #8505:
URL: https://github.com/apache/hudi/pull/8505#issuecomment-1545118849

   
[6106.patch.zip](https://github.com/apache/hudi/files/11459135/6106.patch.zip)
   Thanks for the contribution, I have reviewed it and created a patch~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8582: [HUDI-6142] Refactor the code related to creating user-defined index

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8582:
URL: https://github.com/apache/hudi/pull/8582#issuecomment-1545109750

   
   ## CI report:
   
   * 83e1a52f92683b26d84eb4731c1829eb8a9aa084 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16889)
   * 036058900c0af46cf5fc83b467399f2675d39206 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17032)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8669: [HUDI-5362] Rebase IncrementalRelation over HoodieBaseRelation

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8669:
URL: https://github.com/apache/hudi/pull/8669#issuecomment-1545098262

   
   ## CI report:
   
   * 0eacefd8bc063e0c574068f09670014804f10dc2 UNKNOWN
   * 9a00f9d6ece8c5b290975988e1bc40f2ba7ff91b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17030)
   * b6b0fa0b3bf274c542fa385c6b9dfe8df69925b1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17031)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8582: [HUDI-6142] Refactor the code related to creating user-defined index

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8582:
URL: https://github.com/apache/hudi/pull/8582#issuecomment-1545097818

   
   ## CI report:
   
   * 83e1a52f92683b26d84eb4731c1829eb8a9aa084 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16889)
   * 036058900c0af46cf5fc83b467399f2675d39206 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 closed issue #6965: [SUPPORT]Data can be found in the latest partition of hudi table, but not in the historical partition.

2023-05-11 Thread via GitHub


eric9204 closed issue #6965: [SUPPORT]Data can be found in the latest partition of hudi table, but not in the historical partition.
URL: https://github.com/apache/hudi/issues/6965


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 closed issue #6966: [SUPPORT]HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id308723 partitionPath=202210141643}, currentLocation='null', newLo

2023-05-11 Thread via GitHub


eric9204 closed issue #6966: [SUPPORT]HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id308723 partitionPath=202210141643}, currentLocation='null', newLocation='null'}
URL: https://github.com/apache/hudi/issues/6966


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8680: [HUDI-6192] Make HoodieFlinkCompactor and HoodieFlinkClusteringJob service mode as long running streaming job

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8680:
URL: https://github.com/apache/hudi/pull/8680#discussion_r1191890148


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -259,93 +257,105 @@ private void cluster() throws Exception {
       table.getMetaClient().reloadActiveTimeline();
     }

-    // fetch the instant based on the configured execution sequence
-    List<HoodieInstant> instants = ClusteringUtils.getPendingClusteringInstantTimes(table.getMetaClient());
-    if (instants.isEmpty()) {
-      // do nothing.
-      LOG.info("No clustering plan scheduled, turns on the clustering plan schedule with --schedule option");
-      return;
-    }
-
-    final HoodieInstant clusteringInstant;
-    if (cfg.clusteringInstantTime != null) {
-      clusteringInstant = instants.stream()
-          .filter(i -> i.getTimestamp().equals(cfg.clusteringInstantTime))
-          .findFirst()
-          .orElseThrow(() -> new HoodieException("Clustering instant [" + cfg.clusteringInstantTime + "] not found"));
+    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
+
+    int clusteringParallelism;
+    DataStream<ClusteringPlanEvent> planStream;
+    HoodieInstant clusteringInstant = null;
+    if (serviceMode) {
+      clusteringParallelism = conf.getInteger(FlinkOptions.CLUSTERING_TASKS);
+      planStream = env.addSource(new ServiceSourceFunction(conf.get(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL)))
+          .name("clustering_service_source")
+          .uid("uid_clustering_service_source")
+          .setParallelism(1)
+          .transform("cluster_plan_generate", TypeInformation.of(ClusteringPlanEvent.class), new ClusteringPlanOperator(conf))
+          .setParallelism(1);
     } else {
-      // check for inflight clustering plans and roll them back if required
-      clusteringInstant =
-          CompactionUtil.isLIFO(cfg.clusteringSeq) ? instants.get(instants.size() - 1) : instants.get(0);
-    }
+      // fetch the instant based on the configured execution sequence
+      List<HoodieInstant> instants = ClusteringUtils.getPendingClusteringInstantTimes(table.getMetaClient());
+      if (instants.isEmpty()) {

Review Comment:
   @danny0405, my idea is to keep the same clustering execution pipeline as inline clustering, which at least ensures that any problem with this streaming job is the same as that of inline clustering. WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8680: [HUDI-6192] Make HoodieFlinkCompactor and HoodieFlinkClusteringJob service mode as long running streaming job

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8680:
URL: https://github.com/apache/hudi/pull/8680#discussion_r1191888621



Review Comment:
   @danny0405, I don't think we need exactly-once semantics here, because this only requires adding a long-running source to create the clustering execution pipeline. The semantics are guaranteed via `ClusteringPlanOperator`, `ClusteringOperator` and `ClusteringCommitSink`. Meanwhile, I have tested the streaming job internally.
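   
   For context, a minimal sketch of what such a long-running trigger source could look like with Flink's `SourceFunction` API; the class, field, and event names here are illustrative assumptions, not the actual `ServiceSourceFunction` from this PR:
   
   ```java
   import org.apache.flink.streaming.api.functions.source.SourceFunction;
   
   // Sketch only: a long-running source that periodically emits a trigger event so a
   // downstream plan operator can look for newly scheduled clustering plans.
   public class ServiceTriggerSource implements SourceFunction<Long> {
     private final long intervalMs;          // assumed: derived from the checkpoint interval
     private volatile boolean running = true;
   
     public ServiceTriggerSource(long intervalMs) {
       this.intervalMs = intervalMs;
     }
   
     @Override
     public void run(SourceContext<Long> ctx) throws Exception {
       while (running) {
         // emit under the checkpoint lock so records and checkpoints do not interleave
         synchronized (ctx.getCheckpointLock()) {
           ctx.collect(System.currentTimeMillis());
         }
         Thread.sleep(intervalMs);
       }
     }
   
     @Override
     public void cancel() {
       running = false;
     }
   }
   ```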



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191885108


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF.
+Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components.
+The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and
+novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format.
+Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes.
+Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout.
+Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes"
+to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions
+around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
+Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
+keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized,
+relational data model for

[GitHub] [hudi] codope commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


codope commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191884246



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191883810



[GitHub] [hudi] codope commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


codope commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191882249



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191882046


Review Comment:
   @codope, reviving that devlist thread makes sense to me. But we could list this purpose for streaming lakehouse in this RFC. cc @vinothchandar @yihua



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191882046


Review Comment:
   @codope, reviving that devlist thread makes sense to me. But we could list this purpose for streaming lakehouse in this RFC.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


codope commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191881376



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191880850



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191877066


+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 

Review Comment:
   @kazdy, 

[GitHub] [hudi] hudi-bot commented on pull request #8669: [HUDI-5362] Rebase IncrementalRelation over HoodieBaseRelation

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8669:
URL: https://github.com/apache/hudi/pull/8669#issuecomment-1545060546

   
   ## CI report:
   
   * 0eacefd8bc063e0c574068f09670014804f10dc2 UNKNOWN
   * 9a00f9d6ece8c5b290975988e1bc40f2ba7ff91b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17030)
 
   * b6b0fa0b3bf274c542fa385c6b9dfe8df69925b1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8669: [HUDI-5362] Rebase IncrementalRelation over HoodieBaseRelation

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8669:
URL: https://github.com/apache/hudi/pull/8669#issuecomment-1545056608

   
   ## CI report:
   
   * 9b8fd1cd5d56d58fc52d334a54e326c405fadf53 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16966)
 
   * 0eacefd8bc063e0c574068f09670014804f10dc2 UNKNOWN
   * 9a00f9d6ece8c5b290975988e1bc40f2ba7ff91b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] weimingdiit commented on a diff in pull request #8301: [HUDI-5988] Add a param, Implement a full partition sync operation wh…

2023-05-11 Thread via GitHub


weimingdiit commented on code in PR #8301:
URL: https://github.com/apache/hudi/pull/8301#discussion_r1191865864


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##
@@ -163,6 +163,11 @@ public class HoodieSyncConfig extends HoodieConfig {
   .defaultValue("")
   .withDocumentation("The spark version used when syncing with a 
metastore.");
 
+  public static final ConfigProperty META_SYNC_PARTITION_FIXMODE = 
ConfigProperty
+  .key("hoodie.datasource.hive_sync.partition_fixmode")
+  .defaultValue("false")
+  .withDocumentation("Implement a full partition sync operation when 
partitions are lost.");

Review Comment:
   @yihua @danny0405 
   Maybe I didn't describe it clearly. The purpose of this PR is to provide a 
tool parameter that controls whether a full partition synchronization/alignment 
operation is performed when the metadata of the synced partitions is found to be 
lost.
   
   Looking at the current code, the logic is to do incremental synchronization 
according to lastCommitTimeSynced. If this option is set to true, the 
syncAllPartitions method will be used every time to synchronize all partitions, 
which is unnecessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


codope commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191867805


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 

Review Comment:
   Good point! I think @SteNicholas is pointing towards more general-purpose 
streaming capabilities such as watermarks, windows and accumulators - 
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/. Please 
correct me if I'm wrong.
   We should certainly revive that devlist thread for a detailed discussion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] weimingdiit commented on a diff in pull request #8301: [HUDI-5988] Add a param, Implement a full partition sync operation wh…

2023-05-11 Thread via GitHub


weimingdiit commented on code in PR #8301:
URL: https://github.com/apache/hudi/pull/8301#discussion_r1191865864


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##
@@ -163,6 +163,11 @@ public class HoodieSyncConfig extends HoodieConfig {
   .defaultValue("")
   .withDocumentation("The spark version used when syncing with a 
metastore.");
 
+  public static final ConfigProperty META_SYNC_PARTITION_FIXMODE = 
ConfigProperty
+  .key("hoodie.datasource.hive_sync.partition_fixmode")
+  .defaultValue("false")
+  .withDocumentation("Implement a full partition sync operation when 
partitions are lost.");

Review Comment:
   @yihua @danny0405 
   Maybe I didn't describe it clearly. The purpose of this PR is to provide a 
tool parameter for the case when the metadata of the synced partitions is found 
to be lost, **and the function of this parameter is not to do incremental 
partition synchronization, but to act as a switch that controls whether to 
perform a synchronization/alignment operation across all partitions.**
   
   Looking at the current code, the logic is to do incremental synchronization 
according to lastCommitTimeSynced. If this option is set to true, the 
syncAllPartitions method will be used every time to synchronize all partitions, 
which is unnecessary.
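
   To make the intended behavior concrete, here is a minimal sketch (not the 
actual hudi-sync code; the class and helper method names are hypothetical) of 
how such a switch could gate the full vs. incremental partition sync paths:

   ```java
   import java.util.Optional;
   import java.util.Properties;

   // Illustrative sketch only: helper names and option handling are assumptions,
   // not the real HiveSyncTool implementation.
   class PartitionSyncSketch {
     private final Properties props;

     PartitionSyncSketch(Properties props) {
       this.props = props;
     }

     void syncPartitions(Optional<String> lastCommitTimeSynced) {
       boolean fixMode = Boolean.parseBoolean(
           props.getProperty("hoodie.datasource.hive_sync.partition_fixmode", "false"));
       if (fixMode || !lastCommitTimeSynced.isPresent()) {
         // switch enabled (or nothing synced yet): re-align the metastore with all partitions on storage
         syncAllPartitions();
       } else {
         // default path: only sync partitions written since the last synced commit
         syncPartitionsSince(lastCommitTimeSynced.get());
       }
     }

     private void syncAllPartitions() {
       // hypothetical: list every partition on storage, add missing and drop stale ones in the metastore
     }

     private void syncPartitionsSince(String commitTime) {
       // hypothetical: only handle partitions touched by commits after commitTime
     }
   }
   ```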



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Sam-Serpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-11 Thread via GitHub


Sam-Serpoosh commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1545048150

   I can reproduce this with a much simpler schema and corresponding Kafka 
key-value messages as well. Let's say we have this schema in our Confluent 
Schema Registry (SR):
   
   ```json
   {
     "type": "record",
     "name": "Envelope",
     "fields": [
       {
         "name": "before",
         "default": null,
         "type": [
           "null",
           {
             "name": "Value",
             "type": "record",
             "fields": [
               {
                 "name": "id",
                 "type": "int"
               },
               {
                 "name": "fst_name",
                 "type": "string"
               }
             ]
           }
         ]
       },
       {
         "name": "after",
         "default": null,
         "type": [
           "null",
           "Value"
         ]
       },
       {
         "name": "op",
         "type": "string"
       }
     ]
   }
   ```
   
   Then when we try to publish a message in the following format:
   
   ```json
   {
     "after": {
       "id": 10,
       "fst_name": "Bob"
     },
     "before": null,
     "op": "c"
   }
   ```
   
   The `kafka-avro-console-producer` throws up with this exception:
   
   ```
   Caused by: org.apache.avro.AvroTypeException: Unknown union branch id
       at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:434)
       at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:282)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
       at io.confluent.kafka.schemaregistry.avro.AvroSchemaUtils.toObject(AvroSchemaUtils.java:214)
       at io.confluent.kafka.formatter.AvroMessageReader.readFrom(AvroMessageReader.java:124)
       ... 3 more
   ```
   
   Changing the input message to the following format leads to successful 
serialization and publishing to Kafka (simply wrapping `id` & `fst_name` inside 
a `Value` object):
   
   ```json
   {
     "after": {
       "Value": {
         "id": 10,
         "fst_name": "Bob"
       }
     },
     "before": null,
     "op": "c"
   }
   ```

[GitHub] [hudi] danny0405 commented on a diff in pull request #8689: [HUDI-6197] Fix use CONTAINER_ID to judge hudi is running in yarn con…

2023-05-11 Thread via GitHub


danny0405 commented on code in PR #8689:
URL: https://github.com/apache/hudi/pull/8689#discussion_r1191858073


##
hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java:
##
@@ -226,16 +226,17 @@ public static String[] getConfiguredLocalDirs() {
 
   private static boolean isRunningInYarnContainer() {
 // These environment variables are set by YARN.
-return System.getenv("CONTAINER_ID") != null;
+return System.getenv("CONTAINER_ID") != null
+&& System.getenv("LOCAL_DIRS") != null;

Review Comment:
   What nuances do you mean? The logic does not seem to change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-11 Thread via GitHub


danny0405 commented on code in PR #8505:
URL: https://github.com/apache/hudi/pull/8505#discussion_r1191855786


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/offlinejob/TestOfflineHoodieCompactor.java:
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.offlinejob;
+
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.HoodieStorageConfig;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.config.HoodieCleanConfig;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieLayoutConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner;
+import org.apache.hudi.table.storage.HoodieStorageLayout;
+import org.apache.hudi.utilities.HoodieCompactor;
+
+import org.junit.jupiter.api.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Properties;
+
+import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+
+public class TestOfflineHoodieCompactor extends HoodieOfflineJobTestBase {
+
+  protected HoodieCompactor initialHoodieCompactorClean(String tableBasePath, 
Boolean runSchedule, String scheduleAndExecute,
+ Boolean isAutoClean) {

Review Comment:
   Yeah, that's true; just moving the tests into it should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8669: [HUDI-5362] Rebase IncrementalRelation over HoodieBaseRelation

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8669:
URL: https://github.com/apache/hudi/pull/8669#issuecomment-1545026508

   
   ## CI report:
   
   * 9b8fd1cd5d56d58fc52d334a54e326c405fadf53 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16966)
 
   * 0eacefd8bc063e0c574068f09670014804f10dc2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-11 Thread via GitHub


zhuanshenbsj1 commented on code in PR #8505:
URL: https://github.com/apache/hudi/pull/8505#discussion_r1191850116


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/offlinejob/TestOfflineHoodieCompactor.java:
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.offlinejob;
+
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.HoodieStorageConfig;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.config.HoodieCleanConfig;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieLayoutConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner;
+import org.apache.hudi.table.storage.HoodieStorageLayout;
+import org.apache.hudi.utilities.HoodieCompactor;
+
+import org.junit.jupiter.api.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Properties;
+
+import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+
+public class TestOfflineHoodieCompactor extends HoodieOfflineJobTestBase {
+
+  protected HoodieCompactor initialHoodieCompactorClean(String tableBasePath, 
Boolean runSchedule, String scheduleAndExecute,
+ Boolean isAutoClean) {

Review Comment:
   There seems to be no existing test class for 
org.apache.hudi.utilities.HoodieCompactor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-11 Thread via GitHub


zhuanshenbsj1 commented on code in PR #8505:
URL: https://github.com/apache/hudi/pull/8505#discussion_r1191848563


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/offlinejob/TestOfflineHoodieCompactor.java:
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.offlinejob;
+
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.HoodieStorageConfig;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.config.HoodieCleanConfig;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieLayoutConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner;
+import org.apache.hudi.table.storage.HoodieStorageLayout;
+import org.apache.hudi.utilities.HoodieCompactor;
+
+import org.junit.jupiter.api.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Properties;
+
+import static 
org.apache.hudi.common.testutils.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+
+public class TestOfflineHoodieCompactor extends HoodieOfflineJobTestBase {
+
+  protected HoodieCompactor initialHoodieCompactorClean(String tableBasePath, 
Boolean runSchedule, String scheduleAndExecute,
+ Boolean isAutoClean) {

Review Comment:
   > Can we move the tests to `TestHoodieCompactor` ?
   
   That test class belongs to the hudi-spark-client project (not 
hudi-utilities), and is mainly used to test online compaction. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] houhang1005 commented on a diff in pull request #8688: [HUDI-6190] Append description in the HoodieTableFactory.checkRecordKey exception.

2023-05-11 Thread via GitHub


houhang1005 commented on code in PR #8688:
URL: https://github.com/apache/hudi/pull/8688#discussion_r1191841036


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java:
##
@@ -185,7 +185,9 @@ private void checkRecordKey(Configuration conf, 
ResolvedSchema schema) {
   && FlinkOptions.RECORD_KEY_FIELD.defaultValue().equals(recordKeys[0])
   && !fields.contains(recordKeys[0])) {
 throw new HoodieValidationException("Primary key definition is 
required, use either PRIMARY KEY syntax "
-+ "or option '" + FlinkOptions.RECORD_KEY_FIELD.key() + "' to 
specify.");
++ "or option '" + FlinkOptions.RECORD_KEY_FIELD.key() + "' to 
specify. "
++ "Otherwise the default primary key '"

Review Comment:
   Sure, I will do it today -.-



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8693: [DNM][HUDI-6204] Test bundle validation on Spark 3.3.2 with older commits

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8693:
URL: https://github.com/apache/hudi/pull/8693#issuecomment-1544979697

   
   ## CI report:
   
   * 9c6c5ff60fe6fbbf7c6484d91f81dadc68a46008 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17028)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-4630] Add transformer capability to individual feeds in MultiTableDeltaStreamer (#8399)

2023-05-11 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b497ef1a3f0 [HUDI-4630] Add transformer capability to individual feeds 
in MultiTableDeltaStreamer (#8399)
b497ef1a3f0 is described below

commit b497ef1a3f09c50bca889eeb457be70f1c6544c6
Author: Santhosh Kumar M <8852302+yesemsanthoshku...@users.noreply.github.com>
AuthorDate: Fri May 12 06:23:18 2023 +0530

[HUDI-4630] Add transformer capability to individual feeds in 
MultiTableDeltaStreamer (#8399)
---
 .../deltastreamer/HoodieMultiTableDeltaStreamer.java   | 10 ++
 .../deltastreamer/TestHoodieMultiTableDeltaStreamer.java   | 14 ++
 .../short_trip_uber_config.properties  |  1 +
 3 files changed, 25 insertions(+)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
index 697eccad831..3b5930f1559 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
@@ -134,6 +134,7 @@ public class HoodieMultiTableDeltaStreamer {
   if (cfg.enableMetaSync && 
StringUtils.isNullOrEmpty(tableProperties.getString(HoodieSyncConfig.META_SYNC_TABLE_NAME.key(),
 ""))) {
 throw new HoodieException("Meta sync table field not provided!");
   }
+  populateTransformerProps(cfg, tableProperties);
   populateSchemaProviderProps(cfg, tableProperties);
   executionContext = new TableExecutionContext();
   executionContext.setProperties(tableProperties);
@@ -144,6 +145,14 @@ public class HoodieMultiTableDeltaStreamer {
 }
   }
 
+  private void populateTransformerProps(HoodieDeltaStreamer.Config cfg, 
TypedProperties typedProperties) {
+String transformerClass = 
typedProperties.getString(Constants.TRANSFORMER_CLASS, null);
+if (transformerClass != null && !transformerClass.trim().isEmpty()) {
+  List transformerClassNameOverride = 
Arrays.asList(transformerClass.split(","));
+  cfg.transformerClassNames = transformerClassNameOverride;
+}
+  }
+
   private List getTablesToBeIngested(TypedProperties properties) {
 String combinedTablesString = 
properties.getString(HoodieDeltaStreamerConfig.TABLES_TO_BE_INGESTED.key());
 if (combinedTablesString == null) {
@@ -453,6 +462,7 @@ public class HoodieMultiTableDeltaStreamer {
 private static final String DELIMITER = ".";
 private static final String UNDERSCORE = "_";
 private static final String COMMA_SEPARATOR = ",";
+private static final String TRANSFORMER_CLASS = 
"hoodie.deltastreamer.transformer.class";
   }
 
   public Set getSuccessTables() {
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieMultiTableDeltaStreamer.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieMultiTableDeltaStreamer.java
index d6121a5b500..4d6235779a1 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieMultiTableDeltaStreamer.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieMultiTableDeltaStreamer.java
@@ -42,6 +42,7 @@ import java.util.Arrays;
 import java.util.List;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
@@ -76,6 +77,16 @@ public class TestHoodieMultiTableDeltaStreamer extends 
HoodieDeltaStreamerTestBa
 }
   }
 
+  @Test
+  public void testEmptyTransformerProps() throws IOException {
+// HUDI-4630: If there is no transformer props passed through, don't 
populate the transformerClassNames
+HoodieMultiTableDeltaStreamer.Config cfg = 
TestHelpers.getConfig(PROPS_FILENAME_TEST_SOURCE1, basePath + "/config", 
TestDataSource.class.getName(), false, false, null);
+HoodieDeltaStreamer.Config dsConfig = new HoodieDeltaStreamer.Config();
+TypedProperties tblProperties = new TypedProperties();
+HoodieMultiTableDeltaStreamer streamer = new 
HoodieMultiTableDeltaStreamer(cfg, jsc);
+assertNull(cfg.transformerClassNames);
+  }
+  
   @Test
   public void testMetaSyncConfig() throws IOException {
 HoodieMultiTableDeltaStreamer.Config cfg = 
TestHelpers.getConfig(PROPS_FILENAME_TEST_SOURCE1, basePath + "/config", 
TestDataSource.class.getName(), true, true, null);
@@ -244,10 +255,13 @@ public class TestHoodieMultiTableDeltaStreamer extends 
HoodieDeltaStreamerTestBa
 case "dummy_table_short_trip":
   

[GitHub] [hudi] bvaradar merged pull request #8399: [HUDI-4630] Add transformer capability to individual feeds in MultiTableDeltaStreamer

2023-05-11 Thread via GitHub


bvaradar merged PR #8399:
URL: https://github.com/apache/hudi/pull/8399


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Sam-Serpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-11 Thread via GitHub


Sam-Serpoosh commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1544940333

   @jpechane This seems to be related to `Debezium` IIUC and how it serializes 
the CDC events prior to publishing them to Kafka. As detailed in 
[this 
comment](https://github.com/apache/hudi/issues/8519#issuecomment-1542967885) 
and [this 
one](https://github.com/apache/hudi/issues/8519#issuecomment-1544320455), there 
is an **extra/nested** object/record field named `Value` under `after` or 
`before`, and I'm not sure why that's the case.
   
   The `before` and `after` fields share a **union type** that looks like:
   
   ```json
   {
 "name": "before",
 "type": [
   "null",
   {
 "type": "record",
 "name": "Value",
 "fields": [
   {
 "name": "id",
 "type": {
   "type": "int",
   "connect.default": 0
 },
 "default": 0
   },
   {
 "name": "name",
 "type": "string"
   },
   {
 "name": "age",
 "type": "int"
   },
   {
 "name": "created_at",
 "type": [
   "null",
   {
 "type": "long",
 "connect.version": 1,
 "connect.name": "io.debezium.time.MicroTimestamp"
   }
 ],
 "default": null
   },
   {
 "name": "event_ts",
 "type": [
   "null",
   "long"
 ],
 "default": null
   }
 ],
 "connect.name": 
"..samser_customers.Value"
   }
 ],
 "default": null
   },
   {
 "name": "after",
 "type": [
   "null",
   "Value"
 ],
 "default": null
   },
   ...
   }
   ```
   
   However, when I consume/deserialize events using Confluent's 
`kafka-avro-console-consumer`, I see that the `before` field has an 
**OBJECT/RECORD** field named `Value` under it, and the fields (e.g. `id` and 
`name`) are associated with that instead of directly with the `before` field. 
According to the aforementioned Avro schema, **Value** is just the TYPE of the 
`before` field, but for some reason it comes out as a **field**, so we end up 
with `before.Value.id` (or `after.Value.id`) instead of `after.id`.
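
   For what it's worth, the same wrapping can be reproduced with plain Avro, 
independent of Debezium or Hudi. Here is a minimal, self-contained sketch 
(reusing the simplified Envelope/Value schema from my earlier reproduction, so 
the exact field list is an assumption) showing that Avro's standard JSON 
encoding keys the non-null branch of a union by its type name:

   ```java
   import java.io.ByteArrayOutputStream;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.Encoder;
   import org.apache.avro.io.EncoderFactory;

   public class UnionJsonEncodingDemo {
     public static void main(String[] args) throws Exception {
       // Simplified version of the Envelope schema discussed above (illustration only).
       String schemaJson =
           "{\"type\":\"record\",\"name\":\"Envelope\",\"fields\":["
           + "{\"name\":\"before\",\"default\":null,\"type\":[\"null\","
           + "{\"type\":\"record\",\"name\":\"Value\",\"fields\":["
           + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"fst_name\",\"type\":\"string\"}]}]},"
           + "{\"name\":\"after\",\"default\":null,\"type\":[\"null\",\"Value\"]},"
           + "{\"name\":\"op\",\"type\":\"string\"}]}";
       Schema envelope = new Schema.Parser().parse(schemaJson);
       Schema valueSchema = envelope.getField("after").schema().getTypes().get(1); // the "Value" branch

       GenericRecord value = new GenericData.Record(valueSchema);
       value.put("id", 10);
       value.put("fst_name", "Bob");

       GenericRecord record = new GenericData.Record(envelope);
       record.put("before", null);
       record.put("after", value);
       record.put("op", "c");

       ByteArrayOutputStream out = new ByteArrayOutputStream();
       Encoder encoder = EncoderFactory.get().jsonEncoder(envelope, out);
       new GenericDatumWriter<GenericRecord>(envelope).write(record, encoder);
       encoder.flush();

       // Prints something like:
       // {"before":null,"after":{"Value":{"id":10,"fst_name":"Bob"}},"op":"c"}
       // i.e. the non-null union branch is wrapped with its type name ("Value"),
       // which is why the JSON view shows after.Value.id rather than after.id.
       System.out.println(out.toString());
     }
   }
   ```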
   
   Any thoughts on why this is happening? We don't see this behavior in the 
case of the `source` field (whose type is **also** a **record**); that field 
comes out correctly. In case it's needed, here's my Debezium Connector 
configuration:
   
   ```
   schema.include.list: public
   key.converter: io.confluent.connect.avro.AvroConverter
   key.converter.schema.registry.url: http://:8081
   value.converter: io.confluent.connect.avro.AvroConverter
   value.converter.schema.registry.url: http://:8081
   table.include.list: public.samser_customers
   topic.creation.enable: true
   topic.creation.default.replication.factor: 1
   topic.creation.default.partitions: 1
   topic.creation.default.cleanup.policy: compact
   topic.creation.default.compression.type: lz4
   decimal.handling.mode: double
   tombstones.on.delete: false
   ```
   
   Thank you very much in advance; I appreciate your help here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544930398

   
   ## CI report:
   
   * 3f4a740c6e9df40b04416e8c9632eec06487f76c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17027)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] clownxc commented on a diff in pull request #8659: [HUDI-6155] Fix cleaner based on hours for earliest commit to retain

2023-05-11 Thread via GitHub


clownxc commented on code in PR #8659:
URL: https://github.com/apache/hudi/pull/8659#discussion_r1191784979


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -144,4 +144,10 @@ public static boolean isValidInstantTime(String 
instantTime) {
   return false;
 }
   }
+
+  private static ZoneId getZoneId() {
+return commitTimeZone.equals(HoodieTimelineTimeZone.LOCAL)
+? ZoneId.systemDefault()

Review Comment:
   > metaClient
   
   I see, I will try to modify the code as you suggest.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] clownxc commented on a diff in pull request #8659: [HUDI-6155] Fix cleaner based on hours for earliest commit to retain

2023-05-11 Thread via GitHub


clownxc commented on code in PR #8659:
URL: https://github.com/apache/hudi/pull/8659#discussion_r1191785146


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -144,4 +144,10 @@ public static boolean isValidInstantTime(String 
instantTime) {
   return false;
 }
   }
+
+  private static ZoneId getZoneId() {
+return commitTimeZone.equals(HoodieTimelineTimeZone.LOCAL)
+? ZoneId.systemDefault()

Review Comment:
   > See the discussions we take in: #8631
   
   I see, I will try to modify the code as you suggest.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8692: [HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8692:
URL: https://github.com/apache/hudi/pull/8692#issuecomment-1544830453

   
   ## CI report:
   
   * 1c61adbdb5aea908ddc4c981fa871988f3764983 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17026)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-11 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1544825092

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * ce89b12639ebe78146afcd2f9c95d646226f1127 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17025)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1191727349


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -562,53 +532,144 @@ private  boolean 
isCommitRevertedByInFlightAction(
   /**
* Initialize the Metadata Table by listing files and partitions from the 
file system.
*
-   * @param dataMetaClient   - {@code HoodieTableMetaClient} for the 
dataset.
+   * @param initializationTime   - Timestamp to use for the commit
+   * @param partitionsToInit - List of MDT partitions to initialize
* @param inflightInstantTimestamp - Current action instant responsible for 
this initialization
*/
-  private boolean initializeFromFilesystem(HoodieTableMetaClient 
dataMetaClient,
+  private boolean initializeFromFilesystem(String initializationTime, 
List partitionsToInit,
Option 
inflightInstantTimestamp) throws IOException {
 if (anyPendingDataInstant(dataMetaClient, inflightInstantTimestamp)) {
   return false;
 }
 
-String createInstantTime = getInitialCommitInstantTime(dataMetaClient);
-
-initializeMetaClient(DEFAULT_METADATA_POPULATE_META_FIELDS);
-initTableMetadata();
-// if async metadata indexing is enabled,
-// then only initialize files partition as other partitions will be built 
using HoodieIndexer
-List enabledPartitionTypes =  new ArrayList<>();
-if (dataWriteConfig.isMetadataAsyncIndex()) {
-  enabledPartitionTypes.add(MetadataPartitionType.FILES);
-} else {
-  // all enabled ones should be initialized
-  enabledPartitionTypes = this.enabledPartitionTypes;
+// FILES partition is always initialized first
+
ValidationUtils.checkArgument(!partitionsToInit.contains(MetadataPartitionType.FILES)
+|| partitionsToInit.get(0).equals(MetadataPartitionType.FILES), 
"FILES partition should be initialized first: " + partitionsToInit);
+
+metadataMetaClient = initializeMetaClient();
+
+// Get a complete list of files and partitions from the file system or 
from already initialized FILES partition of MDT
+boolean filesPartitionAvailable = 
dataMetaClient.getTableConfig().isMetadataPartitionEnabled(MetadataPartitionType.FILES);
+List partitionInfoList = filesPartitionAvailable ? 
listAllPartitionsFromMDT(initializationTime) : 
listAllPartitionsFromFilesystem(initializationTime);
+Map> partitionToFilesMap = 
partitionInfoList.stream()
+.map(p -> {
+  String partitionName = 
HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
+  return Pair.of(partitionName, p.getFileNameToSizeMap());
+})
+.collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+
+for (MetadataPartitionType partitionType : partitionsToInit) {
+  // Find the commit timestamp to use for this partition. Each 
initialization should use its own unique commit time.
+  String commitTimeForPartition = 
generateUniqueCommitInstantTime(initializationTime);
+
+  LOG.info("Initializing MDT partition " + partitionType + " at instant " 
+ commitTimeForPartition);
+
+  Pair> fileGroupCountAndRecordsPair;
+  switch (partitionType) {
+case FILES:
+  fileGroupCountAndRecordsPair = 
initializeFilesPartition(initializationTime, partitionInfoList);
+  break;
+case BLOOM_FILTERS:
+  fileGroupCountAndRecordsPair = 
initializeBloomFiltersPartition(initializationTime, partitionToFilesMap);
+  break;
+case COLUMN_STATS:
+  fileGroupCountAndRecordsPair = 
initializeColumnStatsPartition(partitionToFilesMap);
+  break;
+default:
+  throw new HoodieMetadataException("Unsupported MDT partition type: " 
+ partitionType);
+  }
+
+  // Generate the file groups
+  final int fileGroupCount = fileGroupCountAndRecordsPair.getKey();
+  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionType + " should be > 0");
+  initializeFileGroups(dataMetaClient, partitionType, 
commitTimeForPartition, fileGroupCount);

Review Comment:
   Let's add documentation as to why we need this. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1191726475


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -562,53 +532,144 @@ private  boolean 
isCommitRevertedByInFlightAction(
   /**
* Initialize the Metadata Table by listing files and partitions from the 
file system.
*
-   * @param dataMetaClient   - {@code HoodieTableMetaClient} for the 
dataset.
+   * @param initializationTime   - Timestamp to use for the commit
+   * @param partitionsToInit - List of MDT partitions to initialize
* @param inflightInstantTimestamp - Current action instant responsible for 
this initialization
*/
-  private boolean initializeFromFilesystem(HoodieTableMetaClient 
dataMetaClient,
+  private boolean initializeFromFilesystem(String initializationTime, 
List partitionsToInit,
Option 
inflightInstantTimestamp) throws IOException {
 if (anyPendingDataInstant(dataMetaClient, inflightInstantTimestamp)) {
   return false;
 }
 
-String createInstantTime = getInitialCommitInstantTime(dataMetaClient);
-
-initializeMetaClient(DEFAULT_METADATA_POPULATE_META_FIELDS);
-initTableMetadata();
-// if async metadata indexing is enabled,
-// then only initialize files partition as other partitions will be built 
using HoodieIndexer
-List enabledPartitionTypes =  new ArrayList<>();
-if (dataWriteConfig.isMetadataAsyncIndex()) {
-  enabledPartitionTypes.add(MetadataPartitionType.FILES);
-} else {
-  // all enabled ones should be initialized
-  enabledPartitionTypes = this.enabledPartitionTypes;
+// FILES partition is always initialized first
+
ValidationUtils.checkArgument(!partitionsToInit.contains(MetadataPartitionType.FILES)
+|| partitionsToInit.get(0).equals(MetadataPartitionType.FILES), 
"FILES partition should be initialized first: " + partitionsToInit);
+
+metadataMetaClient = initializeMetaClient();
+
+// Get a complete list of files and partitions from the file system or 
from already initialized FILES partition of MDT
+boolean filesPartitionAvailable = 
dataMetaClient.getTableConfig().isMetadataPartitionEnabled(MetadataPartitionType.FILES);
+List partitionInfoList = filesPartitionAvailable ? 
listAllPartitionsFromMDT(initializationTime) : 
listAllPartitionsFromFilesystem(initializationTime);
+Map> partitionToFilesMap = 
partitionInfoList.stream()
+.map(p -> {
+  String partitionName = 
HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
+  return Pair.of(partitionName, p.getFileNameToSizeMap());
+})
+.collect(Collectors.toMap(Pair::getKey, Pair::getValue));
+
+for (MetadataPartitionType partitionType : partitionsToInit) {
+  // Find the commit timestamp to use for this partition. Each 
initialization should use its own unique commit time.
+  String commitTimeForPartition = 
generateUniqueCommitInstantTime(initializationTime);
+
+  LOG.info("Initializing MDT partition " + partitionType + " at instant " 
+ commitTimeForPartition);
+
+  Pair> fileGroupCountAndRecordsPair;
+  switch (partitionType) {
+case FILES:
+  fileGroupCountAndRecordsPair = 
initializeFilesPartition(initializationTime, partitionInfoList);
+  break;
+case BLOOM_FILTERS:
+  fileGroupCountAndRecordsPair = 
initializeBloomFiltersPartition(initializationTime, partitionToFilesMap);
+  break;
+case COLUMN_STATS:
+  fileGroupCountAndRecordsPair = 
initializeColumnStatsPartition(partitionToFilesMap);
+  break;
+default:
+  throw new HoodieMetadataException("Unsupported MDT partition type: " 
+ partitionType);
+  }
+
+  // Generate the file groups
+  final int fileGroupCount = fileGroupCountAndRecordsPair.getKey();
+  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionType + " should be > 0");
+  initializeFileGroups(dataMetaClient, partitionType, 
commitTimeForPartition, fileGroupCount);
+
+  // Perform the commit using bulkCommit
+  HoodieData records = 
fileGroupCountAndRecordsPair.getValue();
+  bulkCommit(commitTimeForPartition, partitionType, records, 
fileGroupCount);
+  metadataMetaClient.reloadActiveTimeline();
+  dataMetaClient = 
HoodieTableMetadataUtil.setMetadataPartitionState(dataMetaClient, 
partitionType, true);
 }
-initializeEnabledFileGroups(dataMetaClient, createInstantTime, 
enabledPartitionTypes);
-initialCommit(createInstantTime, enabledPartitionTypes);
-updateInitializedPartitionsInTableConfig(enabledPartitionTypes);
+
 return true;
   }
 
-  private String getInitialCommitInstantTime(HoodieTableMetaClient 
dataMetaClient) {
-// If there is no 

[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1191719299


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieMetadataBulkInsertPartitioner.java:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+
+import scala.Tuple2;
+
+/**
+ * A {@code BulkInsertPartitioner} implementation for Metadata Table to 
improve performance of initialization of metadata
+ * table partition when a very large number of records are inserted.
+ *
+ * This partitioner requires the records to be already tagged with location.
+ */
+public class SparkHoodieMetadataBulkInsertPartitioner implements 
BulkInsertPartitioner> {
+  final int numPartitions;
+  public SparkHoodieMetadataBulkInsertPartitioner(int numPartitions) {
+this.numPartitions = numPartitions;
+  }
+
+  private class FileGroupPartitioner extends Partitioner {
+
+@Override
+public int getPartition(Object key) {
+  return ((Tuple2)key)._1;
+}
+
+@Override
+public int numPartitions() {
+  return numPartitions;
+}
+  }
+
+  // FileIDs for the various partitions
+  private List fileIDPfxs;
+
+  /**
+   * Partition the records by their location. The number of partitions is 
determined by the number of MDT fileGroups being udpated rather than the
+   * specific value of outputSparkPartitions.
+   */
+  @Override
+  public JavaRDD repartitionRecords(JavaRDD 
records, int outputSparkPartitions) {
+Comparator> keyComparator =
+(Comparator> & Serializable)(t1, t2) -> 
t1._2.compareTo(t2._2);
+
+// Partition the records by their file group
+JavaRDD partitionedRDD = records
+// key by . The file group index is 
used to partition and the record key is used to sort within the partition.
+.keyBy(r -> {
+  int fileGroupIndex = 
HoodieTableMetadataUtil.getFileGroupIndexFromFileId(r.getCurrentLocation().getFileId());
+  return new Tuple2<>(fileGroupIndex, r.getRecordKey());
+})
+.repartitionAndSortWithinPartitions(new FileGroupPartitioner(), 
keyComparator)
+.map(t -> t._2);
+
+fileIDPfxs = partitionedRDD.mapPartitions(recordItr -> {
+  // Due to partitioning, all records in the partition should have the same 
fileID. So we can only get the fileID prefix from the first record.
+  List<String> fileIds = new ArrayList<>(1);
+  if (recordItr.hasNext()) {
+HoodieRecord record = recordItr.next();
+final String fileID = 
HoodieTableMetadataUtil.getFileGroupPrefix(record.getCurrentLocation().getFileId());
+fileIds.add(fileID);
+  } else {
+// Empty partition
+fileIds.add("");

Review Comment:
   When there are Spark partitions that do not have any records in them, can we 
add docs?
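   
   For context, a minimal usage sketch of how this partitioner is driven (my own 
illustration, not code from this PR; the names 'MetadataBulkInsertExample', 
'taggedRecords' and 'numFileGroups' are assumptions):
   
   import org.apache.hudi.common.model.HoodieRecord;
   import org.apache.hudi.metadata.SparkHoodieMetadataBulkInsertPartitioner;
   import org.apache.spark.api.java.JavaRDD;
   
   class MetadataBulkInsertExample {
     // Illustrative helper: 'taggedRecords' are records already tagged with their MDT
     // file group location; 'numFileGroups' is the file group count of the MDT partition.
     static JavaRDD<HoodieRecord> groupByFileGroup(JavaRDD<HoodieRecord> taggedRecords, int numFileGroups) {
       SparkHoodieMetadataBulkInsertPartitioner partitioner =
           new SparkHoodieMetadataBulkInsertPartitioner(numFileGroups);
       // Each output Spark partition ends up holding the records of exactly one MDT file
       // group, sorted by record key. A file group that received no records still yields
       // a Spark partition, for which the partitioner records an empty ("") fileID prefix;
       // this is the empty-partition case referred to above.
       return partitioner.repartitionRecords(taggedRecords, numFileGroups);
     }
   }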



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8693: [DNM][HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8693:
URL: https://github.com/apache/hudi/pull/8693#issuecomment-1544721752

   
   ## CI report:
   
   * 9c6c5ff60fe6fbbf7c6484d91f81dadc68a46008 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17028)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544721574

   
   ## CI report:
   
   * d9c1b28f2fdf9fc5390ffe2f99ca16a02d616ab4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17023)
 
   * 3f4a740c6e9df40b04416e8c9632eec06487f76c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17027)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


amrishlal commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544717144

   > @amrishlal : can you rebase w/ latest master. there was a flaky test that 
was fixed
   
   Merged latest changes from master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8693: [DNM][HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8693:
URL: https://github.com/apache/hudi/pull/8693#issuecomment-1544711988

   
   ## CI report:
   
   * 9c6c5ff60fe6fbbf7c6484d91f81dadc68a46008 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544711755

   
   ## CI report:
   
   * d9c1b28f2fdf9fc5390ffe2f99ca16a02d616ab4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17023)
 
   * 3f4a740c6e9df40b04416e8c9632eec06487f76c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8666:
URL: https://github.com/apache/hudi/pull/8666#issuecomment-1544703628

   
   ## CI report:
   
   * 1b2f28447ac507b35f82a0534ebd958a8fd8980d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17024)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8692: [HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8692:
URL: https://github.com/apache/hudi/pull/8692#issuecomment-1544703757

   
   ## CI report:
   
   * 1c61adbdb5aea908ddc4c981fa871988f3764983 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17026)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua opened a new pull request, #8693: [DNM][HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


yihua opened a new pull request, #8693:
URL: https://github.com/apache/hudi/pull/8693

   ### Change Logs
   
   Testing Spark 3.3.2 only.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8692: [HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8692:
URL: https://github.com/apache/hudi/pull/8692#issuecomment-1544654505

   
   ## CI report:
   
   * 1c61adbdb5aea908ddc4c981fa871988f3764983 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6204) Add Spark 3.3.2 in bundle validation

2023-05-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6204:
-
Labels: pull-request-available  (was: )

> Add Spark 3.3.2 in bundle validation
> 
>
> Key: HUDI-6204
> URL: https://issues.apache.org/jira/browse/HUDI-6204
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Validate bundles with Spark 3.3.2 runtime in GH actions to make sure Hudi 
> bundles works.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua opened a new pull request, #8692: [HUDI-6204] Add bundle validation on Spark 3.3.2

2023-05-11 Thread via GitHub


yihua opened a new pull request, #8692:
URL: https://github.com/apache/hudi/pull/8692

   ### Change Logs
   
   This PR adds the bundle validation on Spark 3.3.2 in Github Java CI to 
ensure compatibility after we fixed the compatibility issue in #8082.
   
   ### Impact
   
   Ensures Hudi works on Spark 3.3.2.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-11 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1544643609

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003)
 
   * ce89b12639ebe78146afcd2f9c95d646226f1127 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17025)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6204) Add Spark 3.3.2 in bundle validation

2023-05-11 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6204:
---

Assignee: Ethan Guo

> Add Spark 3.3.2 in bundle validation
> 
>
> Key: HUDI-6204
> URL: https://issues.apache.org/jira/browse/HUDI-6204
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6204) Add Spark 3.3.2 in bundle validation

2023-05-11 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6204:

Description: Validate bundles with Spark 3.3.2 runtime in GH actions to 
make sure Hudi bundles works.

> Add Spark 3.3.2 in bundle validation
> 
>
> Key: HUDI-6204
> URL: https://issues.apache.org/jira/browse/HUDI-6204
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.13.1
>
>
> Validate bundles with Spark 3.3.2 runtime in GH actions to make sure Hudi 
> bundles works.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6204) Add Spark 3.3.2 in bundle validation

2023-05-11 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6204:

Fix Version/s: 0.13.1

> Add Spark 3.3.2 in bundle validation
> 
>
> Key: HUDI-6204
> URL: https://issues.apache.org/jira/browse/HUDI-6204
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6204) Add Spark 3.3.2 in bundle validation

2023-05-11 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6204:
---

 Summary: Add Spark 3.3.2 in bundle validation
 Key: HUDI-6204
 URL: https://issues.apache.org/jira/browse/HUDI-6204
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


nsivabalan commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544612061

   @amrishlal : can you rebase w/ latest master. there was a flaky test that 
was fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ankitchandnani commented on issue #8672: [SUPPORT] INSERT_OVERWRITE_TABLE operation not working on Hudi 0.12.2 using EMR Deltastreamer

2023-05-11 Thread via GitHub


ankitchandnani commented on issue #8672:
URL: https://github.com/apache/hudi/issues/8672#issuecomment-1544593450

   Hi @ad1happy2go, any update on the above? This is urgent to implement on my 
side. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544562903

   
   ## CI report:
   
   * d9c1b28f2fdf9fc5390ffe2f99ca16a02d616ab4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17023)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7998: [HUDI-5824] Fix: do not combine if write operation is Upsert and COMBINE_BEFORE_UPSERT is false

2023-05-11 Thread via GitHub


hudi-bot commented on PR #7998:
URL: https://github.com/apache/hudi/pull/7998#issuecomment-1544561469

   
   ## CI report:
   
   * 27d61f01fb6709e3aaa08de9ace7738dbedffb24 UNKNOWN
   * b572d737ef10724f71642084c0edf9a9a26540cc UNKNOWN
   * a95196cb0c749c1e1e8fb245a2a58d429159d519 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17003)
 
   * ce89b12639ebe78146afcd2f9c95d646226f1127 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


kazdy commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191592950


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 

Review Comment:
   I cannot 

[GitHub] [hudi] kazdy commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


kazdy commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191580728


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for Hudi 

[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8666:
URL: https://github.com/apache/hudi/pull/8666#issuecomment-1544499232

   
   ## CI report:
   
   * 840cfab05cacd7a4862b2ad9a1983af2953819a1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17016)
 
   * 1b2f28447ac507b35f82a0534ebd958a8fd8980d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17024)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kazdy commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


kazdy commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191551442


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 

Review Comment:
   If there are 

[GitHub] [hudi] hudi-bot commented on pull request #8666: [HUDI-915] Add missing partititonpath to records COW

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8666:
URL: https://github.com/apache/hudi/pull/8666#issuecomment-1544489175

   
   ## CI report:
   
   * 840cfab05cacd7a4862b2ad9a1983af2953819a1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17016)
 
   * 1b2f28447ac507b35f82a0534ebd958a8fd8980d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1191243539


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -873,17 +908,7 @@ public void buildMetadataPartitions(HoodieEngineContext 
engineContext, List<HoodieIndexPartitionInfo> indexPartitionInfos) {
   String relativePartitionPath = 
indexPartitionInfo.getMetadataPartitionPath();
   LOG.info(String.format("Creating a new metadata index for partition '%s' 
under path %s upto instant %s",
-  relativePartitionPath, metadataWriteConfig.getBasePath(), 
indexUptoInstantTime));
-  try {
-// file group should have already been initialized while scheduling 
index for this partition
-if (!dataMetaClient.getFs().exists(new 
Path(metadataWriteConfig.getBasePath(), relativePartitionPath))) {

Review Comment:
   Can you point me to the code where we handle partial initialization failure? 
I guess this code was handling that. I assume we handle it elsewhere in this 
patch and hence it has been removed here.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1097,87 +1165,76 @@ protected void cleanIfNecessary(BaseHoodieWriteClient 
writeClient, String instantTime) {
 // Trigger cleaning with suffixes based on the same instant time. This 
ensures that any future
 // delta commits synced over will not have an instant time lesser than the 
last completed instant on the
 // metadata table.
-writeClient.clean(instantTime + "002");
+
writeClient.clean(HoodieTableMetadataUtil.createCleanTimestamp(instantTime));
 writeClient.lazyRollbackFailedIndexing();
   }
 
   /**
-   * This is invoked to initialize metadata table for a dataset.
-   * Initial commit has special handling mechanism due to its scale compared 
to other regular commits.
-   * During cold startup, the list of files to be committed can be huge.
-   * So creating a HoodieCommitMetadata out of these large number of files,
-   * and calling the existing update(HoodieCommitMetadata) function does not 
scale well.
-   * Hence, we have a special commit just for the initialization scenario.
+   * Validates the timeline for both main and metadata tables.
*/
-  private void initialCommit(String createInstantTime, 
List<MetadataPartitionType> partitionTypes) {
-// List all partitions in the basePath of the containing dataset
-LOG.info("Initializing metadata table by using file listings in " + 
dataWriteConfig.getBasePath());
-engineContext.setJobStatus(this.getClass().getSimpleName(), "Initializing 
metadata table by listing files and partitions: " + 
dataWriteConfig.getTableName());
-
-Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap 
= new HashMap<>();
-
-// skip file system listing to populate metadata records if it's a fresh 
table.
-// this is applicable only if the table already has N commits and metadata 
is enabled at a later point in time.
-if (createInstantTime.equals(SOLO_COMMIT_TIMESTAMP)) { // 
SOLO_COMMIT_TIMESTAMP will be the initial commit time in MDT for a fresh table.
-  // If not, last completed commit in data table will be chosen as the 
initial commit time.
-  LOG.info("Triggering empty Commit to metadata to initialize");
-} else {
-  List partitionInfoList = 
listAllPartitions(dataMetaClient);
-  Map<String, Map<String, Long>> partitionToFilesMap = 
partitionInfoList.stream()
-  .map(p -> {
-String partitionName = 
HoodieTableMetadataUtil.getPartitionIdentifier(p.getRelativePath());
-return Pair.of(partitionName, p.getFileNameToSizeMap());
-  })
-  .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
-
-  int totalDataFilesCount = 
partitionToFilesMap.values().stream().mapToInt(Map::size).sum();
-  List<String> partitions = new ArrayList<>(partitionToFilesMap.keySet());
-
-  if (partitionTypes.contains(MetadataPartitionType.FILES)) {
-// Record which saves the list of all partitions
-HoodieRecord allPartitionRecord = 
HoodieMetadataPayload.createPartitionListRecord(partitions);
-HoodieData<HoodieRecord> filesPartitionRecords = 
getFilesPartitionRecords(createInstantTime, partitionInfoList, 
allPartitionRecord);
-ValidationUtils.checkState(filesPartitionRecords.count() == 
(partitions.size() + 1));
-partitionToRecordsMap.put(MetadataPartitionType.FILES, 
filesPartitionRecords);
-  }
-
-  if (partitionTypes.contains(MetadataPartitionType.BLOOM_FILTERS) && 
totalDataFilesCount > 0) {
-final HoodieData<HoodieRecord> recordsRDD = 
HoodieTableMetadataUtil.convertFilesToBloomFilterRecords(
-engineContext, Collections.emptyMap(), partitionToFilesMap, 
getRecordsGenerationParams(), createInstantTime);
-partitionToRecordsMap.put(MetadataPartitionType.BLOOM_FILTERS, 
recordsRDD);
-  }
-
-  if (partitionTypes.contains(MetadataPartitionType.COLUMN_STATS) && 
totalDataFilesCount > 0) {
-final 

[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


the-other-tim-brown commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191526949


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java:
##
@@ -93,9 +103,13 @@ public List<String> getTransformersNames() {
   @Override
   public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, 
Dataset<Row> rowDataset, TypedProperties properties) {
 Dataset<Row> dataset = rowDataset;
+Option<Schema> incomingSchemaOpt = sourceSchemaOpt;
 for (TransformerInfo transformerInfo : transformers) {
   Transformer transformer = transformerInfo.getTransformer();
   dataset = transformer.apply(jsc, sparkSession, dataset, 
transformerInfo.getProperties(properties));

Review Comment:
   just a reminder from our discussion, this will likely throw a logical plan 
error and not enter the section below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


the-other-tim-brown commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191525650


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
  Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, 
Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession 
sparkSession, Schema incomingSchema, TypedProperties properties) {
+return Option.empty();

Review Comment:
   Another note on Avro vs StructType: there are subtle differences between the 
two, and some struct types, like non-string map keys, cannot be covered by 
Avro schemas. Since this is an intermediate state, I think we should 
be using the schema format that represents this state, so we don't get into 
any edge cases where the user cannot define the proper schema of their row 
before and after the transform.
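   
   To make this concrete, here is a minimal sketch of a transformer implementing 
the new hook (my own illustration, not code from this PR; the class name and the 
"event_id" column are assumptions):
   
   import org.apache.avro.Schema;
   import org.apache.hudi.common.config.TypedProperties;
   import org.apache.hudi.common.util.Option;
   import org.apache.hudi.utilities.transform.Transformer;
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   // Hypothetical transformer that only filters rows, so its output schema equals
   // its input schema and the incoming Avro schema can be passed straight through.
   public class FilterNullEventIdTransformer implements Transformer {
   
     @Override
     public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                               Dataset<Row> rowDataset, TypedProperties properties) {
       // Drop rows whose assumed key column is null; the row schema is unchanged.
       return rowDataset.filter(rowDataset.col("event_id").isNotNull());
     }
   
     @Override
     public Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession,
                                             Schema incomingSchema, TypedProperties properties) {
       // A pure filter does not alter the schema, so declare the output schema to be
       // the incoming schema for the chained schema validation to use.
       return Option.of(incomingSchema);
     }
   }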



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191524810


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 
+relational data model for 

[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191522370


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 
+to speak the warehouse users' and data engineers' languages, respectively), we 
made some conservative choices based on the ecosystem at that time. However, 
revisiting those choices is important to see if they still hold up.
+
+*   **Deep Query Engine Integrations:** Back then, query engines like Presto, 
Spark, Trino and Hive were getting good at queries on columnar data files but 
painfully hard to integrate into. Over time, we expected clear API abstractions 
+around indexing/metadata/table snapshots in the parquet/orc read paths that a 
project like Hudi can tap into to easily leverage innovations like 
Velox/PrestoDB. However, most engines preferred a separate integration - 
leading to Hudi maintaining its own Spark Datasource, 
+Presto and Trino connectors. However, this now opens up the opportunity to 
fully leverage Hudi's multi-modal indexing capabilities during query planning 
and execution.
+*   **Generalized Data Model:** While Hudi supported keys, we focused on 
updating Hudi tables as if they were a key-value store, while SQL queries ran 
on top, blissfully unchanged and unaware. Back then, generalizing the support 
for 
+keys felt premature based on where the ecosystem was, which was still doing 
large batch M/R jobs. Today, more performant, advanced engines like Apache 
Spark and Apache Flink have mature extensible SQL support that can support a 
generalized, 

Review Comment:
   @yihua 

[GitHub] [hudi] SteNicholas commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


SteNicholas commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191519037


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 

Review Comment:
   @yihua, IMO, the proposal in the above discussion is only about making the Flink writer work in a streaming fashion. But the streaming lakehouse is mainly about end-to-end streaming reads and writes, so that data in the lake is truly processed as a stream.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6203) Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread Amrish Lal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrish Lal updated HUDI-6203:
-
Description: 
Provide file size stats for the latest updates that hudi is consuming. These
stats are at table level by default, but specifying --enable-partition-stats
will also show stats at the partition level. If a start date (--start-date
parameter) and/or end date (--end-date parameter) are specified, stats are
based on files that were modified in the half-open interval [start date
(--start-date parameter), end date (--end-date parameter)). The --num-days
parameter can be used to select data files over the last --num-days. If
--start-date is specified, --num-days will be ignored. If none of the date
parameters are set, stats will be computed over all data files of all
partitions in the table. Note that date filtering is carried out only if the
partition name has the format '[column name=]-M-d' or '[column
name=]/M/d'.
The following stats are produced by this class:
 * Number of files.
 * Total table size.
 * Minimum file size
 * Maximum file size
 * Average file size
 * Median file size
 * p50 file size
 * p90 file size
 * p95 file size
 * p99 file size

  was:
Provide file size stats for the latest updates that hudi is consuming. These 
stats are at table level by default, but specifying --enable-partition-stats 
will also show stats at the partition level. If a start date (--start-date 
parameter) and/or end date (--end-date parameter) are specified, stats are 
based on files that were modified in the half-open interval [start date 
(--start-date parameter), end date (--end-date parameter)). --num-days 
parameter can be used to select data files over last --num-days. If 
--start-date is specified, --num-days will be ignored. If none of the date 
parameters are set, stats will be computed over all data files of all 
partitions in the table. Note that date filtering is carried out only if the 
partition name has the format '[column name=]-M-d', '[column 
name=]/M/d'.
The following stats are produced by this class:
 * Number of files.
 * Total table size.
 * Minimum file size
 * Maximum file size
 * Average file size
 * Median file size
 * p50 file size
 * p90 file size
 * p95 file size
 * p99 file size


> Add support to standalone utility tool to fetch file size stats for a given 
> table w/ optional partition filters
> ---
>
> Key: HUDI-6203
> URL: https://issues.apache.org/jira/browse/HUDI-6203
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Amrish Lal
>Priority: Major
>
> Provide file size stats for the latest updates that hudi is consuming. These
> stats are at table level by default, but specifying --enable-partition-stats
> will also show stats at the partition level. If a start date (--start-date
> parameter) and/or end date (--end-date parameter) are specified, stats are
> based on files that were modified in the half-open interval [start date
> (--start-date parameter), end date (--end-date parameter)). The --num-days
> parameter can be used to select data files over the last --num-days. If
> --start-date is specified, --num-days will be ignored. If none of the date
> parameters are set, stats will be computed over all data files of all
> partitions in the table. Note that date filtering is carried out only if the
> partition name has the format '[column name=]-M-d' or '[column name=]/M/d'.
> The following stats are produced by this class:
>  * Number of files.
>  * Total table size.
>  * Minimum file size
>  * Maximum file size
>  * Average file size
>  * Median file size
>  * p50 file size
>  * p90 file size
>  * p95 file size
>  * p99 file size



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6203) Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread Amrish Lal (Jira)
Amrish Lal created HUDI-6203:


 Summary: Add support to standalone utility tool to fetch file size 
stats for a given table w/ optional partition filters
 Key: HUDI-6203
 URL: https://issues.apache.org/jira/browse/HUDI-6203
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Amrish Lal


Provide file size stats for the latest updates that hudi is consuming. These 
stats are at table level by default, but specifying --enable-partition-stats 
will also show stats at the partition level. If a start date (--start-date 
parameter) and/or end date (--end-date parameter) are specified, stats are 
based on files that were modified in the half-open interval [start date 
(--start-date parameter), end date (--end-date parameter)). --num-days 
parameter can be used to select data files over last --num-days. If 
--start-date is specified, --num-days will be ignored. If none of the date 
parameters are set, stats will be computed over all data files of all 
partitions in the table. Note that date filtering is carried out only if the 
partition name has the format '[column name=]-M-d', '[column 
name=]/M/d'.
The following stats are produced by this class:
 * Number of files.
 * Total table size.
 * Minimum file size
 * Maximum file size
 * Average file size
 * Median file size
 * p50 file size
 * p90 file size
 * p95 file size
 * p99 file size
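
To make the percentile stats listed above concrete, here is a small, self-contained
sketch (not the tool's actual code) using the Dropwizard/Codahale Histogram and
UniformReservoir classes that the accompanying utility imports; the file sizes below
are made-up values:

```java
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.UniformReservoir;

public class FileSizeStatsSketch {
  public static void main(String[] args) {
    // Made-up base file sizes in bytes; the real tool would read them from the table's file system view.
    long[] fileSizes = {1_048_576L, 4_194_304L, 16_777_216L, 134_217_728L};
    Histogram histogram = new Histogram(new UniformReservoir());
    long totalSize = 0L;
    for (long size : fileSizes) {
      histogram.update(size);
      totalSize += size;
    }
    Snapshot snapshot = histogram.getSnapshot();
    System.out.println("Number of files : " + histogram.getCount());
    System.out.println("Total table size: " + totalSize);
    System.out.println("Minimum         : " + snapshot.getMin());
    System.out.println("Maximum         : " + snapshot.getMax());
    System.out.println("Average         : " + snapshot.getMean());
    System.out.println("Median / p50    : " + snapshot.getMedian());
    System.out.println("p90             : " + snapshot.getValue(0.90));
    System.out.println("p95             : " + snapshot.get95thPercentile());
    System.out.println("p99             : " + snapshot.get99thPercentile());
  }
}
```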



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


the-other-tim-brown commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191508388


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
+    return Option.empty();

Review Comment:
   I think this puts a large burden on the user to implement these schemas when 
there is an easier way to get a signal as to whether the combination of 
transformers is valid. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


lokeshj1703 commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191506356


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
+    return Option.empty();

Review Comment:
   The default would then be the same as the transformed dataset schema, and schema validation would be successful by default. The idea is for the API to provide an expected schema; we then verify against it and fail if it is not provided.
   
   Since the schema provider provides the schema in Avro format, we are using an Avro schema in the API, just for consistency for the user.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java:
##
@@ -45,4 +47,9 @@ public interface Transformer {
*/
   @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties);
+
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  default Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) {
+    return Option.empty();

Review Comment:
   The default would then be the same as the transformed dataset schema, and schema validation would be successful by default. The idea is for the API to provide an expected schema; we then verify against it, or fail if it is not provided.
   
   Since the schema provider provides the schema in Avro format, we are using an Avro schema in the API, just for consistency for the user.
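
   To make the proposal concrete, here is a minimal sketch of what an implementation of the proposed `transformedSchema` hook could look like. The class name, the added `source_system` column, and the default-value handling are illustrative assumptions, not code from this PR:

   ```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

// Hypothetical transformer: adds one nullable string column and declares the widened schema.
public class AddSourceColumnTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // The data-level transformation: append a constant column.
    return rowDataset.withColumn("source_system", functions.lit("kafka"));
  }

  @Override
  public Option<Schema> transformedSchema(JavaSparkContext jsc, SparkSession sparkSession,
                                          Schema incomingSchema, TypedProperties properties) {
    // Copy the incoming fields (Avro field objects cannot be reused across records)...
    List<Schema.Field> fields = incomingSchema.getFields().stream()
        .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
        .collect(Collectors.toList());
    // ...and append the new nullable column, so validation can compare this expected
    // schema against the schema of the actually transformed dataset.
    fields.add(new Schema.Field(
        "source_system",
        Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING))),
        "added by transformer",
        JsonProperties.NULL_VALUE));
    return Option.of(Schema.createRecord(
        incomingSchema.getName(), incomingSchema.getDoc(), incomingSchema.getNamespace(), false, fields));
  }
}
   ```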



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544418734

   
   ## CI report:
   
   * 9e5f2984e3f00bb24e1749922458072163a0df70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17014)
 
   * d9c1b28f2fdf9fc5390ffe2f99ca16a02d616ab4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17023)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


the-other-tim-brown commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191501718


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##
@@ -285,7 +285,8 @@ public DeltaSync(HoodieDeltaStreamer.Config cfg, 
SparkSession sparkSession, Sche
 // Register User Provided schema first
 registerAvroSchemas(schemaProvider);
 
-this.transformer = 
UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames));
+this.transformer = 
UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames),

Review Comment:
   We'll never factor these new fields into the validation. Is that ok?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8574:
URL: https://github.com/apache/hudi/pull/8574#discussion_r1191499847


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##
@@ -285,7 +285,8 @@ public DeltaSync(HoodieDeltaStreamer.Config cfg, 
SparkSession sparkSession, Sche
 // Register User Provided schema first
 registerAvroSchemas(schemaProvider);
 
-this.transformer = 
UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames));
+this.transformer = 
UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames),

Review Comment:
   out of the box, our schema should only get wider and never narrower. So, 
from a schema (not data) standpoint, a field should never get dropped. 
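
   As a small illustration of that invariant (not code from this PR), a check that the transformed schema only widens the incoming one, i.e. that no incoming field gets dropped while new fields are still allowed, could look like this:

   ```java
import org.apache.avro.Schema;

// Illustrative helper: every field of the incoming schema must still be present
// in the transformed schema; additional fields are fine.
public final class SchemaWideningCheck {
  private SchemaWideningCheck() {
  }

  public static boolean noFieldDropped(Schema incomingSchema, Schema transformedSchema) {
    return incomingSchema.getFields().stream()
        .allMatch(f -> transformedSchema.getField(f.name()) != null);
  }
}
   ```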
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


hudi-bot commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544407140

   
   ## CI report:
   
   * 9e5f2984e3f00bb24e1749922458072163a0df70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17014)
 
   * d9c1b28f2fdf9fc5390ffe2f99ca16a02d616ab4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (e331141232c -> 2c2abaf14bd)

2023-05-11 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e331141232c [MINOR] Added docs on gotchas when using 
PartialUpdateAvroPayload (#8579)
 add 2c2abaf14bd [HUDI-6180] Use ConfigProperty for Timestamp keygen 
configs (#8643)

No new revisions were added by this update.

Summary of changes:
 .../keygen/TimestampBasedAvroKeyGenerator.java | 22 +++--
 .../keygen/parser/BaseHoodieDateTimeParser.java| 10 ++-
 .../hudi/keygen/parser/HoodieDateTimeParser.java   | 37 +---
 .../apache/hudi/common/config/ConfigGroups.java|  5 ++
 .../common/config/TimestampKeyGeneratorConfig.java | 99 ++
 .../hudi/common/table/HoodieTableConfig.java   | 22 +++--
 .../hudi/keygen/constant/KeyGeneratorOptions.java  | 41 +
 .../org/apache/hudi/table/HoodieTableFactory.java  | 26 +++---
 .../apache/hudi/table/TestHoodieTableFactory.java  | 10 ++-
 .../scala/org/apache/hudi/HoodieFileIndex.scala| 14 ++-
 .../keygen/TestTimestampBasedKeyGenerator.java | 39 +
 .../org/apache/hudi/TestHoodieFileIndex.scala  |  8 +-
 .../apache/hudi/functional/TestCOWDataSource.scala | 43 +-
 .../hudi/functional/TestCOWDataSourceStorage.scala |  8 +-
 .../apache/hudi/functional/TestMORDataSource.scala | 10 +--
 15 files changed, 274 insertions(+), 120 deletions(-)
 create mode 100644 
hudi-common/src/main/java/org/apache/hudi/common/config/TimestampKeyGeneratorConfig.java



[GitHub] [hudi] yihua merged pull request #8643: [HUDI-6180] Use ConfigProperty for Timestamp keygen configs

2023-05-11 Thread via GitHub


yihua merged PR #8643:
URL: https://github.com/apache/hudi/pull/8643


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


amrishlal commented on PR #8645:
URL: https://github.com/apache/hudi/pull/8645#issuecomment-1544387998

   > can you attach some sample output
   
   Sample output attached in description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1191480934


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] amrishlal commented on a diff in pull request #8645: [HUDI-6193] Add support to standalone utility tool to fetch file size stats for a given table w/ optional partition filters

2023-05-11 Thread via GitHub


amrishlal commented on code in PR #8645:
URL: https://github.com/apache/hudi/pull/8645#discussion_r1190630713


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/TableSizeStats.java:
##
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Snapshot;
+import com.codahale.metrics.UniformReservoir;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.time.format.DateTimeFormatterBuilder;
+import java.time.format.DateTimeParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+
+/**
+ * Calculate and output file size stats of data files that were modified in 
the half-open interval [start date (--start-date parameter),
+ * end date (--end-date parameter)). --num-days parameter can be used to 
select data files over last --num-days. If --start-date is
+ * specified, --num-days will be ignored. If none of the date parameters are 
set, stats will be computed over all data files of all
+ * partitions in the table. Note that date filtering is carried out only if 
the partition name has the format '[column name=]-M-d',
+ * '[column name=]/M/d'. By default, only table level file size stats are 
printed. If --partition-status option is used, partition
+ * level file size stats also get printed.
+ * 
+ * The following stats are calculated:
+ * Number of files.
+ * Total table size.
+ * Minimum file size
+ * Maximum file size
+ * Average file size
+ * Median file size
+ * p50 file size
+ * p90 file size
+ * p95 file size
+ * p99 file size
+ * 
+ * Sample spark-submit command:
+ * ./bin/spark-submit \
+ * --class org.apache.hudi.utilities.TableSizeStats \
+ * 
$HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar
 \
+ * --base-path  \
+ * --num-days 
+ */
+public class TableSizeStats implements Serializable {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableSizeStats.class);
+
+  // Date formatter for parsing partition dates (example: 2023/5/5/ or 
2023-5-5).
+  private static final DateTimeFormatter DATE_FORMATTER =
+  (new 
DateTimeFormatterBuilder()).appendOptional(DateTimeFormatter.ofPattern("/M/d")).appendOptional(DateTimeFormatter.ofPattern("-M-d")).toFormatter();
+
+  // File size stats will be displayed in the units specified below.
+  private static final String[] FILE_SIZE_UNITS = {"B", "KB", "MB", "GB", 
"TB"};
+
+  // Spark context
+  private transient JavaSparkContext jsc;
+  // config
+  private Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;

[GitHub] [hudi] danfran opened a new issue, #8691: [SUPPORT] Testing Apache Hudi with Glue Image and LocalStack

2023-05-11 Thread via GitHub


danfran opened a new issue, #8691:
URL: https://github.com/apache/hudi/issues/8691

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I am trying to run some tests in Docker using the image 
`amazon/aws-glue-libs:glue_libs_4.0.0_image_01` and `localstack` for AWS 
environment. No matter what, every time the tests run, Hudi tries to connect to 
the remote AWS instead of pointing to LocalStack. This is the configuration I 
am using at the moment:
   
   ```
   packages = [
   '/home/glue_user/spark/jars/spark-avro_2.12-3.3.0-amzn-1.jar',
   
'/home/glue_user/aws-glue-libs/datalake-connectors/hudi-0.12.1/hudi-spark3-bundle_2.12-0.12.1.jar',
   '/home/glue_user/aws-glue-libs/jars/aws-java-sdk-1.12.128.jar',
   '/home/glue_user/aws-glue-libs/jars/aws-java-sdk-glue-1.12.128.jar',
   '/home/glue_user/spark/jars/hadoop-aws-3.3.3-amzn-0.jar',
   ]
   
   conf = SparkConf() \
   .set('spark.jars', ','.join(packages))\
   .set('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer')\
   .set('spark.sql.catalog.spark_catalog', 
'org.apache.spark.sql.hudi.catalog.HoodieCatalog')\
   .set('spark.sql.extensions', 
'org.apache.spark.sql.hudi.HoodieSparkSessionExtension')
   
   spark_context = SparkContext(conf=conf)
   glue_context = GlueContext(spark_context)
   spark_session = glue_context.spark_session
   
   # HUDI S3 ACCESS
   spark_session.conf.set('fs.defaultFS', 's3://mybucket')
   spark_session.conf.set('fs.s3.awsAccessKeyId', 'test')
   spark_session.conf.set('fs.s3.awsSecretAccessKey', 'test')
   spark_session.conf.set('fs.s3a.awsAccessKeyId', 'test')
   spark_session.conf.set('fs.s3a.awsSecretAccessKey', 'test')
   spark_session.conf.set('fs.s3a.endpoint', 'http://localstack:4566')
   spark_session.conf.set('fs.s3a.connection.ssl.enabled', 'false')
   spark_session.conf.set('fs.s3a.path.style.access', 'true')
   spark_session.conf.set('fs.s3a.signing-algorithm', 'S3SignerType')
   
spark_session.conf.set('spark.sql.legacy.setCommandRejectsSparkCoreConfs', 
'false')
   
   # SPARK CONF
   spark_session.conf.set('spark.sql.shuffle.partitions', '2')
   spark_session.conf.set('spark.sql.crossJoin.enabled', 'true')
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. 
   2. 
   3. 
   4.
   5.
   6.
   
   **Expected behavior**
   
   How can I make it point to my local environment (http://localstack:4566) instead of the remote AWS endpoint?
   
   **Environment Description**
   
   * Hudi version : 0.12
   
   * Spark version : 3.3.0
   
   * Hive version :
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   ```
   An error occurred while calling o1719.save.
   : java.nio.file.AccessDeniedException: 
s3://mybucket/myzone/location/.hoodie: getFileStatus on 
s3://mybucket/myzone/location/.hoodie: 
com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B1ZJ8JDPY2HX514F; 
S3 Extended Request ID: 
rzBqoLQxJb4PSKNW+uCbyVCbqYtpCB0aFHvX7JWTCDJ/PTfQdgESAkOzxWR6aPua8OhuEcajIM8=; 
Proxy: null), S3 Extended Request ID: 
rzBqoLQxJb4PSKNW+uCbyVCbqYtpCB0aFHvX7JUTVIJ/PTfQdgEHNkOzxWR6aPua8OhuEcajIM8=:403
 Forbidden
   ```
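   
   One thing worth checking (an assumption, not a confirmed fix): `fs.s3a.*` options are read from the Hadoop configuration, so setting them on the SQL session conf after startup may have no effect. They can instead be passed with the `spark.hadoop.` prefix when the context is created; a minimal Java `SparkConf` sketch is shown below, and the same keys apply from PySpark or via `spark-submit --conf`:
   
   ```java
import org.apache.spark.SparkConf;

public class LocalStackS3aConfSketch {
  public static SparkConf localstackConf() {
    // Illustrative only: route the Hadoop S3A connector at a LocalStack endpoint
    // using the standard fs.s3a.* property names.
    return new SparkConf()
        .set("spark.hadoop.fs.s3a.endpoint", "http://localstack:4566")
        .set("spark.hadoop.fs.s3a.access.key", "test")
        .set("spark.hadoop.fs.s3a.secret.key", "test")
        .set("spark.hadoop.fs.s3a.path.style.access", "true")
        .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false");
  }
}
   ```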
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Sam-Serpoosh commented on issue #8519: [SUPPORT] Deltastreamer AvroDeserializer failing with java.lang.NullPointerException

2023-05-11 Thread via GitHub


Sam-Serpoosh commented on issue #8519:
URL: https://github.com/apache/hudi/issues/8519#issuecomment-1544320455

   @the-other-tim-brown This is done via:
   
   - Postgres (in AWS RDS) `11.16`
   - Debezium `2.1.2.Final`
   
   While we're at it, these `created_at` and `event_ts` fields are worrisome as well, **maybe**, since they don't seem to be interpreted as simple/primitive `long` types?
   
   ```json
   {
 ...
 "after": {
   "..samser_customers.Value": {
 "id": 1,
 "name": "Bob",
 "age": 40,
 "created_at": {
   "long": 1683661733071814
 },
 "event_ts": {
   "long": 168198480
 }
   }
 },
 ...
   }
   ```
   
   Their corresponding schema portion:
   
   ```json
   {
 ...
 {
   "name": "created_at",
   "type": [
 "null",
 {
   "type": "long",
   "connect.version": 1,
   "connect.name": "io.debezium.time.MicroTimestamp"
 }
   ],
   "default": null
 },
 {
   "name": "event_ts",
   "type": [
 "null",
 "long"
   ],
   "default": null
 }
 ...
   }
   ```
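   
   The nested `{"long": ...}` wrapper is just Avro's JSON encoding of a union branch, so the values are still plain longs; `created_at` is tagged as `io.debezium.time.MicroTimestamp`, i.e. microseconds since the epoch. A small conversion sketch (value copied from the payload above):
   
   ```java
import java.time.Instant;

public class MicroTimestampSketch {
  public static void main(String[] args) {
    // io.debezium.time.MicroTimestamp = microseconds since the Unix epoch.
    long createdAtMicros = 1683661733071814L;
    Instant instant = Instant.ofEpochSecond(
        createdAtMicros / 1_000_000L, (createdAtMicros % 1_000_000L) * 1_000L);
    System.out.println(instant); // an Instant in May 2023
  }
}
   ```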


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8675: [HUDI-6195] Test-cover different payload classes when use HoodieMergedReadHandle

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8675:
URL: https://github.com/apache/hudi/pull/8675#discussion_r1191341277


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieAdaptablePayloadDataGenerator.java:
##
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.AWSDmsAvroPayload;
+import org.apache.hudi.common.model.DefaultHoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieAvroIndexedRecord;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.MetadataValues;
+import org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.common.model.PartialUpdateAvroPayload;
+import org.apache.hudi.common.model.debezium.DebeziumConstants;
+import org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload;
+import org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+import static 
org.apache.hudi.common.model.HoodieRecord.HOODIE_IS_DELETED_FIELD;
+import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
+
+public class HoodieAdaptablePayloadDataGenerator {
+
+  public static final Schema SCHEMA = 
SchemaTestUtil.getSchemaFromResource(HoodieAdaptablePayloadDataGenerator.class, 
"/adaptable-payload.avsc");
+  public static final Schema SCHEMA_WITH_METAFIELDS = 
HoodieAvroUtils.addMetadataFields(SCHEMA, false);
+  public static final String SCHEMA_STR = SCHEMA.toString();
+
+  public static Properties getKeyGenProps(Class payloadClass) {
+String orderingField = new RecordGen(payloadClass).getOrderingField();
+Properties props = new Properties();
+props.put("hoodie.datasource.write.recordkey.field", "id");
+props.put("hoodie.datasource.write.partitionpath.field", "pt");
+props.put("hoodie.datasource.write.precombine.field", orderingField);
+props.put(HoodieTableConfig.RECORDKEY_FIELDS.key(), "id");
+props.put(HoodieTableConfig.PARTITION_FIELDS.key(), "pt");
+props.put(HoodieTableConfig.PRECOMBINE_FIELD.key(), orderingField);
+return props;
+  }
+
+  public static Properties getPayloadProps(Class payloadClass) {
+String orderingField = new RecordGen(payloadClass).getOrderingField();
+Properties props = new Properties();
+props.put("hoodie.compaction.payload.class", payloadClass.getName());
+props.put("hoodie.payload.event.time.field", orderingField);
+props.put("hoodie.payload.ordering.field", orderingField);
+return props;
+  }
+
+  public static List getInserts(int n, String partition, long 
ts, Class payloadClass) throws IOException {
+return getInserts(n, new String[] {partition}, ts, payloadClass);
+  }
+
+  public static List getInserts(int n, String[] partitions, long 
ts, Class payloadClass) throws IOException {
+List inserts = new ArrayList<>();
+RecordGen recordGen = new RecordGen(payloadClass);
+for (GenericRecord r : getInserts(n, partitions, ts, recordGen)) {
+  inserts.add(getHoodieRecord(r, recordGen.getPayloadClass()));
+}
+return inserts;
+  }
+
+  private static List getInserts(int n, String[] partitions, 
long ts, RecordGen recordGen) {
+return IntStream.range(0, n).mapToObj(id -> {
+  String pt = partitions.length == 0 ? "" : partitions[id % 
partitions.length];
+  return getInsert(id, pt, ts, recordGen);
+}).collect(Collectors.toList());
+  }
+
+  private static GenericRecord getInsert(int id, String pt, long ts, RecordGen 
recordGen) {
+GenericRecord r = new GenericData.Record(SCHEMA);
+

[GitHub] [hudi] yihua commented on a diff in pull request #8679: [DOCS] [RFC-69] Hudi 1.X

2023-05-11 Thread via GitHub


yihua commented on code in PR #8679:
URL: https://github.com/apache/hudi/pull/8679#discussion_r1191314589


##
rfc/rfc-69/rfc-69.md:
##
@@ -0,0 +1,159 @@
+
+# RFC-69: Hudi 1.X
+
+## Proposers
+
+* Vinoth Chandar
+
+## Approvers
+
+*   Hudi PMC
+
+## Status
+
+Under Review
+
+## Abstract
+
+This RFC proposes an exciting and powerful re-imagination of the transactional 
database layer in Hudi to power continued innovation across the community in 
the coming years. We have 
[grown](https://git-contributor.com/?chart=contributorOverTime=apache/hudi)
 more than 6x contributors in the past few years, and this RFC serves as the 
perfect opportunity to clarify and align the community around a core vision. 
This RFC aims to serve as a starting point for this discussion, then solicit 
feedback, embrace new ideas and collaboratively build consensus towards an 
impactful Hudi 1.X vision, then distill down what constitutes the first release 
- Hudi 1.0.
+
+## **State of the Project**
+
+As many of you know, Hudi was originally created at Uber in 2016 to solve 
[large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) 
and [incremental data 
processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems 
and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. 
+Since its graduation as a top-level Apache project in 2020, the community has 
made impressive progress toward the [streaming data lake 
vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) 
to make data lakes more real-time and efficient with incremental processing on 
top of a robust set of platform components. 
+The most recent 0.13 brought together several notable features to empower 
incremental data pipelines, including - [_RFC-51 Change Data 
Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), 
more advanced indexing techniques like [_consistent hash 
indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and 
+novel innovations like [_early conflict 
detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - 
to name a few.
+
+
+
+Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve 
end-end use cases using Hudi as a data lake platform that delivers a 
significant amount of automation on top of an interoperable open storage 
format. 
+Users can ingest incrementally from files/streaming systems/databases and 
insert/update/delete that data into Hudi tables, with a wide selection of 
performant indexes. 
+Thanks to the core design choices like record-level metadata and 
incremental/CDC queries, users are able to consistently chain the ingested data 
into downstream pipelines, with the help of strong stream processing support in 
+recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. 
Hudi's table services automatically kick in across this ingested and derived 
data to manage different aspects of table bookkeeping, metadata and storage 
layout. 
+Finally, Hudi's broad support for different catalogs and wide integration 
across various query engines mean Hudi tables can also be "batch" processed 
old-school style or accessed from interactive query engines.
+
+## **Future Opportunities**
+
+We're adding new capabilities in the 0.x release line, but we can also turn 
the core of Hudi into a more general-purpose database experience for the lake. 
As the first kid on the lakehouse block (we called it "transactional data 
lakes" or "streaming data lakes" 

Review Comment:
   cc @garyli1019 as he had a lot of ideas on this topic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua opened a new pull request, #8690: [WIP][HUDI-6199] Fix deletes with custom payload implementation

2023-05-11 Thread via GitHub


yihua opened a new pull request, #8690:
URL: https://github.com/apache/hudi/pull/8690

   ### Change Logs
   
   Delete operation in custom payloads after RFC-46: while looking into a 0.13.1 release [blocker](https://github.com/apache/hudi/pull/8573), I found that custom payload implementations like the AWS DMS payload and the Debezium payloads were not properly migrated to the new APIs introduced by RFC-46, causing the delete operation to fail. Our tests did not catch this.

   It is currently assumed that delete records are marked by "_hoodie_is_deleted"; however, custom CDC payloads use an op field to mark deletes.
   
   This PR fixes the issue.
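   
   For illustration only (not this PR's implementation), the two delete conventions described above look roughly like the hypothetical helper below; the operation column name and delete value vary by payload (e.g. an AWS DMS-style `Op` column with value `D`):
   
   ```java
import org.apache.avro.generic.GenericRecord;

public final class DeleteMarkerSketch {
  private DeleteMarkerSketch() {
  }

  // opField/deleteValue are payload-specific, e.g. "Op"/"D" in the AWS DMS convention.
  public static boolean isDelete(GenericRecord record, String opField, String deleteValue) {
    // Convention 1: a boolean "_hoodie_is_deleted" column set to true.
    if (record.getSchema().getField("_hoodie_is_deleted") != null
        && Boolean.TRUE.equals(record.get("_hoodie_is_deleted"))) {
      return true;
    }
    // Convention 2: a CDC operation column carrying a delete marker value.
    if (record.getSchema().getField(opField) != null) {
      Object op = record.get(opField);
      return op != null && deleteValue.equalsIgnoreCase(op.toString());
    }
    return false;
  }
}
   ```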
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6199) CDC payload with op field for deletes do not work

2023-05-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6199:
-
Labels: pull-request-available  (was: )

> CDC payload with op field for deletes do not work
> -
>
> Key: HUDI-6199
> URL: https://issues.apache.org/jira/browse/HUDI-6199
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> Delete operation in custom payload after RFC-46: while looking into a 0.13.1 
> release [blocker|https://github.com/apache/hudi/pull/8573], I found that 
> custom payload implementation like AWS DMS payload and Debezium payload are 
> not properly migrated to the new APIs introduced by RFC-46, causing the 
> delete operation to fail.  Our tests did not catch this.  
>  
> It is currently assumed that delete records are marked by 
> "_hoodie_is_deleted"; however, custom CDC payloads use op field to mark 
> deletes.
>  
> Impact:
> OverwriteWithLatestAvroPayload (and also OverwriteNonDefaultsWithLatestAvroPayload) 
> are not affected.
> For any other custom payloads (AWSDmsAvroPayload, all Debezium payloads), 
> deletes are broken. 
> If someone is using "_hoodie_is_deleted" to enforce deletes, there are no 
> issues with custom payloads.
> COW: 
> deleting a non-existent record will break if not using the "_hoodie_is_deleted" approach.
> MOR: 
> any deletes will break if not using the "_hoodie_is_deleted" approach.
> Writer:
> all writers (Spark, Flink) except Spark SQL.
> DefaultHoodieRecordPayload delete marker support in 0.14.0 is also affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

2023-05-11 Thread via GitHub


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1190732568


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -892,25 +917,24 @@ public void buildMetadataPartitions(HoodieEngineContext 
engineContext, List inflightIndexes = 
getInflightMetadataPartitions(dataMetaClient.getTableConfig());
-
inflightIndexes.addAll(indexPartitionInfos.stream().map(HoodieIndexPartitionInfo::getMetadataPartitionPath).collect(Collectors.toSet()));
-
dataMetaClient.getTableConfig().setValue(HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT.key(),
 String.join(",", inflightIndexes));
-HoodieTableConfig.update(dataMetaClient.getFs(), new 
Path(dataMetaClient.getMetaPath()), dataMetaClient.getTableConfig().getProps());
-initialCommit(indexUptoInstantTime + METADATA_INDEXER_TIME_SUFFIX, 
partitionTypes);
+
+// before initialization set these  partitions as inflight in table config
+HoodieTableMetadataUtil.setMetadataInflightPartitions(dataMetaClient, 
partitionTypes);

Review Comment:
   wrt 2nd flight. we should be cautious in updating table config on the fly. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


