kazdy commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1191580728
########## rfc/rfc-69/rfc-69.md: ########## @@ -0,0 +1,159 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime&repo=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for Hudi tables. Review Comment: > > How does this relate to the below point "Beyond structured Data"? How to marry these two together elegantly? > > I think we can study the experience offered by json/document stores more here. and see if we can borrow some approaches.. e.g Mongo. wdyt? @vinothchandar I don't know how well both can be supported at the same time. Eg in sql server when json data is stored in a column they parse it and store it internally in a normalized form in internal tables and then when a user issues a query to read it then it's parsed back to json format. This is clever but feels clunky at the same time. On the other hand, Mongo is super flexible but is/was? not that great at joins so that is not aligned with the relational model. I myself sometimes keep json data as strings in Hudi tables, but that's mostly because I was scared of schema evolution and compatibility issues and I can not be sure what will I get from the data producer. At the same time, I wanted to use some of Hudi capabilities. So for semi-structured data solution can be to introduce super flexibility with schemas in Hudi + maybe add another base format like BSON to support this flexibility easier. But might be you are thinking about something far beyond this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org