pratyakshsharma commented on code in PR #8679: URL: https://github.com/apache/hudi/pull/8679#discussion_r1192795196
########## rfc/rfc-69/rfc-69.md: ########## @@ -0,0 +1,159 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime&repo=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" Review Comment: nit: We're -> We have been ########## rfc/rfc-69/rfc-69.md: ########## @@ -0,0 +1,159 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime&repo=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for Hudi tables. +* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver +(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled +due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open +solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data. +* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities. +Using Hudi's HFile tables for ML Model serving is an emerging use case with users who want a lower-cost, lightweight means to serve computed data directly off the lake storage. Often, unstructured data like JSON and blobs +like images must be pseudo-modeled with some structure, leading to poor performance or manageability. With the meteoric rise of AI/ML in recent years, the lack of support for complex, unstructured, large blobs in a project like Hudi will only fragment data in lakes. +To this end, we need to support all the major image, video and ML/AI formats with the same depth of capabilities around indexing, mutating or capturing changes. +* **Even greater self-management**: Hudi offers the most extensive set of capabilities today in open-source data lake management, from ingesting data to optimizing data and automating various bookkeeping activities to +automatically manage table data and metadata. Seeing how the community has used this management layer to up-level their data lake experience is impressive. However, we have plenty of capabilities to be added, e.g., +reverse streaming data into other systems or [snapshot management](https://github.com/apache/hudi/pull/6576/files) or [diagnostic reporters](https://github.com/apache/hudi/pull/6600) or cross-region logical replication or +record-level [time-to-live management](https://github.com/apache/hudi/pull/8062), to name a few. + +## **Hudi 1.X** + +Given that we have approached Hudi more like a database problem, it's unsurprising that Hudi has many building blocks that make up a database. Drawing a baseline from the +seminal [Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf) paper (see page 4), we can see how Hudi makes up the bottom half of a database optimized for the lake, +with multiple query engines layers - SQL, programmatic access, specialized for ML/AI, real-time analytics and other engines sitting on top. The major areas below directly map how we have tracked +the Hudi [roadmap](https://hudi.apache.org/roadmap). We will see how we have adapted these components specifically for the scale of data lakes and the characteristics of lake workloads. + +![](./hudi-dblayers.png) + +_Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue)._ + + + +The log manager component in a database helps organize logs for recovery of the database during crashes, among other things. At the transactional layer, Hudi implements ways to organize data into file groups and +file slices and stores events that modify the table state in a timeline. Hudi also tracks inflight transactions using marker files for effective rollbacks. Since the lake stores way more data than typical operational +databases or data warehouses while needing much longer record version tracking, Hudi generates record-level metadata that compresses well to aid in features like change data capture or incremental queries, effectively +treating data itself as a log. In the future, we would want to continue improving the data organization in Hudi, to provide scalable, infinite timeline and data history, time-travel writes, storage federation and other features. + + + +The lock manager component helps implement concurrency control mechanisms in a database. Hudi ships with several external lock managers, although we would want to ultimately streamline this through +our [metaserver](https://github.com/apache/hudi/pull/4718) that serves only timeline metadata today. The paper (pg 81) describes the tradeoffs between the common concurrency control techniques in +databases: _2Phase Locking_ (hard to implement without a central transaction manager), _OCC_ (works well w/o contention, fails very poorly with contention) and _MVCC_ (yields high throughputs, but relaxed +serializability in some cases). Hudi implements OCC between concurrent writers while providing MVCC-based concurrency for writers and table services to avoid any blocking between them. Taking a step back, +we need to ask ourselves if we are building an OLTP relational database to avoid falling into the trap of blindly applying the same concurrency control techniques that apply to them to the high-throughput +pipelines/jobs writing to the lake. Hudi has a less enthusiastic view of OCC and encourages serializing updates/deletes/inserts through the input stream to avoid performance penalties with OCC for fast-mutating +tables or streaming workloads. Even as we implemented techniques like Early Conflict Detection to improve OCC, this RFC proposes Hudi should pursue a more general purpose non-blocking MVCC-based concurrency control +while retaining OCC for simple and batch append-only use cases. + + + +The access methods component encompasses indexes, metadata and storage layout organization techniques exposed to reads/writes on the database. Last year, we added +a new [multi-modal index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi) with support for [asynchronous index building](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md) based +on MVCC to build indexes without blocking writers and still be consistent upon completion with the table data. Our focus has thus far been more narrowly aimed at using indexing techniques for write performance, +while queries benefit from files and column statistics metadata for planning. In the future, we want to generalize support for using various index types uniformly across writes and queries so that queries can be planned, +optimized and executed efficiently on top of Hudi's indices. This is now possible due to having Hudi's connectors for popular open-source engines like Presto, Spark and Trino. +New [secondary indexing schemes](https://github.com/apache/hudi/pull/5370) and a proposal for built-in index functions to index values derived from columns have already been added. + +The buffer manager component manages dirtied blocks of storage and also caches data for faster query responses. In Hudi's context, we want to bring to life our now long-overdue +columnar [caching service](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#lake-cache) that can sit transparently between lake storage and query engines while understanding +transaction boundaries and record mutations. The tradeoffs in designing systems that balance read, update and memory costs are detailed in the [RUM conjecture](https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf). +Our basic idea here is to optimize for read (faster queries served out of cache) and update (amortizing MoR merge costs by continuously compacting in-memory) costs while adding the cost of cache/memory to the system. +Currently, there are potentially many candidate designs for this idea, and we would need a separate design/RFC to pursue them. + +Shared components include replication, loading, and various utilities, complete with a catalog or metadata server. Most databases hide the underlying format/storage complexities, providing users +with many data management tools. Hudi is no exception. Hudi has battle-hardened bulk and continuous data loading utilities (deltastreamer, flinkstreamer tools, along with Kafka Connect Sink), +a comprehensive set of table services (cleaning, archival, compaction, clustering, indexing, ..), admin CLI and much much more. The community has been working on new server components +like a [metaserver](https://github.com/apache/hudi/pull/4718) that could expand to indexing the table metadata using advanced data structures like zone maps/interval trees or a [table management service](https://github.com/apache/hudi/pull/4309) to manage Hudi tables +centrally. We would love to evolve towards having a set of horizontally scalable, highly available metaservers, that can provide both these functionalities as well as some of the lock management capabilities. +Another interesting direction to pursue would be a reverse loader/streamer utility that can also move data out of Hudi into other external storage systems. + +In all, we propose Hudi 1.x as a reimagination of Hudi, as the _transactional database for the lake_, with [polyglot persistence](https://en.wikipedia.org/wiki/Polyglot_persistence), raising the level of +abstraction and platformization even higher for Hudi data lakes. Review Comment: +1 ########## rfc/rfc-69/rfc-69.md: ########## @@ -0,0 +1,159 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-69: Hudi 1.X + +## Proposers + +* Vinoth Chandar + +## Approvers + +* Hudi PMC + +## Status + +Under Review + +## Abstract + +This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime&repo=apache/hudi) more than 6x contributors in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, then solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, then distill down what constitutes the first release - Hudi 1.0. + +## **State of the Project** + +As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF. +Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components. +The most recent 0.13 brought together several notable features to empower incremental data pipelines, including - [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and +novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few. + + + +Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format. +Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes. +Thanks to the core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, with the help of strong stream processing support in +recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout. +Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines. + +## **Future Opportunities** + +We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes" +to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up. + +* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but painfully hard to integrate into. Over time, we expected clear API abstractions +around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi can tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource, +Presto and Trino connectors. However, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution. +* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for +keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature extensible SQL support that can support a generalized, +relational data model for Hudi tables. +* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver +(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or a Hive metaserver. In fact, our timeline server efforts were stalled +due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly-open +solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining server-less for data. +* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities. +Using Hudi's HFile tables for ML Model serving is an emerging use case with users who want a lower-cost, lightweight means to serve computed data directly off the lake storage. Often, unstructured data like JSON and blobs +like images must be pseudo-modeled with some structure, leading to poor performance or manageability. With the meteoric rise of AI/ML in recent years, the lack of support for complex, unstructured, large blobs in a project like Hudi will only fragment data in lakes. +To this end, we need to support all the major image, video and ML/AI formats with the same depth of capabilities around indexing, mutating or capturing changes. +* **Even greater self-management**: Hudi offers the most extensive set of capabilities today in open-source data lake management, from ingesting data to optimizing data and automating various bookkeeping activities to +automatically manage table data and metadata. Seeing how the community has used this management layer to up-level their data lake experience is impressive. However, we have plenty of capabilities to be added, e.g., +reverse streaming data into other systems or [snapshot management](https://github.com/apache/hudi/pull/6576/files) or [diagnostic reporters](https://github.com/apache/hudi/pull/6600) or cross-region logical replication or +record-level [time-to-live management](https://github.com/apache/hudi/pull/8062), to name a few. + +## **Hudi 1.X** + +Given that we have approached Hudi more like a database problem, it's unsurprising that Hudi has many building blocks that make up a database. Drawing a baseline from the +seminal [Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf) paper (see page 4), we can see how Hudi makes up the bottom half of a database optimized for the lake, +with multiple query engines layers - SQL, programmatic access, specialized for ML/AI, real-time analytics and other engines sitting on top. The major areas below directly map how we have tracked +the Hudi [roadmap](https://hudi.apache.org/roadmap). We will see how we have adapted these components specifically for the scale of data lakes and the characteristics of lake workloads. + +![](./hudi-dblayers.png) + +_Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue)._ + + + +The log manager component in a database helps organize logs for recovery of the database during crashes, among other things. At the transactional layer, Hudi implements ways to organize data into file groups and +file slices and stores events that modify the table state in a timeline. Hudi also tracks inflight transactions using marker files for effective rollbacks. Since the lake stores way more data than typical operational +databases or data warehouses while needing much longer record version tracking, Hudi generates record-level metadata that compresses well to aid in features like change data capture or incremental queries, effectively +treating data itself as a log. In the future, we would want to continue improving the data organization in Hudi, to provide scalable, infinite timeline and data history, time-travel writes, storage federation and other features. Review Comment: time-travel writes looks interesting to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org