vinothchandar commented on a change in pull request #3515: URL: https://github.com/apache/hudi/pull/3515#discussion_r697440656
##########
File path: website/blog/2021-08-20-immutable-data-lakes.md
##########
@@ -0,0 +1,73 @@
+---
+title: "Immutable data lakes using Apache Hudi"
+excerpt: "How to leverage Apache Hudi for your immutable (or) append only data use-case"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types, config knobs to cater to everyone's need.
+We strive to listen to community and build features based on the need. From our interactions with the community, we got
+to know there are quite a few use-cases where Hudi is being used for immutable or append only data. This blog will go
+over details on how to leverage Apache Hudi in building your data lake for such immutable or append only data.
+<!--truncate-->
+
+# Immutable data
+Often times, users route log entries to data lakes, where data is immutable. (Add some concrete
+examples here). Data once ingested won't be updated and can only be deleted. Also, most likely, deletes are issued at
+partition level (delete partitions older than 1 week) granularity.
+
+# Immutable data lakes using Apache Hudi
+Hudi has an efficient way to ingest data into Hudi for such immutable use-cases. "Bulk_Insert" operation in Hudi is
+commonly used for initial bootstrapping of data into hudi, but also exactly fits the bill for such immutable or append
+only data. And it is known to be performant when compared to regular "insert"s or "upsert"s.
+
+## Bulk_insert vs regular Inserts/Upserts
+With regular inserts and upserts, Hudi executes few steps before data can be written to data files. For example,
+index lookup, small file handling, etc has to be performed before actual write. But with bulk_insert, such overhead can
+be avoided since data is known to be immutable.
+
+Here is an illustration of steps involved in different operations of interest.
+
+![Inserts/Upserts](/assets/images/blog/immutable_datalakes/immutable_data_lakes1.png)
+
+_Figure: High level steps on Insert/Upsert operation with Hudi._
+
+![Bulk_Insert](/assets/images/blog/immutable_datalakes/immutable_data_lakes2.png)
+
+_Figure: High level steps on Bulk_insert operation with Hudi._
+
+As you could see, bulk_insert skips the unnecessary step of indexing and small file handling which could bring down
+your write latency by a large degree for append only data. And bulk_insert also supports "Row writer" path which
+is known to be performant compared to Rdd path (WriteClient). So, users can enjoy the blazing fast writes for such
+immutable data using bulk_insert operation and row writer.
+
+:::note
+There won't be any small file handling with bulk_insert. But users can choose to leverage Clustering to batch small
+files into larger ones if need be.
+:::
+
+## Configurations
+Users need to set the write operation config `hoodie.datasource.write.operation` to "bulk_insert". To leverage row
+writer, one has to enable `hoodie.datasource.write.row.writer.enable`. Default value of this config is false.
+
+## Supported Operations
+Even though this is catered towards immutable data, all operations are supported for a hudi table in general. Once an
+issue deletes, enable metadata, add clustering etc to these tables. Just that users can leverage bulk_insert for faster
+writes compared to other operations by bypassing the additional overhead.
+
+Hudi is also adding Virtual key support in upcoming release, and users can also enable virtual keys for such immutable

Review comment:
       this is out there already?
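For anyone reading along with the draft above, here is a minimal sketch of what its Configurations section amounts to on the Spark datasource path, assuming Spark with the Hudi bundle on the classpath. The table name, base path, and field names below are hypothetical placeholders, and the commented-out clustering knobs should be double-checked against the configuration reference for the Hudi version in use.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-bulk-insert-immutable-data")
  .getOrCreate()
import spark.implicits._

// Hypothetical append-only event data; in practice this would come from a log source.
val eventsDf = Seq(
  ("id-1", "2021-08-20", "click"),
  ("id-2", "2021-08-20", "view")
).toDF("uuid", "event_date", "event_type")

eventsDf.write
  .format("hudi")
  .option("hoodie.table.name", "immutable_events")
  // bulk_insert skips the index lookup and small-file handling steps described in the post.
  .option("hoodie.datasource.write.operation", "bulk_insert")
  // Row writer path; per the draft, this has to be enabled explicitly.
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "event_date")
  .option("hoodie.datasource.write.precombine.field", "event_date")
  // Optional: inline clustering can later batch the small files bulk_insert leaves behind,
  // since bulk_insert itself does no small-file handling.
  // .option("hoodie.clustering.inline", "true")
  // .option("hoodie.clustering.inline.max.commits", "4")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/immutable_events")
```

Since bulk_insert does no small-file handling, the inline clustering options shown commented out are the usual follow-up for batching small files into larger ones, as the note in the draft suggests.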
##########
File path: website/blog/2021-08-20-immutable-data-lakes.md
##########
@@ -0,0 +1,73 @@
+---
+title: "Immutable data lakes using Apache Hudi"
+excerpt: "How to leverage Apache Hudi for your immutable (or) append only data use-case"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types, config knobs to cater to everyone's need.
+We strive to listen to community and build features based on the need. From our interactions with the community, we got
+to know there are quite a few use-cases where Hudi is being used for immutable or append only data. This blog will go
+over details on how to leverage Apache Hudi in building your data lake for such immutable or append only data.
+<!--truncate-->
+
+# Immutable data
+Often times, users route log entries to data lakes, where data is immutable. (Add some concrete
+examples here). Data once ingested won't be updated and can only be deleted. Also, most likely, deletes are issued at
+partition level (delete partitions older than 1 week) granularity.
+
+# Immutable data lakes using Apache Hudi
+Hudi has an efficient way to ingest data into Hudi for such immutable use-cases. "Bulk_Insert" operation in Hudi is
+commonly used for initial bootstrapping of data into hudi, but also exactly fits the bill for such immutable or append
+only data. And it is known to be performant when compared to regular "insert"s or "upsert"s.
+
+## Bulk_insert vs regular Inserts/Upserts
+With regular inserts and upserts, Hudi executes few steps before data can be written to data files. For example,
+index lookup, small file handling, etc has to be performed before actual write. But with bulk_insert, such overhead can
+be avoided since data is known to be immutable.
+
+Here is an illustration of steps involved in different operations of interest.
+
+![Inserts/Upserts](/assets/images/blog/immutable_datalakes/immutable_data_lakes1.png)
+
+_Figure: High level steps on Insert/Upsert operation with Hudi._
+
+![Bulk_Insert](/assets/images/blog/immutable_datalakes/immutable_data_lakes2.png)
+
+_Figure: High level steps on Bulk_insert operation with Hudi._
+
+As you could see, bulk_insert skips the unnecessary step of indexing and small file handling which could bring down
+your write latency by a large degree for append only data. And bulk_insert also supports "Row writer" path which
+is known to be performant compared to Rdd path (WriteClient). So, users can enjoy the blazing fast writes for such
+immutable data using bulk_insert operation and row writer.
+
+:::note
+There won't be any small file handling with bulk_insert. But users can choose to leverage Clustering to batch small
+files into larger ones if need be.
+:::
+
+## Configurations
+Users need to set the write operation config `hoodie.datasource.write.operation` to "bulk_insert". To leverage row
+writer, one has to enable `hoodie.datasource.write.row.writer.enable`. Default value of this config is false.

Review comment:
       it's not false anymore right?


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org