This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new 601f54f [HUDI-1563] Adding hudi file sizing/ small file management blog (#2612) 601f54f is described below commit 601f54f1ea215281ede51125872d5c2455077dba Author: Sivabalan Narayanan <sivab...@uber.com> AuthorDate: Mon Mar 15 11:18:57 2021 -0400 [HUDI-1563] Adding hudi file sizing/ small file management blog (#2612) Co-authored-by: Vinoth Chandar <vin...@apache.org> --- docs/_posts/2021-03-01-hudi-file-sizing.md | 85 +++++++++++++++++++++ .../blog/hudi-file-sizing/adding_new_files.png | Bin 0 -> 44237 bytes .../bin_packing_existing_data_files.png | Bin 0 -> 23955 bytes .../blog/hudi-file-sizing/initial_layout.png | Bin 0 -> 34742 bytes 4 files changed, 85 insertions(+) diff --git a/docs/_posts/2021-03-01-hudi-file-sizing.md b/docs/_posts/2021-03-01-hudi-file-sizing.md new file mode 100644 index 0000000..c79ea80 --- /dev/null +++ b/docs/_posts/2021-03-01-hudi-file-sizing.md @@ -0,0 +1,85 @@ +--- +title: "Streaming Responsibly - How Apache Hudi maintains optimum sized files" +excerpt: "Maintaining well-sized files can improve query performance significantly" +author: shivnarayan +category: blog +--- + +Apache Hudi is a data lake platform technology that provides several functionalities needed to build and manage data lakes. +One such key feature that hudi provides is self-managing file sizing so that users don’t need to worry about +manual table maintenance. Having a lot of small files will make it harder to achieve good query performance, due to query engines +having to open/read/close files way too many times, to plan and execute queries. But for streaming data lake use-cases, +inherently ingests are going to end up having smaller volume of writes, which might result in lot of small files if no special handling is done. + +# During Write vs After Write + +Common approaches to writing very small files and then later stitching them together solve for system scalability issues posed +by small files but might violate query SLA's by exposing small files to them. In fact, you can easily do so on a Hudi table, +by running a clustering operation, as detailed in a [previous blog](/blog/hudi-clustering-intro/). + +In this blog, we discuss file sizing optimizations in Hudi, during the initial write time, so we don't have to effectively +re-write all data again, just for file sizing. If you want to have both (a) self managed file sizing and +(b) Avoid exposing small files to queries, automatic file sizing feature saves the day. + +Hudi has the ability to maintain a configured target file size, when performing inserts/upsert operations. +(Note: bulk_insert operation does not provide this functionality and is designed as a simpler replacement for +normal `spark.write.parquet`). + +## Configs + +For illustration purposes, we are going to consider only COPY_ON_WRITE table. + +Configs of interest before we dive into the algorithm: + +- [Max file size](/docs/configurations.html#limitFileSize): Max size for a given data file. Hudi will try to maintain file sizes to this configured value <br/> +- [Soft file limit](/docs/configurations.html#compactionSmallFileSize): Max file size below which a given data file is considered to a small file <br/> +- [Insert split size](/docs/configurations.html#insertSplitSize): Number of inserts grouped for a single partition. This value should match +the number of records in a single file (you can determine based on max file size and per record size) + +For instance, if your first config value is 120MB and 2nd config value is set to 100MB, any file whose size is < 100MB +would be considered a small file. + +If you wish to turn off this feature, set the config value for soft file limit to 0. + +## Example + +Let’s say this is the layout of data files for a given partition. + +![Initial layout](/assets/images/blog/hudi-file-sizing/initial_layout.png) +_Figure: Initial data file sizes for a given partition of interest_ + +Let’s assume the configured values for max file size and small file size limit are 120MB and 100MB. File_1’s current +size is 40MB, File_2’s size is 80MB, File_3’s size is 90MB, File_4’s size is 130MB and File_5’s size is 105MB. Let’s see +what happens when a new write happens. + +**Step 1:** Assigning updates to files. In this step, We look up the index to find the tagged location and records are +assigned to respective files. Note that we assume updates are only going to increase the file size and that would simply result +in a much bigger file. When updates lower the file size (by say, nulling out lot of fields), then a subsequent write will deem +it a small file. + +**Step 2:** Determine small files for each partition path. The soft file limit config value will be leveraged here +to determine eligible small files. In our example, given the config value is set to 100MB, the small files are File_1(40MB) +and File_2(80MB) and file_3’s (90MB). + +**Step 3:** Once small files are determined, incoming inserts are assigned to them so that they reach their max capacity of +120MB. File_1 will be ingested with 80MB worth of inserts, file_2 will be ingested with 40MB worth of inserts and +File_3 will be ingested with 30MB worth of inserts. + +![Bin packing small files](/assets/images/blog/hudi-file-sizing/bin_packing_existing_data_files.png) +_Figure: Incoming records are bin packed to existing small files_ + +**Step 4:** Once all small files are bin packed to its max capacity and if there are pending inserts unassigned, new file +groups/data files are created and inserts are assigned to them. Number of records per new data file is determined from insert split +size config. Assuming the insert split size is configured to 120k records, if there are 300k remaining records, 3 new +files will be created in which 2 of them (File_6 and File_7) will be filled with 120k records and the last one (File_8) +will be filled with 60k records (assuming each record is 1000 bytes). In future ingestions, 3rd new file will be +considered as a small file to be packed with more data. + +![Assigning to new files](/assets/images/blog/hudi-file-sizing/adding_new_files.png) +_Figure: Remaining records are assigned to new files_ + +Hudi leverages mechanisms such as custom partitioning for optimized record distribution to different files, executing +the algorithm above. After this round of ingestion is complete, all files except File_8 are nicely sized to the optimum size. +This process is followed during every ingestion to ensure there are no small files in your Hudi tables. + +Hopefully the blog gave you an overview into how hudi manages small files and assists in boosting your query performance. diff --git a/docs/assets/images/blog/hudi-file-sizing/adding_new_files.png b/docs/assets/images/blog/hudi-file-sizing/adding_new_files.png new file mode 100644 index 0000000..f61cd89 Binary files /dev/null and b/docs/assets/images/blog/hudi-file-sizing/adding_new_files.png differ diff --git a/docs/assets/images/blog/hudi-file-sizing/bin_packing_existing_data_files.png b/docs/assets/images/blog/hudi-file-sizing/bin_packing_existing_data_files.png new file mode 100644 index 0000000..324c7fc Binary files /dev/null and b/docs/assets/images/blog/hudi-file-sizing/bin_packing_existing_data_files.png differ diff --git a/docs/assets/images/blog/hudi-file-sizing/initial_layout.png b/docs/assets/images/blog/hudi-file-sizing/initial_layout.png new file mode 100644 index 0000000..ae0e9a1 Binary files /dev/null and b/docs/assets/images/blog/hudi-file-sizing/initial_layout.png differ