This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 8373c8b  [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study (#4260)
8373c8b is described below

commit 8373c8b1dc16f759bc5ed902daa1c9f67c41a6de
Author: Kyle Weller <kywe...@gmail.com>
AuthorDate: Thu Dec 9 10:09:36 2021 -0800

    [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study (#4260)

    * added reference to vc's benchmark study

    * moved metadata to concepts
---
 website/docs/metadata.md | 25 +++++++++++++++++--------
 website/sidebars.js      |  2 +-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index 13cf669..ec4b2ee 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -5,14 +5,23 @@ keywords: [ hudi, metadata, S3 file listings]
 
 ## Motivation for a Metadata Table
 
-The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The main purpose of the
-Metadata Table is:
-
-1. **Eliminate the requirement for the "list files" operation:**
-   1. When reading and writing data, file listing operations are performed to get the current view of the file system.
-      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
-      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
-      proactively maintain the list of files and remove the need for recursive file listing operations.
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The main purpose of the
+Metadata Table is to eliminate the requirement for the "list files" operation.
+
+When reading and writing data, file listing operations are performed to get the current view of the file system.
+When data sets are large, listing all the files may be a performance bottleneck, but more importantly in the case of cloud storage systems
+like AWS S3, the large number of file listing requests sometimes causes throttling due to certain request limits.
+The Metadata Table will instead proactively maintain the list of files and remove the need for recursive file listing operations
+
+### Some numbers from a study:
+Running a TPCDS benchmark the p50 list latencies for a single folder scales ~linearly with the amount of files/objects:
+
+|Number of files/objects|100|1K|10K|100K|
+|---|---|---|---|---|
+|P50 list latency|50ms|131ms|1062ms|9932ms|
+
+Whereas listings from the Metadata Table will not scale linearly with file/object count and instead take about 100-500ms per read even for very large tables.
+Even better, the timeline server caches portions of the metadata (currently only for writers), and provides ~10ms performance for listings.
 
 ## Enable Hudi Metadata Table
 The Hudi Metadata Table is not enabled by default. If you wish to turn it on you need to enable the following configuration:
diff --git a/website/sidebars.js b/website/sidebars.js
index 23b9227..adc3cfb 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -26,6 +26,7 @@ module.exports = {
         'table_types',
         'indexing',
         'file_layouts',
+        'metadata',
         'write_operations',
         'schema_evolution',
         'key_generation',
@@ -54,7 +55,6 @@ module.exports = {
         'transforms',
         'markers',
         'file_sizing',
-        'metadata',
         'snapshot_exporter',
         'precommit_validator'
     ],