[hudi] branch asf-site updated: [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study (#4260)

bhavanisudha Thu, 09 Dec 2021 10:10:06 -0800

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 8373c8b  [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study (#4260)
8373c8b is described below

commit 8373c8b1dc16f759bc5ed902daa1c9f67c41a6de
Author: Kyle Weller <kywe...@gmail.com>
AuthorDate: Thu Dec 9 10:09:36 2021 -0800

    [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study (#4260)
    
    * added reference to vc's benchmark study
    
    * moved metadata to concepts
---
 website/docs/metadata.md | 25 +++++++++++++++++--------
 website/sidebars.js      |  2 +-
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index 13cf669..ec4b2ee 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -5,14 +5,23 @@ keywords: [ hudi, metadata, S3 file listings]
 
 ## Motivation for a Metadata Table
 
-The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The main purpose of the 
-Metadata Table is:
-
-1. **Eliminate the requirement for the "list files" operation:**
-   1. When reading and writing data, file listing operations are performed to get the current view of the file system.
-      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
-      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
-      proactively maintain the list of files and remove the need for recursive file listing operations.
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The main purpose of the
+Metadata Table is to eliminate the requirement for the "list files" operation.
+
+When reading and writing data, file listing operations are performed to get the current view of the file system.
+When data sets are large, listing all the files may be a performance bottleneck, but more importantly in the case of cloud storage systems
+like AWS S3, the large number of file listing requests sometimes causes throttling due to certain request limits.
+The Metadata Table will instead proactively maintain the list of files and remove the need for recursive file listing operations.
+
+### Some numbers from a study:
+In a TPC-DS benchmark, the p50 list latency for a single folder scales ~linearly with the number of files/objects:
+
+|Number of files/objects|100|1K|10K|100K|
+|---|---|---|---|---|
+|P50 list latency|50ms|131ms|1062ms|9932ms|
+
+Listings from the Metadata Table, by contrast, do not scale linearly with file/object count and instead take about 100-500ms per read even for very large tables.
+Even better, the timeline server caches portions of the metadata (currently only for writers), and provides ~10ms performance for listings.
 
 ## Enable Hudi Metadata Table
 The Hudi Metadata Table is not enabled by default. If you wish to turn it on you need to enable the following configuration:
diff --git a/website/sidebars.js b/website/sidebars.js
index 23b9227..adc3cfb 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -26,6 +26,7 @@ module.exports = {
                 'table_types',
                 'indexing',
                 'file_layouts',
+                'metadata',
                 'write_operations',
                 'schema_evolution',
                 'key_generation',
@@ -54,7 +55,6 @@ module.exports = {
                 'transforms',
                 'markers',
                 'file_sizing',
-                'metadata',
                 'snapshot_exporter',
                 'precommit_validator'
             ],
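
The benchmark table added to website/docs/metadata.md in this patch can be sanity-checked with a little arithmetic. The sketch below is not part of the commit; it only reuses the four p50 figures from the table, and the variable and function names are made up for illustration. It computes the listing cost per object and the growth factor between consecutive columns, each of which has 10x more objects than the one before.

```python
# p50 list latency in milliseconds, keyed by number of files/objects.
# Values are copied from the table in the patch; the names are illustrative only.
p50_list_latency_ms = {100: 50, 1_000: 131, 10_000: 1062, 100_000: 9932}


def per_object_latency_ms(latencies):
    """Return the p50 latency divided by the object count for each column."""
    return {count: ms / count for count, ms in latencies.items()}


def growth_factors(latencies):
    """Return the latency ratio between consecutive columns (10x more objects each)."""
    counts = sorted(latencies)
    return [latencies[b] / latencies[a] for a, b in zip(counts, counts[1:])]


per_object = per_object_latency_ms(p50_list_latency_ms)
factors = growth_factors(p50_list_latency_ms)

for count, cost in per_object.items():
    print(f"{count:>7} objects: {cost:.4f} ms per object")
print("growth per 10x objects:", [round(f, 2) for f in factors])
```

A perfectly linear relationship would give a growth factor of 10 at every step. On these figures the factor rises toward 10 as the object count grows, and a direct listing moves past the 100-500ms range quoted for Metadata Table reads somewhere between 1K and 10K objects.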
