[GitHub] [hudi] nsivabalan commented on a diff in pull request #5440: [HUDI-3930][Docs] Adding documentation around Data Skipping

GitBox Thu, 28 Apr 2022 18:18:59 -0700


nsivabalan commented on code in PR #5440:
URL: https://github.com/apache/hudi/pull/5440#discussion_r861409940



##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 
 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
 
-### Snapshot Queries
 
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
 
-**Hive**
+#### Data Skipping
+ 
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file metadata to prune the search space very effectively, by
+avoiding reading (even the footers of) files that are known (based on that metadata) to contain only data that _does not match_ the query's filters.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png"  />
-</figure>
+Data Skipping leverages the Metadata Table's Column Stats Index, which holds column-level statistics (such as min-value, max-value, count of null-values in the column, etc)
+for every file of the Hudi table. For every incoming query, instead of enumerating every file in the table and reading its corresponding metadata
+(for ex, Parquet footers) to analyze whether it could contain any data matching the query filters, Hudi can simply query the Column Stats Index
+in the Metadata Table (which in turn is a Hudi table itself) and within seconds (even for TB-scale tables with 10s of thousands of files) obtain the list
+of _all the files that might potentially contain the data_ matching the query's filters, with the crucial property that files that can be ruled out as not containing such data
+(based on their column-level statistics) are stripped out.
 
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using Physical Partitioning, where records in the dataset are partitioned on disk
+into a folder structure based on some column's value or its derivative (clumping records together based on some intrinsic measure), but instead
+of an on-disk folder structure, Data Skipping leverages an index maintaining a mapping "file &rarr; columns' statistics" for all of the columns persisted
+within that file.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_spark.png").default} alt="hudi_query_perf_spark.png"  />
-</figure>
+For very large tables (1TB+, 10s of 1000s of files), Data Skipping could 
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** as compared to the same query on the same dataset but w/o Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for ex, AWS limits the # of requests / sec that can be issued based on the object's prefix, which considerably complicates things for partitioned tables)  
 
-**Presto**
+If you're interested in learning more about how Data Skipping works internally, please watch out for a blog post coming out on this soon!  
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_presto.png").default} alt="hudi_query_perf_presto.png"  />
-</figure>
+To unlock the power of Data Skipping you will need to
+
+1. Enable Metadata Table along with Column Stats Index on the _write path_ (TODO(alexey) add ref to async indexer)
+2. Enable Data Skipping in your queries
+
+To enable Metadata Table along with Column Stats Index on the write path, make sure 
+the following properties are set to true:
+  - `hoodie.metadata.enable` (to enable Metadata Table on the write path, enabled by default)
+  - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index being populated on the write path, disabled by default)
+
+TODO(alexey) add ref to async indexer docs

Review Comment:
   Please resolve all TODOs before landing, without fail.
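
For readers of the diff above, the two write-path properties it lists could be assembled roughly as in the following sketch. The property names come from the diff; the helper name `column_stats_write_options` and the dependency check are hypothetical and only illustrate that the Column Stats Index is described there as part of the Metadata Table.

```python
def column_stats_write_options(enable_metadata: bool = True,
                               enable_column_stats: bool = True) -> dict:
    """Build the write-path options named in the diff (hypothetical helper).

    Option values are rendered as the strings "true"/"false".
    """
    if enable_column_stats and not enable_metadata:
        # Assumption drawn from the diff text: the Column Stats Index is kept
        # inside the Metadata Table, so it needs the Metadata Table enabled.
        raise ValueError("Column Stats Index requires the Metadata Table")
    return {
        # Described in the diff as enabled by default.
        "hoodie.metadata.enable": str(enable_metadata).lower(),
        # Described in the diff as disabled by default.
        "hoodie.metadata.index.column.stats.enable": str(enable_column_stats).lower(),
    }


opts = column_stats_write_options()
print(opts)
```

With PySpark, a dict like this would typically be handed to the writer, e.g. `df.write.format("hudi").options(**opts)`; that call is not part of the diff and is mentioned only as an assumption about how the options are supplied.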



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
