[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5440: [HUDI-3930][Docs] Adding documentation around Data Skipping

GitBox Mon, 02 May 2022 11:02:53 -0700


alexeykudinkin commented on code in PR #5440:
URL: https://github.com/apache/hudi/pull/5440#discussion_r863063103



##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
 
-### Snapshot Queries
 
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
 
-**Hive**
+#### Data Skipping
+ 
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages files metadata to very effectively prune the search space, by
+avoiding reading (even footers of) the files that are known (based on the metadata) to only contain the data that _does not match_ the query's filters.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png"  />
-</figure>
+Data Skipping is leveraging Metadata Table's Column Stats Index bearing column-level statistics (such as min-value, max-value, count of null-values in the column, etc)
+for every file of the Hudi table. This then allows Hudi for every incoming query instead of enumerating every file in the table and reading its corresponding metadata
+(for ex, Parquet footers) for analysis whether it could contain any data matching the query filters, to simply do a query against a Column Stats Index
+in the Metadata Table (which in turn is a Hudi table itself) and within seconds (even for TBs scale tables, with 10s of thousands of files) obtain the list
+of _all the files that might potentially contain the data_ matching query's filters with crucial property that files that could be ruled out as not containing such data
+(based on their column-level statistics) will be stripped out.
 
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using Physical Partitioning where records in the dataset are partitioned on disk

Review Comment:
   Good call
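
For readers following the quoted diff, the pruning idea it describes can be illustrated with a minimal sketch. This is not Hudi's implementation or API: `ColumnStats` and `files_for_range` are hypothetical names, and the per-file statistics are made-up values standing in for a Column Stats Index lookup. A file is kept only if its [min, max] range for the filtered column overlaps the query's range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file statistics for a single column."""
    file_name: str
    min_value: int
    max_value: int
    null_count: int


def files_for_range(stats, lower, upper):
    """Return names of files that might hold rows with lower <= value <= upper.

    A file is ruled out only when its [min_value, max_value] range cannot
    overlap the requested range, so no matching file is ever skipped.
    """
    return [
        s.file_name
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Made-up statistics, standing in for an index lookup.
stats = [
    ColumnStats("file_a.parquet", 0, 99, 0),
    ColumnStats("file_b.parquet", 100, 199, 3),
    ColumnStats("file_c.parquet", 150, 400, 0),
]

# Filter: value BETWEEN 120 AND 160
candidates = files_for_range(stats, 120, 160)
```

Here `file_a.parquet` is skipped without being opened, since its maximum value (99) is below the lower bound of the filter. The remaining files may still contain no matching rows; the statistics only rule files out, they never confirm a match.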



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

