alexeykudinkin commented on code in PR #5440: URL: https://github.com/apache/hudi/pull/5440#discussion_r863063103
##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
-### Snapshot Queries
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
-**Hive**
+#### Data Skipping
+
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages files metadata to very effectively prune the search space, by
+avoiding reading (even footers of) the files that are known (based on the metadata) to only contain the data that _does not match_ the query's filters.
-<figure>
-  <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png" />
-</figure>
+Data Skipping is leveraging Metadata Table's Column Stats Index bearing column-level statistics (such as min-value, max-value, count of null-values in the column, etc)
+for every file of the Hudi table. This then allows Hudi for every incoming query instead of enumerating every file in the table and reading its corresponding metadata
+(for ex, Parquet footers) for analysis whether it could contain any data matching the query filters, to simply do a query against a Column Stats Index
+in the Metadata Table (which in turn is a Hudi table itself) and within seconds (even for TBs scale tables, with 10s of thousands of files) obtain the list
+of _all the files that might potentially contain the data_ matching query's filters with crucial property that files that could be ruled out as not containing such data
+(based on their column-level statistics) will be stripped out.
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using Physical Partitioning where records in the dataset are partitioned on disk

Review Comment:
   Good call
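
Illustration (not part of the PR or of the reviewer's comment): the quoted hunk describes pruning files by consulting per-file column statistics such as min-value and max-value. A minimal Python sketch of that idea follows. The `ColumnStats` class, the `may_contain` and `candidate_files` helpers, and the file names are all hypothetical stand-ins chosen for this example; they are not Hudi's Column Stats Index schema or API, and the sketch handles only a single integer column with a closed range filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file statistics for one column (not Hudi's schema)."""
    file_name: str
    min_value: int
    max_value: int


def may_contain(stats: ColumnStats, lower: int, upper: int) -> bool:
    """True unless the file's [min, max] range cannot overlap [lower, upper]."""
    return stats.max_value >= lower and stats.min_value <= upper


def candidate_files(index: list[ColumnStats], lower: int, upper: int) -> list[str]:
    """Keep only files that might hold rows with lower <= value <= upper."""
    return [s.file_name for s in index if may_contain(s, lower, upper)]


# Toy stand-in for a column stats index covering three files.
index = [
    ColumnStats("file_a.parquet", min_value=0, max_value=99),
    ColumnStats("file_b.parquet", min_value=100, max_value=199),
    ColumnStats("file_c.parquet", min_value=150, max_value=300),
]

# Filter: value BETWEEN 120 AND 160. file_a is ruled out by its max of 99.
print(candidate_files(index, 120, 160))
```

The check is conservative in the same sense as the quoted text: a file is dropped only when its statistics prove it cannot match, so the surviving list may still include files that turn out to contain no matching rows.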