zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186552221
########## content/blog/datafusion-custom-parquet-index.md: ########## @@ -0,0 +1,232 @@ +## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes + +It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning. + +And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves +performance by more than 2x for many queries. So with a custom index, we can expect even more improvement. +More details: [Blog post about parquet vs custom file formats #16149 +](https://github.com/apache/datafusion/issues/16149) + +Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll: + +1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom filters). +2. Introduce a simple on‑page binary format for a distinct‑value index. +3. Show how to append that index inline, record its offset in the footer, and have DataFusion consume it at query time. +4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from + [`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs). + +> **Prerequisite:** this example requires the new “buffered write” API in +> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714), +> which keeps the internal byte count in sync so you can append index bytes immediately after data pages. + +--- + +## Introduction + +Parquet is a popular columnar format tuned for high‑performance analytics: column pruning, predicate pushdown, page indices and Bloom filters all help reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), engines often still scan entire row groups or files that contain zero matches. + +Many systems solve this by producing *external* index files—Bloom filters, inverted lists, or custom sketches—alongside Parquet. But juggling separate index files adds operational overhead and risks out‑of‑sync data. Worse, some have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)). + +**But Parquet itself is extensible**: it tolerates unknown bytes after data pages and arbitrary key/value pairs in its footer. We can exploit those hooks to **embed** a small, per‑file distinct‑value index directly in the file—no extra files, no format forks, and no compatibility breakage. + +In the rest of this post, we’ll: + +1. Walk through the simple binary layout for a distinct‑value list. +2. Show how to write it inline after the normal Parquet pages. +3. Record its offset in the footer’s metadata map. +4. Extend DataFusion’s `TableProvider` to discover and use that index for file‑level pruning. +5. Verify everything still works in DuckDB via `read_parquet()`. + +--- + +## Background + +Several examples in the DataFusion repository illustrate the benefits of using external indexes for pruning: + +* [`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs) +* [`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs) + +Those demos work by building separate index files (Bloom filters, maps of distinct values) and associating them with Parquet files. While effective, this approach: + +* **Increases operational complexity:** Two files per dataset to track. +* **Risks synchronization issues:** Removing or renaming one file breaks the index. +* **Reduces portability:** Harder to share or move Parquet data when the index is external. + +Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully: Review Comment: Good suggestion @alamb ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org