Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

via GitHub Fri, 04 Jul 2025 22:17:01 -0700


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186696545



##########
content/blog/datafusion-custom-parquet-index.md:
##########
@@ -0,0 +1,251 @@
+## Extending Parquet with Embedded Indexes and Accelerating Query Processing 
with DataFusion
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything "smarter" requires inventing 
a whole new file format. In fact, Parquet's column‑oriented design, with its 
well‑defined footer metadata and reserved byte regions, already provides the 
flexibility to embed arbitrary indexing structures without breaking 
compatibility. 
+
+In this post, we'll first review the core concepts of the Apache Parquet file 
format, then explain how to store custom indexes inside Parquet files, and 
finally show how Apache DataFusion can leverage a **compact distinct‑value 
index** to achieve ultra‑fast file‑level pruning, all while preserving complete 
interchangeability with other tools.
+
+Even without a custom index, simply rewriting a Parquet file can improve 
performance substantially. 
+For example, rewriting the ClickBench partitioned dataset with better settings* 
(without re-sorting) improves
+performance by more than 2x for many queries, so we can expect even more 
improvement with a custom index.
+More details: [Blog post about parquet vs custom file formats 
#16149](https://github.com/apache/datafusion/issues/16149). 
[JigaoLuo](https://github.com/JigaoLuo) and 
[XiangpengHao](https://github.com/XiangpengHao) have been exploring these 
Parquet‑rewriting techniques in the liquid‑cache repo, which uses 
DataFusion; see 
[XiangpengHao/liquid‑cache#227](https://github.com/XiangpengHao/liquid-cache/issues/227)
 for more insights.
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** Requires **arrow‑rs v55.2.0** or later, which includes the 
new “buffered write” API 
([apache/arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)).  
+> This API keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
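The first three steps above can be sketched with the standard library alone. This is a minimal illustration and is not taken from the PR: the `IDX1` magic, the little-endian length prefix, the newline-separated payload, and the function names `write_distinct_index`, `read_distinct_index`, and `file_may_match` are all assumptions made for this example. In a real Parquet file the returned offset would be stored as a key/value pair in the footer metadata rather than kept in a variable.

```rust
use std::collections::BTreeSet;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical magic bytes marking the start of the index blob.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";

/// Append the distinct values at the writer's current position and return
/// the offset where the blob starts. Assumed layout for this sketch:
/// magic (4 bytes) | payload length (u64, little-endian) | newline-separated values.
/// Values that themselves contain '\n' are not supported by this layout.
fn write_distinct_index<W: Write + Seek>(
    out: &mut W,
    values: &BTreeSet<String>,
) -> io::Result<u64> {
    let offset = out.stream_position()?;
    let payload = values
        .iter()
        .map(String::as_str)
        .collect::<Vec<_>>()
        .join("\n");
    out.write_all(INDEX_MAGIC)?;
    out.write_all(&(payload.len() as u64).to_le_bytes())?;
    out.write_all(payload.as_bytes())?;
    Ok(offset)
}

/// Seek to `offset`, validate the magic, and decode the distinct values.
fn read_distinct_index<R: Read + Seek>(
    input: &mut R,
    offset: u64,
) -> io::Result<BTreeSet<String>> {
    input.seek(SeekFrom::Start(offset))?;
    let mut magic = [0u8; 4];
    input.read_exact(&mut magic)?;
    if &magic != INDEX_MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad index magic"));
    }
    let mut len = [0u8; 8];
    input.read_exact(&mut len)?;
    let mut payload = vec![0u8; u64::from_le_bytes(len) as usize];
    input.read_exact(&mut payload)?;
    let text = String::from_utf8(payload)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.lines().map(str::to_owned).collect())
}

/// File-level pruning check for an equality predicate such as
/// `category = 'foo'`: the file can be skipped when this returns false.
fn file_may_match(index: &BTreeSet<String>, wanted: &str) -> bool {
    index.contains(wanted)
}
```

A caller would position the writer at the end of the data pages, call `write_distinct_index`, and keep the returned offset; at query time `read_distinct_index` followed by `file_may_match` decides whether the file needs to be opened at all. Readers that do not know about the blob never seek to it, which is why the extra bytes do not affect them.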
+---
+
+## 1. Parquet 101: File Anatomy & Native Pruning Hooks
+TODO add image here?

Review Comment:
   @alamb  I tried to add the image, but it does not render correctly in my 
local preview and I am not sure why, so I left a TODO here for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

