zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186543546


##########
content/blog/datafusion-custom-parquet-index.md:
##########
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can 
have good improvement also. For example, rewriting ClickBench partitioned 
dataset with better settings* (not resorting) improves
+performance by more than 2x for many queries. So with a custom index, we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in

Review Comment:
   Good point @alamb !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to