Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

via GitHub Tue, 08 Jul 2025 17:42:02 -0700


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193678166



##########
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##########
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself<sup>[1](#footnote1)</sup>.
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).

Review Comment:
   That is a crazy list -- I am not sure how to add it to this post without 
overwhelming the narrative though 🤔 



Review Comment:
   Thank you @comphead. That is a crazy list -- I am not sure how to add 
it to this post without overwhelming the narrative though 🤔 



