Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-14 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3071771783

   > @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would 
it be okay if I make a copy of your post and include my affiliation (Systems 
Group @ TU Darmstadt)?
   
   Yes, of course, feel free to do it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-14 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3070821888

   > @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would 
it be okay if I make a copy of your post and include my affiliation (Systems 
Group @ TU Darmstadt)?
   
   Yes of course. 
   
   Perhaps you could make a PR to update the post itself. To do so, you could 
modify 
https://github.com/apache/datafusion-site/blob/main/content/blog/2025-03-20-parquet-pruning.md
   
   We could also add an "About the Authors" section to the post itself. For 
example, the "About the Authors" section in 
https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one/
 comes from 
https://github.com/apache/datafusion-site/blob/61aa76e60324ac0d51ed19b7d8f0346624dcc5d4/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md?plain=1#L217





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-14 Thread via GitHub


JigaoLuo commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3070337688

   @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it 
be okay if I make a copy of your post and include my affiliation (Systems Group 
@ TU Darmstadt)?





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-14 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3069604513

   Thanks again everyone -- now it is time to make some noise on social media.
   
   The blog is published here: 
https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
   
   Thanks again @JigaoLuo and @zhuqi-lucas  -- I think this post will become an 
important part of the parquet conversation





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-14 Thread via GitHub


alamb merged PR #79:
URL: https://github.com/apache/datafusion-site/pull/79





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-09 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3052718193

   > > I made some changes based on the latest comments from folks.
   > > FYI @alamb, please correct me if I made any wrong changes, thanks a 
lot!
   > 
   > Thank you -- it is looking great. I spent some more time obsessing over 
the wording (probably unnecessarily), but I am so stoked about this post I 
can't really help myself
   
   Thank you @alamb , it looks great!





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3050686286

   > I made some changes based on the latest comments from folks.
   > 
   > FYI @alamb, please correct me if I made any wrong changes, thanks a lot!
   
   Thank you -- it is looking great. I spent some more time obsessing over the 
wording (probably unnecessarily), but I am pretty stoked about this 
post.





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193678166



##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).

Review Comment:
   Thank you @comphead. That is a crazy list -- I am not sure how to add 
it to this post without overwhelming the narrative, though 🤔 




Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193677635


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
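
The two ingredients named above can be sketched with a toy container. This is not the real Parquet encoding: the footer here is JSON rather than Thrift, and the magic bytes `TOY1`, the key names `my_index_offset` and `my_index_length`, and all function names are made up for illustration. The point is only that a reader following the footer's data offsets never touches the extra bytes, while an index-aware reader finds them through the key/value entries.

```python
import json
import struct

MAGIC = b"TOY1"  # stand-in for Parquet's magic bytes


def write_file(data: bytes, index: bytes) -> bytes:
    """Lay out: MAGIC | data | index | footer | footer length (u32 LE) | MAGIC."""
    index_offset = len(MAGIC) + len(data)
    footer = json.dumps({
        "data": {"offset": len(MAGIC), "length": len(data)},
        # Hypothetical key names; only index-aware readers look at them.
        "key_value": {
            "my_index_offset": index_offset,
            "my_index_length": len(index),
        },
    }).encode()
    return MAGIC + data + index + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(buf: bytes) -> dict:
    """Locate the footer from the trailing length field and decode it."""
    if buf[-4:] != MAGIC:
        raise ValueError("bad trailing magic")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return json.loads(buf[-8 - footer_len:-8])


def read_data(buf: bytes) -> bytes:
    """A reader unaware of the index: follows only the data offset."""
    d = read_footer(buf)["data"]
    return buf[d["offset"]:d["offset"] + d["length"]]


def read_index(buf: bytes):
    """An index-aware reader: returns None when the keys are absent."""
    kv = read_footer(buf).get("key_value", {})
    if "my_index_offset" not in kv:
        return None
    start = kv["my_index_offset"]
    return buf[start:start + kv["my_index_length"]]


buf = write_file(b"row group bytes", b"Brazil,Singapore")
print(read_data(buf))
print(read_index(buf))
```
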
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
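
As a rough illustration of the physical layout, an unencrypted Parquet file starts and ends with the magic bytes `PAR1`, and the four bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below locates the footer bytes in an in-memory buffer. The function name `footer_range` and the synthetic buffer are assumptions for this example, and decoding the Thrift footer itself is left to a real Parquet library.

```python
import struct


def footer_range(buf: bytes):
    """Return (start, end) byte offsets of the Thrift footer within buf."""
    if len(buf) < 12 or buf[:4] != b"PAR1" or buf[-4:] != b"PAR1":
        raise ValueError("not an unencrypted Parquet file")
    # The 4 bytes before the trailing magic are the footer length (u32 LE).
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    end = len(buf) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic buffer with the same framing; the "footer" is a placeholder,
# not valid Thrift.
fake = b"PAR1" + b"x" * 10 + b"FOOTER" + struct.pack("<I", 6) + b"PAR1"
start, end = footer_range(fake)
print(fake[start:end])
```
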
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metadata. T

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193539118


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,578 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+**Example scenario:**  
+Imagine your data is partitioned by a `Nation` column (dozens of distinct 
values) across thousands of Parquet files. You execute:
+
+```sql
+  SELECT AVG(sales_amount)
+  FROM sales
+  WHERE nation = 'Singapore'
+  GROUP BY year;
+```
+
+Relying on min/max statistics alone isn’t very selective when a file’s Nation 
range spans “Argentina” through “Zimbabwe,” and Bloom filters still incur 
nontrivial I/O to load per file. Instead, you can store—in each file’s footer 
metadata—a compact list of every distinct nation value present. At query time, 
your engine reads just that tiny list to determine which files cannot contain 
'Singapore' and skips them entirely. This yields dramatically better 
file‑pruning performance, all while preserving full compatibility with standard 
Parquet readers.
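
A minimal sketch of that pruning step, assuming each file's footer key/value metadata has already been read into a dictionary. The key name `distinct_nations`, the JSON encoding of the list, the file names, and the function `files_to_scan` are all hypothetical; a real engine would read the metadata through its Parquet library.

```python
import json


def files_to_scan(footers, wanted):
    """Return the files whose distinct-value list contains `wanted`.

    Files without the custom key are kept, because a missing index says
    nothing about the file's contents.
    """
    keep = []
    for path, key_value in footers.items():
        encoded = key_value.get("distinct_nations")
        if encoded is None or wanted in json.loads(encoded):
            keep.append(path)
    return keep


# Stand-ins for the key/value metadata of three files.
footers = {
    # Min/max would be Argentina..Zimbabwe, so statistics cannot prune it.
    "sales_0.parquet": {"distinct_nations": json.dumps(["Argentina", "Zimbabwe"])},
    "sales_1.parquet": {"distinct_nations": json.dumps(["Brazil", "Singapore"])},
    "sales_2.parquet": {},  # written without the index
}

print(files_to_scan(footers, "Singapore"))
```

Keeping files that lack the key is the conservative choice: the index may only rule files out, never change query results.
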
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. Apache DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


kevinjqliu commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193162144


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,578 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+**Example scenario:**  
+Imagine your data is partitioned by a `Nation` column (dozens of distinct 
values) across thousands of Parquet files. You execute:
+
+```sql
+  SELECT AVG(sales_amount)
+  FROM sales
+  WHERE nation = 'Singapore'
+  GROUP BY year;
+```
+
+Relying on min/max statistics alone isn’t very selective when a file’s Nation 
range spans “Argentina” through “Zimbabwe,” and Bloom filters still incur 
nontrivial I/O to load per file. Instead, you can store—in each file’s footer 
metadata—a compact list of every distinct nation value present. At query time, 
your engine reads just that tiny list to determine which files cannot contain 
'Singapore' and skips them entirely. This yields dramatically better 
file‑pruning performance, all while preserving full compatibility with standard 
Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. Apache DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page I

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


kevinjqliu commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3049874530

   I can render it locally. Also, #86 should make local dev easier.





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub


comphead commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3049815105

   I would appreciate it if anyone could tell me whether it is possible to read 
the blog draft rendered with formatting.





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3047508490

   I made some changes based on the latest comments from folks.
   
   FYI @alamb, please correct me if I made any wrong changes, thanks a lot!





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191553716


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+

Review Comment:
   Thank you @2010YOUY01 for the good suggestion. I addressed this comment in 
the latest PR!
   
   






Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191570344


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metad

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191564872


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metad

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191553716


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+

Review Comment:
   Thank you @2010YOUY01 for the good suggestion; I addressed this comment in the 
latest PR!
   
   FYI @alamb, please correct me if I made any wrong changes, thanks!









Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191473281


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).

Review Comment:
   Thank you @comphead !






Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


2010YOUY01 commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191443341


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+

Review Comment:
   I think adding a concrete example here—specifically about the custom DV 
index code example featured in this blog—can help keep readers engaged.
   
   
   
   Example scenario:
   
   Suppose you have a dataset roughly partitioned by the `Nation` column, which 
has a cardinality of several dozen, and the dataset has thousands of partitioned files.
   We have an analytical query with a selective predicate on the `Nation` column: 
   ```sql
   SELECT AVG(sales_amount)
   FROM sales
   WHERE nation = 'Singapore'
   GROUP BY year;
   ```
   
   Ideally, you'd like to skip most of those files entirely, but Parquet's 
built-in min/max statistics might not help when each file covers a wide range 
of values on the predicate column, and Bloom filters can still incur 
substantial overhead.
   
   In this post, we'll introduce a custom distinct-value index, with a code 
example, that lets you efficiently prune away irrelevant files.
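
A minimal sketch of the pruning idea in the scenario above, assuming each file records the set of distinct `Nation` values it contains. The `FileEntry` type, the helper functions, and the file names are all hypothetical; none of them come from DataFusion or from the blog's example code, and Python is used only to keep the illustration short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical per-file metadata a planner could consult before opening a file."""
    path: str
    min_nation: str              # what built-in min/max statistics would expose
    max_nation: str
    distinct_nations: frozenset  # what a distinct-value index would expose


def make_entry(path, nations):
    """Build the metadata for one file from the nation values it contains."""
    values = frozenset(nations)
    return FileEntry(path, min(values), max(values), values)


def prune_with_min_max(files, value):
    """Keep every file whose [min, max] range could contain `value`."""
    return [f.path for f in files if f.min_nation <= value <= f.max_nation]


def prune_with_distinct_index(files, value):
    """Keep only the files whose distinct-value set actually contains `value`."""
    return [f.path for f in files if value in f.distinct_nations]


# Each file holds only a few nations, but their min/max ranges are wide.
files = [
    make_entry("sales_0.parquet", ["Argentina", "Vietnam"]),
    make_entry("sales_1.parquet", ["Brazil", "Singapore", "Zambia"]),
    make_entry("sales_2.parquet", ["Canada", "Thailand"]),
]

kept_by_min_max = prune_with_min_max(files, "Singapore")
kept_by_index = prune_with_distinct_index(files, "Singapore")
```

With this made-up data, the min/max check cannot rule out any file because `'Singapore'` falls inside every file's range, while the distinct-value lookup keeps only the one file that contains it.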






Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


djanderson commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190897208


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metada

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190931575


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metadata. T

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


djanderson commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190893220


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metada

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


comphead commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3045780524

   Thanks @zhuqi-lucas @JigaoLuo @alamb  
   Added some possible minor improvements





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


comphead commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190517681


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types[2](#footnote2) of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+
+
+
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metadata

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


comphead commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190508032



Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


comphead commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190494148


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).

Review Comment:
   We can expand this section if needed, including some more examples like:

   | System / Project | Index Type | Description |
   | --- | --- | --- |
   | **Apache Iceberg** | Hidden partitioning, metadata tables, external indexes via integrations | Supports partition pruning and metadata filtering. External indexing possible via tools like Nessie or OpenMetadata. |
   | **Apache Hudi** | Bloom filter index, Column stats index, Metadata table index | Uses internal/external indexes, such as Bloom filters for key lookups and metadata table for faster file indexing. |
   | **Delta Lake** | Data skipping with min/max, Z-order indexing, custom via OSS | No native general indexing, but Z-ordering and external tools like Hyperspace enable indexing. |
   | **Microsoft Hyperspace** | Covering indexes, Z-order, sorted indexes | Spark-based library for building and maintaining secondary indexes on Parquet datasets. |
   | **ClickHouse (w/ Parquet)** | Skip indexes, minmax, bloom filter | Supports indexing on Parquet input via native skip indexes for faster query performance. |
   | **DuckDB** | Automatic statistics, zone maps, experimental indexing | Maintains internal stats and supports some persistent indexing for Parquet reads. |
   | **Dremio** | Reflections (materializations), internal column stats | Builds external materialized views (Reflections) over Parquet for acceleration. |
   | **Lucene / Elasticsearch / OpenSearch** | Inverted index, range, spatial | External systems that can index Parquet content or extracted metadata for fast search. |
   | **Varada (Starburst)** | Bitmaps, adaptive indexes on Presto over Parquet | Built adaptive indexes for Parquet datasets to accelerate selective queries. |
   | **Starburst Galaxy / Trino** | Connector-level support, custom index cache (roadmap) | Some support via caching and pruning; external indexing under active development. |
   | **LakeSoul** | Z-order, data skipping | Supports column-aware skipping and optional Z-ordering for efficient reads. |
   





Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-07 Thread via GitHub


comphead commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190485839


##
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).

Review Comment:
   ```suggestion
   Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. Apache DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself[1](#footnote1).
   ```


