Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-06 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041532117

   I pushed some non-trivial changes to this blog: 
   1. Added @JigaoLuo  as an author (hope this is ok @zhuqi-lucas )
   2. Added a section with a high level overview of adding user defined indexes
   3. Focused the example section on reading/writing the index and integrating 
it into DataFusion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-06 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041511194

   Thank you -- I have spent a while this morning adding additional content -- 
I will push an update soon





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-06 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041421377

   > zhuqi-lucas#1
   
   Thank you @JigaoLuo , merged your changes!
   
   
   
   > Two small nitpicks I came across today:
   > 
   > * "Footer" vs. "Metadata" ?: Apologies for being pedantic, but I think 
we’re consistently referring to metadata here, not just the footer. Xiangpeng 
also corrected me on this elsewhere:
   > 
   > > footer often refers to the last 8 byte of Parquet file
   > 
   > * One small question to consider: does support for user-defined 
indexes depend on a specific version of Parquet? If so, it might be helpful to 
add a brief note about that. I’m not sure of the answer myself, but it could be 
worth clarifying.
   
   
   
   You’re absolutely right: what we're describing is the file‑level metadata 
(the key_value_metadata in the FileMetaData Thrift struct), not just the last 8 
bytes of the file. In Parquet parlance, “footer” often refers to the file 
trailer (the 4-byte metadata length followed by the `PAR1` magic), whereas 
“metadata” covers everything in the FileMetaData block (including all custom 
key‑value pairs). We should consistently say “metadata” throughout to avoid 
that confusion.
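
To make the distinction concrete, here is a minimal std-only Rust sketch (not taken from the blog or from arrow-rs) that locates the FileMetaData block from the 8-byte trailer. It relies on the Parquet layout in which a file ends with a 4-byte little-endian metadata length followed by the magic `PAR1`; the function name `metadata_range` is hypothetical.

```rust
/// Returns the byte range of the serialized FileMetaData inside `file`,
/// or `None` if the 8-byte trailer is missing or inconsistent.
/// (`metadata_range` is an illustrative name, not a parquet crate API.)
fn metadata_range(file: &[u8]) -> Option<std::ops::Range<usize>> {
    // The trailer is the last 8 bytes: 4-byte LE length + magic "PAR1".
    let trailer_start = file.len().checked_sub(8)?;
    if !file.ends_with(b"PAR1") {
        return None;
    }
    let t = &file[trailer_start..];
    let len = u32::from_le_bytes([t[0], t[1], t[2], t[3]]) as usize;
    // The metadata block sits immediately before the trailer.
    let metadata_start = trailer_start.checked_sub(len)?;
    Some(metadata_start..trailer_start)
}
```

The returned range covers the whole Thrift-encoded metadata block, which is where the key/value pairs live; the trailer itself only says how long that block is.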
   
   
   Custom (user‑defined) metadata via the FileMetaData.key_value_metadata map 
has been part of the Parquet format since its earliest releases, so it does not 
depend on a specific Parquet version. Any reader/writer that implements basic 
Parquet will preserve arbitrary file metadata fields.
   
   However, our arrow-rs dependency should be >= 55.2.0 so that the writer's 
internal byte count stays consistent.
   
   > **Prerequisite:** Requires **arrow‑rs v55.2.0** or later, which includes 
the new “buffered write” API 
([apache/arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)).  
   > This API keeps the internal byte count in sync so you can append index 
bytes immediately after data pages. 
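
Why the byte count matters can be sketched with a small std-only Rust example. This is an illustration of the idea only: it does not use or reproduce the arrow-rs API from the PR above, and `CountingWriter` and `append_index` are hypothetical names. The wrapper counts every byte written, so the offset at which the index starts is known and could later be recorded in the file metadata.

```rust
use std::io::{self, Write};

/// Wraps any writer and counts the bytes that pass through it.
/// (Hypothetical helper for illustration; not an arrow-rs type.)
struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Count only the bytes the inner writer actually accepted.
        self.bytes_written += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Appends `index` after everything written so far and returns the offset
/// at which it starts.
fn append_index<W: Write>(out: &mut CountingWriter<W>, index: &[u8]) -> io::Result<u64> {
    let offset = out.bytes_written;
    out.write_all(index)?;
    Ok(offset)
}
```

If some bytes bypassed the counter, the recorded offset would point at the wrong place, which is the kind of inconsistency the prerequisite is meant to rule out.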
   
   





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-06 Thread via GitHub


JigaoLuo commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041396574

   And the outlook section I did is in this PR: 
https://github.com/zhuqi-lucas/datafusion-site/pull/1





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-06 Thread via GitHub


JigaoLuo commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041395147

   Two small nitpicks I came across today:
   - "Footer" vs. "Metadata" ?: Apologies for being pedantic, but I think we’re 
consistently referring to metadata here, not just the footer. Xiangpeng also 
corrected me on this elsewhere: 
   > footer often refers to the last 8 byte of Parquet file
   
   - One small question to consider: does support for user-defined indexes 
depend on a specific version of Parquet? If so, it might be helpful to add a 
brief note about that. I’m not sure of the answer myself, but it could be worth 
clarifying.





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040695911

   > I just pushed a commit that reworked the intro a bit and started filling 
out the background
   > 
   > https://github.com/user-attachments/assets/efe36816-7fed-44d7-9158-1b2fc19ffb19
   > https://github.com/user-attachments/assets/07e36677-46a6-4f2e-8b8a-c5eab9545167
   > 
   > @JigaoLuo the outlook section you describe sounds great. I envision it 
right after the
   > 
   > ```
   > ## 1. Parquet 101: File Anatomy & Standard Index Structures
   > ```
   > 
   > Section
   > 
   > Perhaps like
   > 
   > ```
   > ## 2. Extending Parquet with Special Indexes
   > ```
   > 
   > (this is where figure 2 goes and where we will explain how to embed a 
custom index). So it makes a lot of sense to mention here the potential 
use cases (and that the index can be written after each row group or at the end 
of the file, and it can have information for each row group, individual row 
groups, columns, etc., whatever you want).
   > 
   > I would also be interested to hear what @zhuqi-lucas thinks
   
   Amazing work thank you @alamb.
   
   
   
   > Regarding my impression during reading: **"the Embedded Index is just a 
hashset to speed up scans, which adds overhead to Parquet."** as mentioned as a 
follow-up here: [#79 
(comment)](https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247)
   > 
   > If other readers also have the same impression, it might unintentionally 
limit how readers perceive its potential of the Embedded Index. To address 
this, we could consider adding **a short Outlook section** (either at the 
beginning or the end of the blog) to explicitly highlight what the Embedded 
Index is capable of. It’s not just a hashset for pruning; in principle, it 
could support a wide range of use cases. Use cases are also discussed here: 
[apache/datafusion#16374 
(comment)](https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047)
   > 
   > I’d be happy to help draft such an Outlook section, pending confirmation 
from your side.
   
   Looks great @JigaoLuo ! Feel free to add it, thank you!





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245879

   I need to run now to attend to some family matters. I'll be back tomorrow





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245413

   I just pushed a commit that reworked the intro a bit and started filling out 
the background
   
   https://github.com/user-attachments/assets/efe36816-7fed-44d7-9158-1b2fc19ffb19
   https://github.com/user-attachments/assets/07e36677-46a6-4f2e-8b8a-c5eab9545167
   
   @JigaoLuo the outlook section you describe sounds great. I envision it right 
after the 
   ```
   ## 1. Parquet 101: File Anatomy & Standard Index Structures
   ```
   
   Section
   
   Perhaps like
   ```
   ## 2. Extending Parquet with Special Indexes
   ```
   (this is where figure 2 goes and where we will explain how to embed a custom 
index).
   So it makes a lot of sense to mention here the potential use cases (and that 
the index can be written after each row group or at the end of the file, and it 
can have information for each row group, individual row groups, columns, etc., 
whatever you want).
   
   I would also be interested to hear what @zhuqi-lucas thinks
   





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


JigaoLuo commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039990579

   This also ties into my initial impression that “the Embedded Index is just a 
hashset to speed up scans, which adds overhead to Parquet." as mentioned as a 
follow-up here: 
https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247 
   
I think this framing might unintentionally limit how readers perceive its 
potential. To address this, we could consider adding **an Outlook section** 
(either at the beginning or the end of the blog) to explicitly highlight what 
the Embedded Index is capable of. It’s not just a hashset for pruning; in 
principle, it could support a wide range of use cases. Use cases are also 
discussed here: 
https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047
   
   I’d be happy to help draft such an Outlook section, pending confirmation 
from your side.





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039820680

   Thanks -- I am going to spend an hour or so taking a pass through this blog 
trying to get the formatting to work out
   
   So exciting





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039224050

   > Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great 
overall. I just have one very small nitpick above.
   > 
   > Regarding the content: One suggestion would be to include a reference to 
the fact that there has been some criticism and attempts to incorporate 
HyperLogLog into Parquet, as mentioned here [apache/datafusion#16374 
(comment)](https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391).
   > 
   > (I also have a few personal questions about the new index itself, but I'll 
post them on the Issue page instead of here.)
   
   Thank you @JigaoLuo, good point. I am working on some urgent bug fixes and 
will try to add your suggestions soon. Thanks!





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039222981

   > Thanks @zhuqi-lucas -- I will keep looking at this later today
   
   Thank you @alamb !





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


JigaoLuo commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039088620

   Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great 
overall. I just have one very small nitpick above.
   
   Regarding the content: One suggestion would be to include a reference to the 
fact that there has been some criticism and attempts to incorporate HyperLogLog 
into Parquet, as mentioned here 
https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391.
   
   (I also have a few personal questions about the new index itself, but I'll 
post them on the Issue page instead of here.)





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039045610

   Thanks @zhuqi-lucas  -- I will keep looking at this later today





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3038122113

   Thank you @alamb! I addressed the first round of comments, but the image has 
still not been added to the content because it does not render well in my local 
preview.





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186696545


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,251 @@
+## Extending Parquet with Embedded Indexes and Accelerating Query Processing 
with DataFusion
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything "smarter" requires inventing 
a whole new file format. In fact, Parquet's column‑oriented design, with its 
well‑defined footer metadata and reserved byte regions, already provides the 
flexibility to embed arbitrary indexing structures without breaking 
compatibility. 
+
+In this post, we'll first review the core concepts of the Apache Parquet file 
format, then explain how to store custom indexes inside Parquet files, and 
finally show how Apache DataFusion can leverage a **compact distinct‑value 
index** to achieve ultra‑fast file‑level pruning, all while preserving complete 
interchangeability with other tools.
+
+Beyond custom indexes, simply rewriting a Parquet file can also yield a good 
improvement. 
+For example, rewriting the ClickBench partitioned dataset with better settings 
(without re-sorting) improves
+performance by more than 2x for many queries, so with a custom index we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149). 
[JigaoLuo](https://github.com/JigaoLuo) and 
[XiangpengHao](https://github.com/XiangpengHao) have been exploring these 
Parquet‑rewriting techniques in the liquid‑cache repo, which uses 
DataFusion; see 
[XiangpengHao/liquid‑cache#227](https://github.com/XiangpengHao/liquid-cache/issues/227)
 for more insights.
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** Requires **arrow‑rs v55.2.0** or later, which includes the 
new “buffered write” API 
([apache/arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)).  
+> This API keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
+---
+
+## 1. Parquet 101: File Anatomy & Native Pruning Hooks
+TODO add image here?

Review Comment:
   @alamb I tried to add the image, but it does not render well in my local 
preview and I am not sure why, so I added a TODO here...






Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186669292


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+Beyond custom indexes, simply rewriting a Parquet file can also yield a good 
improvement. For example, rewriting the ClickBench partitioned 
dataset with better settings (without re-sorting) improves
+performance by more than 2x for many queries, so with a custom index we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in
+> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714),
+> which keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
+---
+
+## Background
+
+Several examples in the DataFusion repository illustrate the benefits of using 
external indexes for pruning:
+
+* 
[`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs)
+* 
[`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs)
+
+Those demos work by building separate index files (Bloom filters, maps of 
distinct values) and associating them with Parquet files. While effective, this 
approach:
+
+* **Increases operational complexity:** Two files per dataset to track.
+* **Risks synchronization issues:** Removing or renaming one file breaks the 
index.
+* **Reduces portability:** Harder to share or move Parquet data when the index 
is external.
+
+Meanwhile, critics of Parquet’s extensibility point to the lack of a 
*standard* way to embed auxiliary data (see Amudai). But in practice, Parquet 
tolerates unknown content gracefully:
+
+* **Arbitrary metadata:** Key/value pairs in the footer are opaque to readers.
+* **Unused regions:** Bytes after data pages (before the Thrift footer) are 
ignored by standard readers.
+
+We’ll exploit both to embed our index inline.
+
+---
+
+## Motivation
+
+When scanning Parquet files, DataFusion (like other engines) reads row group 

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+Beyond custom indexes, simply rewriting a Parquet file can also yield a good 
improvement. For example, rewriting the ClickBench partitioned 
dataset with better settings (without re-sorting) improves
+performance by more than 2x for many queries, so with a custom index we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in
+> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714),
+> which keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
+---
+
+## Background
+
+Several examples in the DataFusion repository illustrate the benefits of using 
external indexes for pruning:
+
+* 
[`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs)
+* 
[`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs)
+
+Those demos work by building separate index files (Bloom filters, maps of 
distinct values) and associating them with Parquet files. While effective, this 
approach:
+
+* **Increases operational complexity:** Two files per dataset to track.
+* **Risks synchronization issues:** Removing or renaming one file breaks the 
index.
+* **Reduces portability:** Harder to share or move Parquet data when the index 
is external.
+
+Meanwhile, critics of Parquet’s extensibility point to the lack of a 
*standard* way to embed auxiliary data (see Amudai). But in practice, Parquet 
tolerates unknown content gracefully:
+
+* **Arbitrary metadata:** Key/value pairs in the footer are opaque to readers.
+* **Unused regions:** Bytes after data pages (before the Thrift footer) are 
ignored by standard readers.
+
+We’ll exploit both to embed our index inline.
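
A toy model of the two hooks above may help. The sketch below is not real Parquet: the real footer is Thrift-encoded, and `IDX1`, `distinct_index_offset`, and every function name here are hypothetical. It only mimics the real framing (`PAR1`, body, footer, 4-byte little-endian footer length, `PAR1`) to show that a reader which jumps straight to the footer never touches the index bytes, while an index-aware reader can locate them through a footer key.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # real Parquet files begin and end with these 4 bytes
INDEX_MAGIC = b"IDX1"    # hypothetical marker for the embedded index
INDEX_KEY = "distinct_index_offset"  # hypothetical footer metadata key


def serialize_index(values):
    """Assumed layout: magic, u32 count, then (u32 length, UTF-8 bytes) per value."""
    distinct = sorted(set(values))
    out = bytearray(INDEX_MAGIC)
    out += struct.pack("<I", len(distinct))
    for value in distinct:
        raw = value.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)


def build_file(data_pages, values):
    """Write magic + data pages, then the index, then a toy footer."""
    body = bytearray(PARQUET_MAGIC) + data_pages
    index_offset = len(body)  # the index starts right after the data pages
    body += serialize_index(values)
    # Toy footer: one "key=value" string standing in for the Thrift
    # footer's key/value metadata.
    footer = f"{INDEX_KEY}={index_offset}".encode("utf-8")
    body += footer + struct.pack("<I", len(footer)) + PARQUET_MAGIC
    return bytes(body)


def read_index_offset(buf):
    """Find the footer from the end of the file, as a Parquet reader does."""
    if buf[-4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing magic")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer = buf[-8 - footer_len:-8].decode("utf-8")
    key, _, value = footer.partition("=")
    if key != INDEX_KEY:
        raise KeyError(INDEX_KEY)
    return int(value)


def parse_index(buf, offset):
    """Decode the distinct values stored at `offset`."""
    if buf[offset:offset + 4] != INDEX_MAGIC:
        raise ValueError("no index at offset")
    pos = offset + 4
    (count,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    values = set()
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.add(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values


def may_contain(buf, value):
    """File-level pruning check: False means the file can be skipped."""
    return value in parse_index(buf, read_index_offset(buf))
```

For instance, `may_contain(build_file(b"\x00" * 16, ["foo", "bar"]), "baz")` is false, so a query on `category = 'baz'` could skip that file without reading any data pages.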
+
+---
+
+## Motivation
+
+When scanning Parquet files, DataFusion (like other engines) reads row group 

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186552221


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can 
have good improvement also. For example, rewriting ClickBench partitioned 
dataset with better settings* (not resorting) improves
+performance by more than 2x for many queries. So with a custom index, we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in
+> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714),
+> which keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
+---
+
+## Background
+
+Several examples in the DataFusion repository illustrate the benefits of using 
external indexes for pruning:
+
+* 
[`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs)
+* 
[`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs)
+
+Those demos work by building separate index files (Bloom filters, maps of 
distinct values) and associating them with Parquet files. While effective, this 
approach:
+
+* **Increases operational complexity:** Two files per dataset to track.
+* **Risks synchronization issues:** Removing or renaming one file breaks the 
index.
+* **Reduces portability:** Harder to share or move Parquet data when the index 
is external.
+
+Meanwhile, critics of Parquet’s extensibility point to the lack of a 
*standard* way to embed auxiliary data (see Amudai). But in practice, Parquet 
tolerates unknown content gracefully:

Review Comment:
   Good suggestion @alamb !




Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186543546


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can 
have good improvement also. For example, rewriting ClickBench partitioned 
dataset with better settings* (not resorting) improves
+performance by more than 2x for many queries. So with a custom index, we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in

Review Comment:
   Good point @alamb !






Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186527671


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can 
have good improvement also. For example, rewriting ClickBench partitioned 
dataset with better settings* (not resorting) improves

Review Comment:
   Good suggestion @alamb !






Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186508081


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.

Review Comment:
   Good suggestion @alamb !






Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3037856509

   > 
https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85
   
   Thank you @alamb for the review and the great suggestions! I will try to 
address them today. Feel free to edit this blog and correct me if I am missing 
anything, thanks!





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


alamb commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2185866801


##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max 
pruning and Bloom filters—and that adding anything “smarter” requires inventing 
a whole new file format. In fact, Parquet’s design already lets you embed 
custom indexing data *inside* the file (via unused footer metadata and byte 
regions) without breaking compatibility. In this post, we’ll show how 
DataFusion can leverage a **compact distinct‑value index** written directly 
into Parquet files—preserving complete interchangeability with other 
tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can 
have good improvement also. For example, rewriting ClickBench partitioned 
dataset with better settings* (not resorting) improves
+performance by more than 2x for many queries. So with a custom index, we can 
expect even more improvement.
+More details: [Blog post about parquet vs custom file formats #16149
+](https://github.com/apache/datafusion/issues/16149)
+
+Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with 
DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll:
+
+1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom 
filters).
+2. Introduce a simple on‑page binary format for a distinct‑value index.
+3. Show how to append that index inline, record its offset in the footer, and 
have DataFusion consume it at query time.
+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code 
from
+   
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in
+> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714),
+> which keeps the internal byte count in sync so you can append index bytes 
immediately after data pages.
+
+---
+
+## Introduction
+
+Parquet is a popular columnar format tuned for high‑performance analytics: 
column pruning, predicate pushdown, page indices and Bloom filters all help 
reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), 
engines often still scan entire row groups or files that contain zero matches.
+
+Many systems solve this by producing *external* index files—Bloom filters, 
inverted lists, or custom sketches—alongside Parquet. But juggling separate 
index files adds operational overhead and risks out‑of‑sync data. Worse, some 
have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai 
spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)).
+
+**But Parquet itself is extensible**: it tolerates unknown bytes after data 
pages and arbitrary key/value pairs in its footer. We can exploit those hooks 
to **embed** a small, per‑file distinct‑value index directly in the file—no 
extra files, no format forks, and no compatibility breakage.
+
+In the rest of this post, we’ll:
+
+1. Walk through the simple binary layout for a distinct‑value list.
+2. Show how to write it inline after the normal Parquet pages.
+3. Record its offset in the footer’s metadata map.
+4. Extend DataFusion’s `TableProvider` to discover and use that index for 
file‑level pruning.
+5. Verify everything still works in DuckDB via `read_parquet()`.
+
+---
+
+## Background
+
+Several examples in the DataFusion repository illustrate the benefits of using 
external indexes for pruning:
+
+* 
[`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs)
+* 
[`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs)
+
+Those demos work by building separate index files (Bloom filters, maps of 
distinct values) and associating them with Parquet files. While effective, this 
approach:
+
+* **Increases operational complexity:** Two files per dataset to track.
+* **Risks synchronization issues:** Removing or renaming one file breaks the 
index.
+* **Reduces portability:** Harder to share or move Parquet data when the index 
is external.
+
+Meanwhile, critics of Parquet’s extensibility point to the lack of a 
*standard* way to embed auxiliary data (see Amudai). But in practice, Parquet 
tolerates unknown content gracefully:

Review Comment:
   Here is a link to the amudai docs that might be good to include: 
https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md



##
content/blog/datafusion-custom-parquet-index.md:
##
@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a c

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035299849

   Thank you @alamb, I will keep polishing it before you review!
   





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


alamb commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035285253

   This is amazing -- thank you @zhuqi-lucas and @2010YOUY01  -- I will review 
this asap, but as today is a holiday in the US I may not have a chance to do so 
until tomorrow. 





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035225130

   > This post is great, I find the content easy to follow.
   > 
   > I have a suggestion for the first paragraph though: perhaps we should 
emphasize the motivation more clearly at the beginning. I think @alamb 's point 
in the YouTube video is particularly compelling — we don’t need to invent a new 
file format to support additional indexing. Instead, we can extend Parquet with 
custom indexes without compromising the file format’s interchangeability.
   
   Thank you @2010YOUY01 for the review, good point. In the latest version, I 
added the point that we don't need a new format; Parquet itself is very good.





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


2010YOUY01 commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035056285

   This post is great, I find the content easy to follow.
   
   I have a suggestion for the first paragraph though: perhaps we should 
emphasize the motivation more clearly at the beginning. I think @alamb 's point 
in the YouTube video is particularly compelling — we don’t need to invent a new 
file format to support additional indexing. Instead, we can extend Parquet with 
custom indexes without compromising the file format’s interchangeability.





Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3034853896

   I am not an expert at writing blog posts, so folks are welcome to help 
polish it together, thanks a lot! cc @alamb 

