Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041532117

I pushed some non-trivial changes to this blog:

1. Added @JigaoLuo as an author (hope this is ok @zhuqi-lucas)
2. Added a section with a high-level overview of adding user-defined indexes
3. Focused the example section on reading/writing the index and integrating it into DataFusion

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041511194 Thank you -- I have spent a while this morning adding additional content -- I will push an update soon
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041421377

> zhuqi-lucas#1

Thank you @JigaoLuo, merged your changes!

> Two small nitpicks I came across today:
>
> * "Footer" vs. "Metadata"?: Apologies for being pedantic, but I think we’re consistently referring to metadata here, not just the footer. Xiangpeng also corrected me on this elsewhere:
>
>   > footer often refers to the last 8 byte of Parquet file
>
> * One small thing and question to consider: does support for user-defined indexes depend on a specific version of Parquet? If so, it might be helpful to add a brief note about that. I’m not sure of the answer myself, but it could be worth clarifying.

You’re absolutely right: what we're describing is the file-level metadata (the `key_value_metadata` in the `FileMetaData` Thrift struct), not just the last 8 bytes of the file. In Parquet parlance, “footer” technically refers to the file trailer (the 4-byte metadata length followed by the 4-byte `PAR1` magic), whereas “metadata” covers everything in the `FileMetaData` block (including all custom key-value pairs). We should consistently say “metadata” throughout to avoid that confusion.

Custom (user-defined) metadata via the `FileMetaData.key_value_metadata` field has been part of the Parquet format since its earliest releases. Any reader that implements the basic format will ignore key-value entries it does not recognize. However, our arrow-rs dependency should be >= 55.2.0 so that the writer's internal byte count stays consistent when index bytes are appended.

> **Prerequisite:** Requires **arrow-rs v55.2.0** or later, which includes the new “buffered write” API ([apache/arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)).
> This API keeps the internal byte count in sync so you can append index bytes immediately after data pages.
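To make the trailer/metadata distinction above concrete, here is a minimal sketch in Rust using only the standard library. It relies on the documented Parquet layout, in which a plain (unencrypted) file ends with the serialized `FileMetaData`, then a 4-byte little-endian length, then the `PAR1` magic. The function name `metadata_range` is made up for illustration and is not part of arrow-rs or the `parquet` crate; decoding the Thrift payload itself is out of scope here.

```rust
/// Size of the Parquet trailer: 4-byte little-endian metadata length + 4-byte magic.
const TRAILER_LEN: usize = 8;
/// Magic bytes that end a plain (unencrypted) Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns the byte range occupied by the serialized
/// `FileMetaData` inside `file`, or `None` if the bytes do not end with a
/// well-formed trailer.
fn metadata_range(file: &[u8]) -> Option<std::ops::Range<usize>> {
    if file.len() < TRAILER_LEN {
        return None;
    }
    let trailer_start = file.len() - TRAILER_LEN;
    let trailer = &file[trailer_start..];

    // The last four bytes must be the magic.
    if trailer[4..] != PARQUET_MAGIC[..] {
        return None;
    }

    // The four bytes before the magic hold the metadata length.
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&trailer[..4]);
    let metadata_len = u32::from_le_bytes(len_bytes) as usize;

    // The metadata sits immediately before the trailer.
    let metadata_start = trailer_start.checked_sub(metadata_len)?;
    Some(metadata_start..trailer_start)
}
```

The custom key-value entries discussed in this thread live inside the returned byte range rather than in the final 8 bytes, which is why "metadata" is the more accurate word.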
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041396574 The outlook section I drafted is in this PR: https://github.com/zhuqi-lucas/datafusion-site/pull/1
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3041395147

Two small nitpicks I came across today:

- "Footer" vs. "Metadata"?: Apologies for being pedantic, but I think we’re consistently referring to metadata here, not just the footer. Xiangpeng also corrected me on this elsewhere:

  > footer often refers to the last 8 byte of Parquet file

- One small thing and question to consider: does support for user-defined indexes depend on a specific version of Parquet? If so, it might be helpful to add a brief note about that. I’m not sure of the answer myself, but it could be worth clarifying.
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040695911

> I just pushed a commit that reworked the intro a bit and started filling out the background
>
> https://github.com/user-attachments/assets/efe36816-7fed-44d7-9158-1b2fc19ffb19
> https://github.com/user-attachments/assets/07e36677-46a6-4f2e-8b8a-c5eab9545167
>
> @JigaoLuo the outlook section you describe sounds great. I envision it right after the
>
> ```
> ## 1. Parquet 101: File Anatomy & Standard Index Structures
> ```
>
> Section
>
> Perhaps like
>
> ```
> ## 2. Extending Parquet with Special Indexes
> ```
>
> (this is where figure 2 goes and where we will explain how to embed a custom index). So it makes a lot of sense to mention here the potential use cases (and that the index can be written after each row group or at the end of the file, and it can have information for each row group, individual row groups, columns, etc., whatever you want).
>
> I would also be interested to hear what @zhuqi-lucas thinks

Amazing work, thank you @alamb.

> Regarding my impression during reading: **"the Embedded Index is just a hashset to speed up scans, which adds overhead to Parquet."** as mentioned as a follow-up here: [#79 (comment)](https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247)
>
> If other readers also have the same impression, it might unintentionally limit how they perceive the potential of the Embedded Index. To address this, we could consider adding **a short Outlook section** (either at the beginning or the end of the blog) to explicitly highlight what the Embedded Index is capable of. It’s not just a hashset for pruning; in principle, it could support a wide range of use cases. Use cases are also discussed here: [apache/datafusion#16374 (comment)](https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047)
>
> I’d be happy to help draft such an Outlook section, pending confirmation from your side.

Looks great @JigaoLuo! Feel free to add it, thank you!
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245879 I need to run now to attend to some family matters. I'll be back tomorrow
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245413

I just pushed a commit that reworked the intro a bit and started filling out the background

https://github.com/user-attachments/assets/efe36816-7fed-44d7-9158-1b2fc19ffb19
https://github.com/user-attachments/assets/07e36677-46a6-4f2e-8b8a-c5eab9545167

@JigaoLuo the outlook section you describe sounds great. I envision it right after the

```
## 1. Parquet 101: File Anatomy & Standard Index Structures
```

Section

Perhaps like

```
## 2. Extending Parquet with Special Indexes
```

(this is where figure 2 goes and where we will explain how to embed a custom index). So it makes a lot of sense to mention here the potential use cases (and that the index can be written after each row group or at the end of the file, and it can have information for each row group, individual row groups, columns, etc., whatever you want).

I would also be interested to hear what @zhuqi-lucas thinks
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039990579

This also ties into my initial impression that "the Embedded Index is just a hashset to speed up scans, which adds overhead to Parquet", as mentioned in a follow-up here: https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247

I think this framing might unintentionally limit how readers perceive its potential. To address this, we could consider adding **an Outlook section** (either at the beginning or the end of the blog) to explicitly highlight what the Embedded Index is capable of. It’s not just a hashset for pruning; in principle, it could support a wide range of use cases. Use cases are also discussed here: https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047

I’d be happy to help draft such an Outlook section, pending confirmation from your side.
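For readers who have not seen the draft, the "hashset" framing refers to something like the following sketch: one set of distinct values per file, consulted before the file is opened. This is a simplified illustration in plain Rust, not the code from the blog or from DataFusion. `IndexedFile` and `prune_files` are hypothetical names, and a real integration would go through DataFusion's `TableProvider`.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of the index: a file path together with the
/// distinct values of one column in that file.
struct IndexedFile {
    path: String,
    distinct: HashSet<String>,
}

/// File-level pruning for a predicate of the form `column = needle`:
/// keep only files whose distinct-value set contains `needle`.
/// Files that cannot contain a match are skipped without being read.
fn prune_files<'a>(files: &'a [IndexedFile], needle: &str) -> Vec<&'a str> {
    files
        .iter()
        .filter(|file| file.distinct.contains(needle))
        .map(|file| file.path.as_str())
        .collect()
}
```

The point raised in this comment is that the embedded bytes need not be a set at all; any structure the writer and reader agree on could be stored the same way.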
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039820680 Thanks -- I am going to spend an hour or so taking a pass through this blog trying to get the formatting to work out. So exciting
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039224050

> Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.
>
> Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here [apache/datafusion#16374 (comment)](https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391).
>
> (I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)

Thank you @JigaoLuo, good point. I am working on some urgent bug fixes and will try to add your suggestions soon. Thanks!
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039222981

> Thanks @zhuqi-lucas -- I will keep looking at this later today

Thank you @alamb!
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039088620

Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.

Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391.

(I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039045610 Thanks @zhuqi-lucas -- I will keep looking at this later today
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3038122113 Thank you @alamb! I addressed the first round of comments, but the image is still not added to the content because it does not render well in my local preview.
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186696545 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,251 @@ +## Extending Parquet with Embedded Indexes and Accelerating Query Processing with DataFusion + +It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything "smarter" requires inventing a whole new file format. In fact, Parquet's column‑oriented design, with its well‑defined footer metadata and reserved byte regions, already provides the flexibility to embed arbitrary indexing structures without breaking compatibility. + +In this post, we'll first review the core concepts of the Apache Parquet file format. Then explain how to store custom indexes inside Parquet files, and finally show how Apache DataFusion can leverage a **compact distinct‑value index** to achieve ultra‑fast file‑level pruning—all while preserving complete interchangeability with other tools. + +And besides the custom index, a straightforward rewritten parquet file can have good improvement also. +For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves +performance by more than 2x for many queries. So with a custom index, we can expect even more improvement. +More details: [Blog post about parquet vs custom file formats #16149 +](https://github.com/apache/datafusion/issues/16149). [JigaoLuo](https://github.com/JigaoLuo) and [XiangpengHao](https://github.com/XiangpengHao) have been exploring these Parquet‑rewriting techniques over in the liquid‑cache which is using DataFusion, repo—check out [XiangpengHao/liquid‑cache#227](https://github.com/XiangpengHao/liquid-cache/issues/227) for more insights. + +Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll: + +1. 
Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom filters). +2. Introduce a simple on‑page binary format for a distinct‑value index. +3. Show how to append that index inline, record its offset in the footer, and have DataFusion consume it at query time. +4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from + [`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs). + +> **Prerequisite:** Requires **arrow‑rs v55.2.0** or later, which includes the new “buffered write” API ([apache/arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)). +> This API keeps the internal byte count in sync so you can append index bytes immediately after data pages. + +--- + +## Introduction + +Parquet is a popular columnar format tuned for high‑performance analytics: column pruning, predicate pushdown, page indices and Bloom filters all help reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), engines often still scan entire row groups or files that contain zero matches. + +Many systems solve this by producing *external* index files—Bloom filters, inverted lists, or custom sketches—alongside Parquet. But juggling separate index files adds operational overhead and risks out‑of‑sync data. Worse, some have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)). + +**But Parquet itself is extensible**: it tolerates unknown bytes after data pages and arbitrary key/value pairs in its footer. We can exploit those hooks to **embed** a small, per‑file distinct‑value index directly in the file—no extra files, no format forks, and no compatibility breakage. + +In the rest of this post, we’ll: + +1. Walk through the simple binary layout for a distinct‑value list. +2. Show how to write it inline after the normal Parquet pages. +3. 
Record its offset in the footer’s metadata map. +4. Extend DataFusion’s `TableProvider` to discover and use that index for file‑level pruning. +5. Verify everything still works in DuckDB via `read_parquet()`. + +--- + +## 1. Parquet 101: File Anatomy & Native Pruning Hooks +TODO add image here? Review Comment: @alamb I tried to add the image, but it does not render well in my local preview and I am not sure why, so I added a TODO here for now.
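Steps 1 to 3 of the outline quoted above can be sketched with the standard library alone. The layout used here (a 4-byte magic, an 8-byte little-endian length, then newline-separated UTF-8 values) and the names `INDEX_MAGIC`, `append_index`, and `read_index` are assumptions chosen for illustration; the authoritative version is the `parquet_embedded_index.rs` example linked in the draft. A `Vec<u8>` stands in for the file being written.

```rust
use std::collections::BTreeSet;

/// Hypothetical magic marking the start of the embedded index blob.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";
/// Header size: 4-byte magic + 8-byte little-endian payload length.
const HEADER_LEN: usize = 12;

/// Appends the distinct values to `file` and returns the offset at which the
/// index starts. Values must not contain newlines in this simple encoding.
fn append_index(file: &mut Vec<u8>, values: &BTreeSet<String>) -> u64 {
    let offset = file.len() as u64;
    let payload = values
        .iter()
        .map(String::as_str)
        .collect::<Vec<_>>()
        .join("\n");
    file.extend_from_slice(INDEX_MAGIC);
    file.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    file.extend_from_slice(payload.as_bytes());
    offset
}

/// Reads the index back from `file` at `offset`. Returns `None` if the magic
/// does not match, the bytes are truncated, or the payload is not UTF-8.
fn read_index(file: &[u8], offset: u64) -> Option<BTreeSet<String>> {
    let start = offset as usize;
    let header_end = start.checked_add(HEADER_LEN)?;
    let header = file.get(start..header_end)?;
    if header[..4] != INDEX_MAGIC[..] {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&header[4..]);
    let payload_len = u64::from_le_bytes(len_bytes) as usize;
    let payload = file.get(header_end..header_end.checked_add(payload_len)?)?;
    let text = std::str::from_utf8(payload).ok()?;
    Some(text.lines().map(str::to_owned).collect())
}
```

In a real Parquet file the returned offset would be written as a string value in the file's key-value metadata so that a reader can locate the index, and readers that do not know the key simply ignore both the entry and the extra bytes.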
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186669292 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ +## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes + +It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning. + +And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves +performance by more than 2x for many queries. So with a custom index, we can expect even more improvement. +More details: [Blog post about parquet vs custom file formats #16149 +](https://github.com/apache/datafusion/issues/16149) + +Building on the ideas from Andrew Lamb’s talk on [indexing Parquet with DataFusion](https://www.youtube.com/watch?v=74YsJT1-Rdk), we’ll: + +1. Review Parquet’s built‑in metadata hooks (Min/Max, page index, Bloom filters). +2. Introduce a simple on‑page binary format for a distinct‑value index. +3. Show how to append that index inline, record its offset in the footer, and have DataFusion consume it at query time. +4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from + [`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs). 
+ +> **Prerequisite:** this example requires the new “buffered write” API in +> [apache/arrow‑rs#7714](https://github.com/apache/arrow-rs/pull/7714), +> which keeps the internal byte count in sync so you can append index bytes immediately after data pages. + +--- + +## Introduction + +Parquet is a popular columnar format tuned for high‑performance analytics: column pruning, predicate pushdown, page indices and Bloom filters all help reduce I/O. Yet when predicates are highly selective (e.g. `category = 'foo'`), engines often still scan entire row groups or files that contain zero matches. + +Many systems solve this by producing *external* index files—Bloom filters, inverted lists, or custom sketches—alongside Parquet. But juggling separate index files adds operational overhead and risks out‑of‑sync data. Worse, some have used that pain point to justify brand‑new formats (see Microsoft’s [Amudai spec](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md)). + +**But Parquet itself is extensible**: it tolerates unknown bytes after data pages and arbitrary key/value pairs in its footer. We can exploit those hooks to **embed** a small, per‑file distinct‑value index directly in the file—no extra files, no format forks, and no compatibility breakage. + +In the rest of this post, we’ll: + +1. Walk through the simple binary layout for a distinct‑value list. +2. Show how to write it inline after the normal Parquet pages. +3. Record its offset in the footer’s metadata map. +4. Extend DataFusion’s `TableProvider` to discover and use that index for file‑level pruning. +5. Verify everything still works in DuckDB via `read_parquet()`. 
+ +--- + +## Background + +Several examples in the DataFusion repository illustrate the benefits of using external indexes for pruning: + +* [`parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs) +* [`advanced_parquet_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs) + +Those demos work by building separate index files (Bloom filters, maps of distinct values) and associating them with Parquet files. While effective, this approach: + +* **Increases operational complexity:** Two files per dataset to track. +* **Risks synchronization issues:** Removing or renaming one file breaks the index. +* **Reduces portability:** Harder to share or move Parquet data when the index is external. + +Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully: + +* **Arbitrary metadata:** Key/value pairs in the footer are opaque to readers. +* **Unused regions:** Bytes after data pages (before the Thrift footer) are ignored by standard readers. + +We’ll exploit both to embed our index inline. + +--- + +## Motivation + +When scanning Parquet files, DataFusion (like other engines) reads row group
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186572247 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186552221 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ [...] +Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully: Review Comment: Good suggestion @alamb !
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186543546 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ [...] +> **Prerequisite:** this example requires the new “buffered write” API in Review Comment: Good point @alamb !
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186527671 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ [...] +And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves Review Comment: Good suggestion @alamb !
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2186508081 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ +## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes + +It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning. Review Comment: Good suggestion @alamb !
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3037856509 > https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85 Thank you @alamb for the review and great suggestions! I will try to address them today. Feel free to edit this blog and correct me if I am missing anything, thanks!
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
alamb commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2185866801 ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ [...] +Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully: Review Comment: Here is a link to the amudai docs that might be good to include: https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md ## content/blog/datafusion-custom-parquet-index.md: ## @@ -0,0 +1,232 @@ +## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes + +It’s a c
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035299849 Thank you @alamb, I will keep polishing it before you review!
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035285253 This is amazing -- thank you @zhuqi-lucas and @2010YOUY01 -- I will review this asap, but as today is a holiday in the US I may not have a chance to do so until tomorrow.
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035225130 > This post is great, I find the content easy to follow. > > I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb 's point in the YouTube video is particularly compelling — we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability. Thank you @2010YOUY01 for the review, good point. In the latest version, I added the point that we don't need a new format; Parquet itself is very good.
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
2010YOUY01 commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3035056285 This post is great, I find the content easy to follow. I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb 's point in the YouTube video is particularly compelling — we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability.
Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3034853896 I am not an expert at blog writing, so folks are welcome to polish it together, thanks a lot! cc @alamb
