Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-04 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3034804227

   Thank you @alamb, I will submit a draft blog soon in:
   
   https://github.com/apache/datafusion/issues/16372


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-03 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3033685921

   This is so great -- now we just need to write up a blog post 🎣 
   
   
   Thanks again @zhuqi-lucas  -- this is going to be great





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-03 Thread via GitHub


alamb merged PR #16395:
URL: https://github.com/apache/datafusion/pull/16395





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-02 Thread via GitHub


jcsherin commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027788574

   @alamb The overview documentation is very clear, and I love the ASCII art.





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-02 Thread via GitHub


jcsherin commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2179977087


##
datafusion-examples/examples/parquet_embedded_index.rs:
##
@@ -0,0 +1,472 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Embedding and using a custom index in Parquet files
+//!
+//! # Background
+//!
+//! This example shows how to add an application‑specific index to an Apache
+//! Parquet file without modifying the Parquet format itself. The resulting
+//! files can be read by any standard Parquet reader, which will simply
+//! ignore the extra index data.
+//!
+//! A “distinct value” index, similar to a ["set" Skip Index in ClickHouse],
+//! is stored in a custom binary format within the Parquet file. Only the
+//! location of the index is stored in the Parquet footer key/value metadata.
+//! This approach is more efficient than storing the index itself in the footer
+//! metadata because the footer must be read and parsed by all readers,
+//! even those that do not use the index.
+//!
+//! The resulting Parquet file layout is as follows:
+//!
+//! ```text
+//!                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+//!                      β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+//!                      β”‚ β”‚      DataPage       β”‚ β”‚
+//!                      β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+//!  Standard Parquet    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+//!  Data Pages          β”‚ β”‚      DataPage       β”‚ β”‚
+//!                      β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+//!                      β”‚           ...           β”‚
+//!                      β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+//!                      β”‚ β”‚      DataPage       β”‚ β”‚
+//!                      β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+//!                      β”‚ ┏━━━━━━━━━━━━━━━━━━━━━┓ β”‚
+//!  Non standard        β”‚ ┃                     ┃ β”‚
+//!  index (ignored by   β”‚ ┃ Custom Binary Index ┃ β”‚
+//!  other Parquet       β”‚ ┃  (Distinct Values)  ┃◀┼─ ─ ─ ┐
+//!  readers)            β”‚ ┃                     ┃ β”‚
+//!                      β”‚ ┗━━━━━━━━━━━━━━━━━━━━━┛ β”‚      β”‚
+//!  Standard Parquet    β”‚ ┏━━━━━━━━━━━━━━━━━━━━━┓ β”‚        key/value metadata
+//!  Page Index          β”‚ ┃     Page Index      ┃ β”‚      β”‚ contains location
+//!                      β”‚ ┗━━━━━━━━━━━━━━━━━━━━━┛ β”‚        of special index
+//!                      β”‚ ╔═════════════════════╗ β”‚      β”‚
+//!                      β”‚ β•‘ Parquet Footer w/   β•‘ β”‚
+//!                      β”‚ β•‘ Metadata            ║─┼─ ─ ─ β”˜
+//!                      β”‚ β•‘ (Thrift Encoded)    β•‘ β”‚
+//!                      β”‚ β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β• β”‚
+//!                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+//!
+//!                             Parquet File
+//! ```
+//!
+//! # High Level Flow
+//!
+//! To create a custom Parquet index:
+//!
+//! 1. Compute the index and serialize it to a binary format.
+//!
+//! 2. Write the Parquet file with:
+//!- regular data pages
+//!- the serialized index inline
+//!- footer key/value metadata entry to locate the index
+//!
+//! To read and use the index:
+//!
+//! 1. Read and deserialize the file’s footer to locate the index.
+//!
+//! 2. Read and deserialize the index.
+//!
+//! 3. Create a `TableProvider` that knows how to use the index to quickly find
+//!    the relevant files, row groups, data pages or rows based on pushed down
+//!    filters.
+//!
+//! # FAQ: Why do other Parquet readers skip over the custom index?
+//!
+//! The flow for reading a parquet file is:
+//!
+//! 1. Seek to the end of the file and read the last 8 bytes (a 4‑byte
+//!little‑endian footer length followed by the `PAR1` magic bytes).
+//!
+//! 2. Seek backwards by that length to parse the Thrift‑encoded footer
+//!  

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-02 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027610547

   > * 54a9e61
   
   Thank you @alamb looks great to me!
   
   > Simplified the code to only write the offset index (the length is stored inline)
   
   Perfect for this change!





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-02 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027590526

   I think it is now ready to merge, but it would probably be good for someone 
else to go over it one last time to make sure it is clear





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-02 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027588395

   Hi @zhuqi-lucas  -- spent a while this morning going over this PR carefully 
-- it is great!
   
   I hope you don't mind but I made some substantial edits to try and make it 
read a bit better:
   1. Revamped the documentation and overview
   2. Updated the ASCII art
   3. Moved reading the index into `DistinctIndex`
   4. Added a bunch more comments
   5. Simplified the code to only write the offset index (the length is stored inline)
   
   In my mind none of this was required, but since I plan to make a Huge Deal 
   (TM) about this example publicly I figured spending some extra time polishing 
   it would be worthwhile
   
   - 
[54a9e61](https://github.com/apache/datafusion/pull/16395/commits/54a9e610ed64448f95fe8526129871c63a8efcff)





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-01 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3026260872

   > Thank you @zhuqi-lucas -- I started going through this PR again in detail
   > 
   > I renamed the example to align with the other parquet examples, and I 
added it to the list of examples.
   > 
   > I also took a pass through the comments.
   > 
   > I have run out of time today, but I'll finish it up first thing tomorrow 
and hopefully merge
   > 
   > Thank you again so much!
   
   Thank you very much @alamb ! This looks pretty good.





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-01 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2178933212


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom “distinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index
+//! for pruning in DataFusion’s Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little‑endian length, to identify our custom index format
+//!    - the serialized distinct‑values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file’s footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename → `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn’t match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application‑specific index directly in Parquet
+//! metadata to achieve efficient file‑level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: we add nothing extra to the footer metadata itself; we
+//! write the custom index after the data pages and only read it when needed.
+//!
+//! **Compatibility note: why other Parquet readers simply skip over our extra index blob**
+//!
+//! Any standard Parquet reader will:
+//! 1. Seek to the end of the file and read the last 8 bytes (a 4‑byte little‑endian footer length followed by the `PAR1` magic).
+//! 2. Seek backwards by that length to parse only the Thrift‑encoded footer metadata (including key/value pairs).
+//!
+//! Since our custom index bytes are appended *before* the footer (and we do not alter Parquet’s metadata schema), readers
+//! never scan from the file start or “overflow” into our blob. They will encounter two unknown keys
+//! (`distinct_index_offset` and `distinct_index_length`) in the footer metadata, ignore them (or expose them as extra metadata),
+//! and will not attempt to read or deserialize the raw index bytes.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a page‑index offset
+///
+/// Note: the page index offset will come after the custom index, which
+/// itself comes after the data pages.
+///
+/// ```text
+/// β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚      DataPage       β”‚ β”‚  Standard Parquet
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚  Data pages
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚      DataPage       β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-07-01 Thread via GitHub


alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2178596921


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom “distinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index
+//! for pruning in DataFusion’s Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little‑endian length, to identify our custom index format
+//!    - the serialized distinct‑values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file’s footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename → `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn’t match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application‑specific index directly in Parquet
+//! metadata to achieve efficient file‑level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: we add nothing extra to the footer metadata itself; we
+//! write the custom index after the data pages and only read it when needed.
+//!
+//! **Compatibility note: why other Parquet readers simply skip over our extra index blob**
+//!
+//! Any standard Parquet reader will:
+//! 1. Seek to the end of the file and read the last 8 bytes (a 4‑byte little‑endian footer length followed by the `PAR1` magic).
+//! 2. Seek backwards by that length to parse only the Thrift‑encoded footer metadata (including key/value pairs).
+//!
+//! Since our custom index bytes are appended *before* the footer (and we do not alter Parquet’s metadata schema), readers
+//! never scan from the file start or “overflow” into our blob. They will encounter two unknown keys
+//! (`distinct_index_offset` and `distinct_index_length`) in the footer metadata, ignore them (or expose them as extra metadata),
+//! and will not attempt to read or deserialize the raw index bytes.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a page‑index offset
+///
+/// Note: the page index offset will come after the custom index, which
+/// itself comes after the data pages.
+///
+/// ```text
+/// β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚      DataPage       β”‚ β”‚  Standard Parquet
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚  Data pages
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚      DataPage       β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚           ...

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-30 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3021686409

   Updated the code with the merged PR: 
https://github.com/apache/datafusion/pull/16575
   And also added more comments.





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-21 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993532737

   > How does it ensure that this extra index can be safely ignored by other 
readers? If another parquet reader implementation decides to do a sequential 
whole file scan, will it read into the extra custom data?
   
   I agree with what @zhuqi-lucas says too
   
   The way I think about this is that the parquet file's footer contains 
pointers (offsets) to the actual data in the file. There is no requirement that 
the footer points to all bytes within the file
   
   There are other interesting things that can be done with this setup too (for 
example, concatenating parquet files together without having to re-encode the 
data (you can just copy the bytes around and rewrite the footer) 
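
   The "copy the bytes and rewrite the footer" idea can be sketched with a toy
   footer format: a flat list of `(offset, length)` pairs plus a trailing
   little-endian length and a made-up `TOY1` magic. This is NOT real Parquet
   (a real footer is Thrift-encoded), just a std-only illustration of why
   offset-based footers allow concatenation without re-encoding:

   ```rust
   // Toy parquet-like layout: [blobs...][footer][footer_len: u32 LE][b"TOY1"].
   // The footer is a list of (offset, length) u32 pairs, one per blob.
   // Nothing requires the footer to reference every byte in the file.

   fn append_footer(file: &mut Vec<u8>, entries: &[(u32, u32)]) {
       let start = file.len();
       for (off, len) in entries {
           file.extend_from_slice(&off.to_le_bytes());
           file.extend_from_slice(&len.to_le_bytes());
       }
       let footer_len = (file.len() - start) as u32;
       file.extend_from_slice(&footer_len.to_le_bytes());
       file.extend_from_slice(b"TOY1");
   }

   fn write_file(blobs: &[&[u8]]) -> Vec<u8> {
       let mut file = Vec::new();
       let mut entries = Vec::new();
       for blob in blobs {
           entries.push((file.len() as u32, blob.len() as u32));
           file.extend_from_slice(blob);
       }
       append_footer(&mut file, &entries);
       file
   }

   fn read_entries(file: &[u8]) -> Vec<(u32, u32)> {
       let n = file.len();
       assert_eq!(&file[n - 4..], b"TOY1".as_slice());
       let flen = u32::from_le_bytes(file[n - 8..n - 4].try_into().unwrap()) as usize;
       file[n - 8 - flen..n - 8]
           .chunks(8)
           .map(|c| {
               (
                   u32::from_le_bytes(c[0..4].try_into().unwrap()),
                   u32::from_le_bytes(c[4..8].try_into().unwrap()),
               )
           })
           .collect()
   }

   /// Concatenate without re-encoding: copy each file's referenced bytes
   /// verbatim and write one new footer with shifted offsets.
   fn concat(a: &[u8], b: &[u8]) -> Vec<u8> {
       let mut out = Vec::new();
       let mut entries = Vec::new();
       for f in [a, b] {
           for (off, len) in read_entries(f) {
               entries.push((out.len() as u32, len));
               out.extend_from_slice(&f[off as usize..(off + len) as usize]);
           }
       }
       append_footer(&mut out, &entries);
       out
   }

   fn main() {
       let a = write_file(&[b"row-group-1".as_slice(), b"row-group-2".as_slice()]);
       let b = write_file(&[b"row-group-3".as_slice()]);
       let merged = concat(&a, &b);
       let entries = read_entries(&merged);
       let blobs: Vec<&[u8]> = entries
           .iter()
           .map(|&(off, len)| &merged[off as usize..(off + len) as usize])
           .collect();
       assert_eq!(
           blobs,
           [
               b"row-group-1".as_slice(),
               b"row-group-2".as_slice(),
               b"row-group-3".as_slice()
           ]
       );
       println!("merged {} blobs without re-encoding", entries.len());
   }
   ```

   Because readers resolve every blob through the footer's offsets, bytes not
   listed there (such as a custom index region) are simply never visited.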





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-21 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993537466

   This is amazing @alamb ! Thanks!
   > There are other interesting things that can be done with this setup too 
(for example, concatenating parquet files together without having to re-encode 
the data (you can just copy the bytes around and rewrite the footer)





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-21 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993531597

   FYI @XiangpengHao and @JigaoLuo -- here is another example of the somewhat 
crazy things you can do with parquet





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-21 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993484604

   > wow this is so cool!
   > 
   > I have a question (and I think it's worth adding to the comment for people 
   like me who aren't familiar with parquet internals): How does it ensure that 
this extra index can be safely ignored by other readers? If another parquet 
reader implementation decides to do a sequential whole file scan, will it read 
into the extra custom data?
   
   Thank you for the review and great question, @2010YOUY01!
   
   **Short answer:**  
   Because we append our custom index *before* the Parquet footer and never 
modify the existing metadata schema, Parquet readers will still:
   
   1.  Seek to the **end of file** and read the last 8 bytes, which consist of: 
 
   - A 4‑byte little‑endian footer length  
   - The magic marker `PAR1`  
   2.  Jump back by that length to parse the Thrift‑encoded footer (and its 
key‑value list).  
   
   Any bytes you append *ahead* of the footer (i.e. after the data pages but 
   before writing the footer and magic) are simply skipped over by steps (1) and 
   (2), because readers never scan from the file start; they always locate the 
   footer via the trailer magic and length.  
   
   **Why key/value metadata is safe:**  
   - We only **add** two new keys (`distinct_index_offset` and 
   `distinct_index_length`) into the existing footer metadata map.  
   - All standard readers will see unknown keys and either ignore them or 
   surface them as "extra metadata," but they will not attempt to deserialize our 
   custom binary blob.  
   - On our side, we:
   
  1. Read the Parquet footer as usual.  
  2. Extract our two key/value entries for offset+length.  
  3. `seek(offset)` + `read_exact(length)` to load the custom index and 
  deserialize it.  
   
   Because every compliant Parquet reader must interpret the `PAR1` magic and 
   footer length, none of them will ever "spill over" into our blob or treat it 
   as data pages.
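
   The read path described above can be sketched with plain Rust `std` I/O
   against a synthetic in-memory file. The plain-text footer and the way the
   two keys are encoded here are stand-ins for illustration only; a real
   Parquet footer is Thrift-encoded:

   ```rust
   use std::collections::HashMap;
   use std::io::{Cursor, Read, Seek, SeekFrom};

   /// Parse the parquet-style trailer: the last 8 bytes are a 4-byte
   /// little-endian footer length followed by the `PAR1` magic, then
   /// seek backwards by that length and return the footer bytes.
   fn read_footer<R: Read + Seek>(r: &mut R) -> std::io::Result<Vec<u8>> {
       r.seek(SeekFrom::End(-8))?;
       let mut tail = [0u8; 8];
       r.read_exact(&mut tail)?;
       assert_eq!(&tail[4..], b"PAR1".as_slice(), "missing trailer magic");
       let footer_len = u32::from_le_bytes(tail[..4].try_into().unwrap()) as i64;
       r.seek(SeekFrom::End(-8 - footer_len))?;
       let mut footer = vec![0u8; footer_len as usize];
       r.read_exact(&mut footer)?;
       Ok(footer)
   }

   fn main() -> std::io::Result<()> {
       // Synthetic layout: [data pages][custom index][footer][len][PAR1].
       let data = b"....data pages....";
       let index = b"IDX1-distinct-values-blob";
       let mut file = Vec::new();
       file.extend_from_slice(data);
       let index_offset = file.len();
       file.extend_from_slice(index);
       let footer = format!(
           "distinct_index_offset={};distinct_index_length={}",
           index_offset,
           index.len()
       );
       file.extend_from_slice(footer.as_bytes());
       file.extend_from_slice(&(footer.len() as u32).to_le_bytes());
       file.extend_from_slice(b"PAR1");

       // Read side: trailer -> footer -> key/value entries -> index blob.
       let mut cursor = Cursor::new(file);
       let footer_text = String::from_utf8(read_footer(&mut cursor)?).unwrap();
       let mut kv: HashMap<String, u64> = HashMap::new();
       for pair in footer_text.split(';') {
           let (k, v) = pair.split_once('=').unwrap();
           kv.insert(k.to_string(), v.parse().unwrap());
       }
       cursor.seek(SeekFrom::Start(kv["distinct_index_offset"]))?;
       let mut blob = vec![0u8; kv["distinct_index_length"] as usize];
       cursor.read_exact(&mut blob)?;
       assert_eq!(&blob, index);
       println!("recovered custom index: {} bytes", blob.len());
       Ok(())
   }
   ```

   A reader that does not know the two keys stops after parsing the footer and
   never touches the blob's byte range.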
   
   I’ll add these details into the code comments. We’re also planning a blog 
post on Parquet indexing internals suggested by @alamb , thanks!





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


2010YOUY01 commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993326084

   wow this is so cool!
   
   I have a question (and I think it's worth adding to the comment for people 
   like me who aren't familiar with parquet internals):
   How does it ensure that this extra index can be safely ignored by other 
readers? If another parquet reader implementation decides to do a sequential 
whole file scan, will it read into the extra custom data?





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159141300


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom “distinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index
+//! for pruning in DataFusion’s Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little‑endian length, to identify our custom index format
+//!    - the serialized distinct‑values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file’s footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename → `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn’t match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application‑specific index directly in Parquet
+//! metadata to achieve efficient file‑level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: we add nothing extra to the footer metadata itself; we
+//! write the custom index after the data pages and only read it when needed.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowSchemaConverter;
+use datafusion::parquet::data_type::{ByteArray, ByteArrayType};
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::parquet::file::writer::SerializedFileWriter;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a page‑index offset
+///
+/// Note: the page index offset will come after the custom index, which
+/// itself comes after the data pages.
+///
+/// ```text
+/// β”Œβ”€β”€β”
+/// β”‚β”Œβ”€β”€β”€β” β”‚
+/// β”‚β”‚ DataPage  β”‚ β”‚  Standard Parquet
+/// β”‚β””β”€β”€β”€β”˜ β”‚  Data pages
+/// β”‚β”Œβ”€β”€β”€β” β”‚
+/// β”‚β”‚ DataPage  β”‚ β”‚
+/// β”‚β””β”€β”€β”€β”˜ β”‚
+/// β”‚...   β”‚
+/// β”‚  β”‚
+/// β”‚β”Œβ”€β”€β”€β” β”‚
+/// β”‚β”‚ DataPage  β”‚ β”‚
+/// β”‚β””β”€β”€β”€β”˜ β”‚
+/// │┏━━━┓ β”‚
+/// │┃   ┃ β”‚key/value metadata
+/// │┃   Special Index   ┃◀┼that points to the
+/// │┃   ┃ β”‚ β”‚  custom index blob
+/// │┗━━━┛ β”‚
+/// │┏───┓ β”‚
+/// │┃ Page Index Offset ┃◀┼little‑endian u64
+/// │┗───┛ β”‚ β”‚  sitting after the custom index
+///
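The file‑level pruning in step 5 above boils down to a set‑membership test per file. A minimal, std‑only sketch of that idea (the helper name `prune_files` is illustrative, not part of the example):

```rust
use std::collections::{HashMap, HashSet};

/// Keep only the files whose distinct-value set contains `target`.
/// This mirrors the pruning in `scan()` for `category = 'X'` filters.
fn prune_files(index: &HashMap<String, HashSet<String>>, target: &str) -> Vec<String> {
    let mut keep: Vec<String> = index
        .iter()
        .filter(|(_, distinct)| distinct.contains(target)) // membership test
        .map(|(file, _)| file.clone())
        .collect();
    keep.sort(); // deterministic order for display
    keep
}

fn main() {
    let mut index = HashMap::new();
    index.insert(
        "a.parquet".to_string(),
        HashSet::from(["foo".to_string(), "bar".to_string()]),
    );
    index.insert(
        "b.parquet".to_string(),
        HashSet::from(["baz".to_string(), "qux".to_string()]),
    );
    index.insert(
        "c.parquet".to_string(),
        HashSet::from(["foo".to_string(), "quux".to_string()]),
    );
    // Only a.parquet and c.parquet can contain rows with category = 'foo'.
    println!("{:?}", prune_files(&index, "foo"));
}
```

In the real `TableProvider`, the surviving file names become `PartitionedFile`s in the `FileScanConfig`; everything else is skipped without any I/O.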

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2991840133

   > Thank you @zhuqi-lucas -- this is (really) cool. It is definitely blog post worthy (we have too many cool things that are blog worthy recently - and not enough time to write the blogs!)
   > 
   > Anyhow I left some other suggestions and will prioritize getting this PR in upstream
   > 
   > * [Support write to buffer api for SerializedFileWriter arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714)
   
   Thank you @alamb, I can try to write a blog about this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159136194


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+
+//! Example: embedding and using a custom β€œdistinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index
+//! for pruning in DataFusion’s Parquet scans.

Review Comment:
   Good suggestion, thank you @alamb !






Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159119104


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+
+//! Example: embedding and using a custom β€œdistinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index
+//! for pruning in DataFusion’s Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little‑endian length, to identify our custom index format
+//!    - the serialized distinct‑values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file’s footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename β†’ `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn’t match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application‑specific index directly in Parquet
+//! metadata to achieve efficient file‑level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: no extra information is added to the standard metadata, the
+//! custom index is written after the data pages, and it is read only when needed.
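The `IDX1` framing described in step 2 can be sketched with plain std code (helper names are illustrative; the real example writes the blob through Parquet's `SerializedFileWriter`):

```rust
/// Serialize distinct values as: b"IDX1" magic, little-endian u64 payload
/// length, then the newline-joined values.
fn serialize_index(values: &[&str]) -> Vec<u8> {
    let payload = values.join("\n").into_bytes();
    let mut out = Vec::with_capacity(4 + 8 + payload.len());
    out.extend_from_slice(b"IDX1");                               // magic marker
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes()); // LE length
    out.extend_from_slice(&payload);                              // index bytes
    out
}

/// Parse the framing back; returns None if the magic or length is wrong.
fn deserialize_index(bytes: &[u8]) -> Option<Vec<String>> {
    if bytes.len() < 12 || &bytes[0..4] != b"IDX1" {
        return None;
    }
    let len = u64::from_le_bytes(bytes[4..12].try_into().ok()?) as usize;
    let payload = bytes.get(12..12 + len)?;
    let text = std::str::from_utf8(payload).ok()?;
    Some(text.split('\n').map(|s| s.to_string()).collect())
}

fn main() {
    let blob = serialize_index(&["foo", "bar"]);
    println!("{:?}", deserialize_index(&blob));
}
```

The magic marker lets a reader verify it is looking at this index format before trusting the offset stored in the footer metadata.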
+
+

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159067305


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+
+///
+/// Example creating the Parquet file that

Review Comment:
   This is amazing -- thank you



##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+
+//! Example: embedding and using a custom β€œdistinct values” index in Parquet files
+//!
+//! This example shows how to build and leverage a file‑level distinct‑values index

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-20 Thread via GitHub


alamb commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2991749429

   > The example prints these logs; it looks good, thanks!
   
   
   this is so cool!





Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-19 Thread via GitHub


zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2987414899

   Thank you @alamb,
   I am excited to share an update today: I resolved the page index conflict by adding a new API in arrow-rs that writes bytes through the tracked buffer. This keeps the buffer-written metrics consistent, and the page index uses the same tracked buffer, so it is safe now. I have enabled the page index in the example, and the test results look good!
   
   I am currently using this arrow-rs branch until the code is merged:
   https://github.com/apache/arrow-rs/pull/7714
   
   
   The example prints these logs, which look good, thanks!
   
   
   ```text
   Writing values: [ByteArray { data: "foo" }, ByteArray { data: "bar" }, ByteArray { data: "foo" }]
   Writing custom index at offset: 68, length: 7
   Finished writing file to /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/a.parquet
   Writing values: [ByteArray { data: "baz" }, ByteArray { data: "qux" }]
   Writing custom index at offset: 68, length: 7
   Finished writing file to /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/b.parquet
   Writing values: [ByteArray { data: "foo" }, ByteArray { data: "quux" }, ByteArray { data: "quux" }]
   Writing custom index at offset: 70, length: 8
   Finished writing file to /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/c.parquet
   Reading index from /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/a.parquet (size: 363)
   Reading index at offset: 68, length: 7
   Read distinct index for a.parquet: "a.parquet"
   Reading index from /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/b.parquet (size: 363)
   Reading index at offset: 68, length: 7
   Read distinct index for b.parquet: "b.parquet"
   Reading index from /var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/c.parquet (size: 368)
   Reading index at offset: 70, length: 8
   Read distinct index for c.parquet: "c.parquet"
   Filtering for category: foo
   Pruned files: ["c.parquet", "a.parquet"]
   +----------+
   | category |
   +----------+
   | foo      |
   | foo      |
   | foo      |
   +----------+
   ```
   
   

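The reason the write-to-buffer API matters is visible in those logs: the recorded index offset must equal the number of bytes actually written so far, or the reader will seek to the wrong place. A dependency-free sketch of that bookkeeping, with an in-memory `Vec<u8>` standing in for the tracked file writer (the real code uses arrow-rs's `SerializedFileWriter`):

```rust
use std::io::Write;

/// Append an index blob to an in-memory "file", returning the (offset, length)
/// pair exactly as it would be recorded in the footer key/value metadata.
fn append_index(file: &mut Vec<u8>, index_bytes: &[u8]) -> (u64, u64) {
    let offset = file.len() as u64;       // every byte so far was data pages
    file.write_all(index_bytes).unwrap(); // writes to a Vec<u8> cannot fail
    (offset, index_bytes.len() as u64)
}

/// Read the blob back using the recorded offset and length.
fn read_index(file: &[u8], offset: u64, length: u64) -> &[u8] {
    &file[offset as usize..(offset + length) as usize]
}

fn main() {
    let mut file = b"data pages...".to_vec();
    let (off, len) = append_index(&mut file, b"foo\nbar");
    println!("offset={off} len={len}");
    assert_eq!(read_index(&file, off, len), b"foo\nbar".as_slice());
}
```

If the writer's byte count and the buffer contents ever disagree (the bug the arrow-rs change fixes), the offset recorded here would point into the middle of the wrong structure.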




Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-18 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2154205143


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,363 @@
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+
+/// We should disable page index support in the Parquet reader
+/// when we enable this feature, since we are using a custom index.
+///
+/// Example creating the parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚        Standard Parquet
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚        Data / pages
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚         ...          β”‚
+/// β”‚                      β”‚
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚ ┏━━━━━━━━━━━━━━━━━━┓ β”‚
+/// β”‚ ┃                  ┃ β”‚        key/value metadata
+/// β”‚ ┃  Special Index   ┃◀┼─────── that points at the
+/// β”‚ ┃                  ┃ β”‚        special index
+/// β”‚ ┗━━━━━━━━━━━━━━━━━━┛ β”‚
+/// β”‚ ╔══════════════════╗ β”‚
+/// β”‚ β•‘                  β•‘ β”‚
+/// β”‚ β•‘  Parquet Footer  β•‘β—€β”Όβ”€β”€β”€β”€β”€β”€ Footer includes
+/// β”‚ β•‘                  β•‘ β”‚        thrift-encoded
+/// β”‚ β•‘                  β•‘ β”‚        ParquetMetadata
+/// β”‚ β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β• β”‚
+/// β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+///
+///        Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+
+        for entry in read_dir(&dir)? {
+            let path = entry?.path();
+            if path.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let file_name = path.file_name().unwrap().to_string_lossy().to_string();
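The footer round-trip that `try_new` performs per file can be sketched with std alone: given the key/value pairs read from the footer, seek to the recorded offset and read the blob. A `Cursor` stands in for the real file here, and plain `std::io` types replace `SerializedFileReader` and DataFusion's `Result`:

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Locate and read the custom index using the footer key/value metadata
/// (`distinct_index_offset` / `distinct_index_length`), as `try_new` does.
fn read_distinct_index<R: Read + Seek>(
    reader: &mut R,
    kv: &[(String, String)],
) -> std::io::Result<Vec<u8>> {
    // Look up a key and parse its value as a u64.
    let get = |key: &str| {
        kv.iter()
            .find(|(k, _)| k == key)
            .and_then(|(_, v)| v.parse::<u64>().ok())
            .ok_or_else(|| std::io::Error::new(std::io::ErrorKind::NotFound, key.to_string()))
    };
    let offset = get("distinct_index_offset")?;
    let length = get("distinct_index_length")?;
    reader.seek(SeekFrom::Start(offset))?; // jump past the data pages
    let mut buf = vec![0u8; length as usize];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    // Fake "file": 10 bytes of data pages, then the index blob.
    let file = [b"0123456789".as_slice(), b"foo\nbar".as_slice()].concat();
    let kv = vec![
        ("distinct_index_offset".to_string(), "10".to_string()),
        ("distinct_index_length".to_string(), "7".to_string()),
    ];
    let blob = read_distinct_index(&mut Cursor::new(file), &kv).unwrap();
    assert_eq!(blob, b"foo\nbar".as_slice());
}
```

Only the footer metadata and the small index blob are touched; the data pages themselves are never read during this scan.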

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-17 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153573700


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚        Standard Parquet
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚        Data / pages
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚         ...          β”‚
+/// β”‚                      β”‚
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚     DataPage     β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚ ┏━━━━━━━━━━━━━━━━━━┓ β”‚
+/// β”‚ ┃                  ┃ β”‚        key/value metadata
+/// β”‚ ┃  Special Index   ┃◀┼─────── that points at the
+/// β”‚ ┃                  ┃ β”‚        special index
+/// β”‚ ┗━━━━━━━━━━━━━━━━━━┛ β”‚
+/// β”‚ ╔══════════════════╗ β”‚
+/// β”‚ β•‘                  β•‘ β”‚
+/// β”‚ β•‘  Parquet Footer  β•‘β—€β”Όβ”€β”€β”€β”€β”€β”€ Footer includes
+/// β”‚ β•‘                  β•‘ β”‚        thrift-encoded
+/// β”‚ β•‘                  β•‘ β”‚        ParquetMetadata
+/// β”‚ β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β• β”‚
+/// β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+///
+///        Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD

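This earlier revision stored the index bytes under the `distinct_index_data` key as an encoded string, since key/value metadata values must be text. A sketch of that metadata round-trip, using hex instead of base64 purely to keep the snippet dependency-free (the PR itself uses the `base64` crate):

```rust
/// Encode raw index bytes into a string suitable for a key/value metadata entry.
fn encode_for_metadata(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Decode the metadata string back into the original bytes.
fn decode_from_metadata(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None; // not a valid encoding
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let kv_value = encode_for_metadata(b"foo\nbar");
    assert_eq!(decode_from_metadata(&kv_value).unwrap(), b"foo\nbar".as_slice());
    println!("distinct_index_data = {kv_value}");
}
```

The later revision moved the blob out of the footer entirely, keeping only the offset and length in metadata, which avoids bloating the thrift-encoded footer with the index payload.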
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-17 Thread via GitHub


alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153064233


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-15 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2148133565


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new parquet file, extract and deserialize the index from 
footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚        DataPage        β”‚ β”‚     Standard Parquet
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚     Data / pages
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚        DataPage        β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚            ...             β”‚
+/// β”‚                            β”‚
+/// β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
+/// β”‚ β”‚        DataPage        β”‚ β”‚
+/// β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
+/// β”‚ ┏━━━━━━━━━━━━━━━━━━━━━━━━┓ β”‚
+/// β”‚ ┃                        ┃ β”‚     key/value metadata
+/// β”‚ ┃      Special Index     ┃◀┼──── that points at the
+/// β”‚ ┃                        ┃ β”‚     special index
+/// β”‚ ┗━━━━━━━━━━━━━━━━━━━━━━━━┛ β”‚
+/// β”‚ ╔════════════════════════╗ β”‚
+/// β”‚ β•‘                        β•‘ β”‚
+/// β”‚ β•‘     Parquet Footer     β•‘ β”‚     Footer includes
+/// β”‚ β•‘                        β•‘β—€β”Όβ”€β”€β”€β”€ thrift-encoded
+/// β”‚ β•‘                        β•‘ β”‚     ParquetMetadata
+/// β”‚ β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β• β”‚
+/// β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+///
+///           Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+schema: SchemaRef,
+index: HashMap<String, HashSet<String>>,
+dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+/// Scan a directory, read each file's footer metadata into a map
+fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+let dir = dir.into();
+let mut index = HashMap::new();
+for entry in read_dir(&dir)? {
+let p = entry?.path();
+if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+continue;
+}
+let name = p.file_name().unwrap().to_string_lossy().into_owned();
+let reader = SerializedFileReader::new(File::open(&p)?)?;
+if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+let raw = general_purpose::STANDARD_NO_PAD
+ 
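Step 3 above, serializing the distinct set to bytes before they are base64-encoded into the `distinct_index_data` metadata entry, can be sketched as a std-only round trip. The newline-joined format here is an assumption for illustration, not necessarily the exact byte layout the example uses:

```rust
use std::collections::HashSet;

// Hypothetical on-disk format: distinct values joined by '\n'.
// The example additionally base64-encodes these bytes (via
// general_purpose::STANDARD_NO_PAD) before storing them under the
// "distinct_index_data" key/value metadata entry in the footer.
fn serialize_distinct_index(values: &HashSet<String>) -> Vec<u8> {
    let mut sorted: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
    sorted.sort_unstable(); // deterministic bytes for stable files
    sorted.join("\n").into_bytes()
}

fn deserialize_distinct_index(bytes: &[u8]) -> HashSet<String> {
    String::from_utf8_lossy(bytes)
        .lines()
        .map(|l| l.to_string())
        .collect()
}

fn main() {
    let distinct: HashSet<String> =
        ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    let bytes = serialize_distinct_index(&distinct);
    let roundtrip = deserialize_distinct_index(&bytes);
    assert_eq!(distinct, roundtrip);
    println!("round-trip ok: {} values", roundtrip.len());
}
```

Because the index lives in an ordinary key/value metadata entry, readers that do not know the key simply ignore it, which is what keeps the file readable by any standard Parquet implementation.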

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-15 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2148132009


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-15 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r214877


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-14 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2146810315


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
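Step 5, using the decoded index to prune files before scanning, amounts to a containment check per file. A minimal std-only sketch with a hypothetical `prune_files` helper (the real provider instead filters `PartitionedFile`s inside `TableProvider::scan` after matching a `col = literal` predicate):

```rust
use std::collections::{HashMap, HashSet};

/// Given the per-file distinct index and an equality-predicate value,
/// keep only the files whose distinct set contains that value.
/// (Hypothetical helper for illustration; not the example's exact API.)
fn prune_files(
    index: &HashMap<String, HashSet<String>>,
    target: &str,
) -> Vec<String> {
    let mut keep: Vec<String> = index
        .iter()
        .filter(|(_, distinct)| distinct.contains(target))
        .map(|(file, _)| file.clone())
        .collect();
    keep.sort(); // deterministic scan order
    keep
}

fn main() {
    let mut index = HashMap::new();
    index.insert("f1.parquet".to_string(), HashSet::from(["a".to_string()]));
    index.insert("f2.parquet".to_string(), HashSet::from(["b".to_string()]));
    // Only f2.parquet can contain rows where the column equals "b".
    assert_eq!(prune_files(&index, "b"), vec!["f2.parquet".to_string()]);
    println!("files kept: {:?}", prune_files(&index, "b"));
}
```

Files that cannot match are never opened, so the cost of a selective query scales with the number of matching files rather than the total file count.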

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-13 Thread via GitHub


zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2144937636


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-13 Thread via GitHub


alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2144899471


##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new parquet file, extract and deserialize the index from footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating a Parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────┐
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │      Standard Parquet
+/// │└───────────────────┘ │      Data / pages
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │        ...           │
+/// │                      │
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │┏━━━━━━━━━━━━━━━━━━━┓ │
+/// │┃                   ┃ │     key/value metadata
+/// │┃   Special Index   ┃◀┼───  that points at the
+/// │┃                   ┃ │     special index
+/// │┗━━━━━━━━━━━━━━━━━━━┛ │
+/// │╔═══════════════════╗ │
+/// │║                   ║ │
+/// │║  Parquet Footer   ║ │     Footer includes
+/// │║                   ║─┼───  thrift-encoded
+/// │║                   ║ │     ParquetMetadata
+/// │╚═══════════════════╝ │
+/// └──────────────────────┘
+///
+///       Parquet File
+/// ```
+/// `DistinctIndexTable` is a custom `TableProvider` that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD
+