Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3034804227 Thank you @alamb, I will submit a draft blog soon in: https://github.com/apache/datafusion/issues/16372 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3033685921 This is so great -- now we just need to write up a blog post. Thanks again @zhuqi-lucas -- this is going to be great
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb merged PR #16395: URL: https://github.com/apache/datafusion/pull/16395
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
jcsherin commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027788574 @alamb The overview documentation is very clear, and I love the ASCII art.
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
jcsherin commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2179977087
##
datafusion-examples/examples/parquet_embedded_index.rs:
##
@@ -0,0 +1,472 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Embedding and using a custom index in Parquet files
+//!
+//! # Background
+//!
+//! This example shows how to add an application-specific index to an Apache
+//! Parquet file without modifying the Parquet format itself. The resulting
+//! files can be read by any standard Parquet reader, which will simply
+//! ignore the extra index data.
+//!
+//! A "distinct value" index, similar to a ["set" Skip Index in ClickHouse],
+//! is stored in a custom binary format within the Parquet file. Only the
+//! location of the index is stored in the Parquet footer key/value metadata.
+//! This approach is more efficient than storing the index itself in the footer
+//! metadata because the footer must be read and parsed by all readers,
+//! even those that do not use the index.
+//!
+//! The resulting Parquet file layout is as follows:
+//!
+//! ```text
+//!                     ┌──────────────────────┐
+//!                     │┌───────────────────┐ │
+//!                     ││     DataPage      │ │
+//! Standard Parquet    │└───────────────────┘ │
+//! Data Pages          │┌───────────────────┐ │
+//!                     ││     DataPage      │ │
+//!                     │└───────────────────┘ │
+//!                     │        ...           │
+//!                     │┌───────────────────┐ │
+//!                     ││     DataPage      │ │
+//!                     │└───────────────────┘ │
+//! Non standard        │┌───────────────────┐ │
+//! index (ignored by   ││Custom Binary Index│◀┼─ ─ ┐
+//! other Parquet       ││ (Distinct Values) │ │
+//! readers)            │└───────────────────┘ │    │
+//! Standard Parquet    │┌───────────────────┐ │      key/value metadata
+//! Page Index          ││    Page Index     │ │    │ contains location
+//!                     │└───────────────────┘ │      of special index
+//!                     │┌───────────────────┐ │    │
+//!                     ││ Parquet Footer w/ │ │
+//!                     ││     Metadata      │ ┼─ ─ ┘
+//!                     ││ (Thrift Encoded)  │ │
+//!                     │└───────────────────┘ │
+//!                     └──────────────────────┘
+//!
+//!                           Parquet File
+//! ```
+//!
+//! # High Level Flow
+//!
+//! To create a custom Parquet index:
+//!
+//! 1. Compute the index and serialize it to a binary format.
+//!
+//! 2. Write the Parquet file with:
+//!    - regular data pages
+//!    - the serialized index inline
+//!    - a footer key/value metadata entry to locate the index
+//!
+//! To read and use the index:
+//!
+//! 1. Read and deserialize the file's footer to locate the index.
+//!
+//! 2. Read and deserialize the index.
+//!
+//! 3. Create a `TableProvider` that knows how to use the index to quickly find
+//!    the relevant files, row groups, data pages or rows based on pushed down
+//!    filters.
+//!
+//! # FAQ: Why do other Parquet readers skip over the custom index?
+//!
+//! The flow for reading a Parquet file is:
+//!
+//! 1. Seek to the end of the file and read the last 8 bytes (a 4-byte
+//!    little-endian footer length followed by the `PAR1` magic bytes).
+//!
+//! 2. Seek backwards by that length to parse the Thrift-encoded footer
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027610547 > * 54a9e61 Thank you @alamb looks great to me! > Simplified the code to only write the offset index (the length is stored inline) Perfect for this change!
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027590526 I think it is now ready to merge, but it would probably be good for someone else to go over it one last time to make sure it is clear
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3027588395 Hi @zhuqi-lucas -- spent a while this morning going over this PR carefully -- it is great! I hope you don't mind but I made some substantial edits to try and make it read a bit better: 1. Revamped the documentation and overview 2. Updated the ASCII art 3. Moved reading the index into DistinctIndex 4. Added a bunch more comments 5. Simplified the code to only write the offset index (the length is stored inline) In my mind none of this was required, but since I plan to make a Huge Deal (TM) about this example publicly I figured spending some extra time polishing it would be worthwhile - [54a9e61](https://github.com/apache/datafusion/pull/16395/commits/54a9e610ed64448f95fe8526129871c63a8efcff)
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3026260872 > Thank you @zhuqi-lucas -- I started going through this PR again in detail > > I renamed the example to align with the other parquet examples, and I added it to the list of examples. > > I also took a pass through the comments. > > I have run out of time today, but I'll finish it up first thing tomorrow and hopefully merge > > Thank you again so much! Thank you very much @alamb ! This looks pretty good.
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2178933212
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom "distinct values" index in Parquet files
+//!
+//! This example shows how to build and leverage a file-level distinct-values index
+//! for pruning in DataFusion's Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little-endian length, to identify our custom index format
+//!    - the serialized distinct-values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file's footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename → `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn't match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application-specific index directly in Parquet
+//! metadata to achieve efficient file-level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: we add no extra payload to the footer metadata itself, we
+//! write the custom index after the data pages, and we only read it when needed.
+//!
+//! **Compatibility note: why other Parquet readers simply skip over our extra index blob**
+//!
+//! Any standard Parquet reader will:
+//! 1. Seek to the end of the file and read the last 8 bytes (a 4-byte little-endian footer length followed by the `PAR1` magic).
+//! 2. Seek backwards by that length to parse only the Thrift-encoded footer metadata (including key/value pairs).
+//!
+//! Since our custom index bytes are appended *before* the footer (and we do not alter Parquet's metadata schema), readers
+//! never scan from the file start or "overflow" into our blob. They will encounter two unknown keys
+//! (`distinct_index_offset` and `distinct_index_length`) in the footer metadata, ignore them (or expose them as extra metadata),
+//! and will not attempt to read or deserialize the raw index bytes.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a page-index offset
+///
+/// Note: the page index offset will come after the custom index, which
+/// is itself written after the data pages.
+///
+/// ```text
+/// ┌──────────────────────┐
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │  Standard Parquet
+/// │└───────────────────┘ │  Data pages
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │        ...
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2178596921
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-3021686409 Updated the code with the merged PR: https://github.com/apache/datafusion/pull/16575 And also added more comments.
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993532737 > How does it ensure that this extra index can be safely ignored by other readers? If another parquet reader implementation decides to do a sequential whole file scan, will it read into the extra custom data? I agree with what @zhuqi-lucas says too. The way I think about this is that the parquet file's footer contains pointers (offsets) to the actual data in the file. There is no requirement that the footer points to all bytes within the file. There are other interesting things that can be done with this setup too (for example, concatenating parquet files together without having to re-encode the data: you can just copy the bytes around and rewrite the footer)
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993537466 This is amazing @alamb ! Thanks! > There are other interesting things that can be done with this setup too (for example, concatenating parquet files together without having to re-encode the data: you can just copy the bytes around and rewrite the footer)
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993531597 FYI @XiangpengHao and @JigaoLuo -- here is another example of the somewhat crazy things you can do with parquet
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993484604

> wow this is so cool!
>
> I have a question (and I think it's worth adding to the comment for people like me that's not familiar with parquet internals): How does it ensure that this extra index can be safely ignored by other readers? If another parquet reader implementation decides to do a sequential whole file scan, will it read into the extra custom data?

Thank you for the review and great question, @2010YOUY01!

**Short answer:** Because we append our custom index *before* the Parquet footer and never modify the existing metadata schema, Parquet readers will still:

1. Seek to the **end of file** and read the last 8 bytes, which consist of:
   - A 4-byte little-endian footer length
   - The magic marker `PAR1`
2. Jump back by that length to parse the Thrift-encoded footer (and its key/value list).

Any bytes you append *ahead* of the footer (i.e. after the data pages but before writing the footer and magic) are simply skipped over by steps (1) and (2), because readers never scan from the file start; they always locate the footer via the trailer magic and length.

**Why key/value metadata is safe:**

- We only **add** two new keys (`distinct_index_offset` and `distinct_index_length`) to the existing footer metadata map.
- All standard readers will see unknown keys and either ignore them or surface them as "extra metadata", but they will not attempt to deserialize our custom binary blob.
- On our side, we:
  1. Read the Parquet footer as usual.
  2. Extract our two key/value entries for offset and length.
  3. `seek(offset)` + `read_exact(length)` to load the custom index and deserialize it.

Because every compliant Parquet reader must interpret the `PAR1` magic and footer length, none of them will ever "spill over" into our blob or treat it as data pages. I'll add these details into the code comments. We're also planning a blog post on Parquet indexing internals suggested by @alamb, thanks!
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
2010YOUY01 commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993326084 wow this is so cool! I have a question (and I think it's worth adding to the comment for people like me that's not familiar with parquet internals): How does it ensure that this extra index can be safely ignored by other readers? If another parquet reader implementation decides to do a sequential whole file scan, will it read into the extra custom data?
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159141300
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom "distinct values" index in Parquet files
+//!
+//! This example shows how to build and leverage a file-level distinct-values index
+//! for pruning in DataFusion's Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little-endian length, to identify our custom index format
+//!    - the serialized distinct-values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file's footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename → `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn't match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application-specific index directly in Parquet
+//! metadata to achieve efficient file-level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: we add no extra payload to the footer metadata itself, we write the custom index
+//! after the data pages, and we only read it when needed.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowSchemaConverter;
+use datafusion::parquet::data_type::{ByteArray, ByteArrayType};
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::parquet::file::writer::SerializedFileWriter;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a page-index offset
+///
+/// Note: the page index offset will come after the custom index, which
+/// is itself written after the data pages.
+///
+/// ```text
+/// ┌──────────────────────┐
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │   Standard Parquet
+/// │└───────────────────┘ │   Data pages
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │        ...           │
+/// │                      │
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │┌───────────────────┐ │
+/// ││                   │ │   key/value metadata
+/// ││   Special Index   │◀┼── that points to the
+/// ││                   │ │   custom index blob
+/// │└───────────────────┘ │
+/// │┌───────────────────┐ │
+/// ││ Page Index Offset │◀┼── little-endian u64
+/// │└───────────────────┘ │   sitting after the custom index
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2991840133 > Thank you @zhuqi-lucas -- this is (really) cool. It is definitely blog post worthy (we have too many cool things that are blog worthy recently - and not enough time to write the blogs!) > > Anyhow I left some other suggestions and will prioritize getting this PR in upstream > > * [Support write to buffer api for SerializedFileWriter arrow-rs#7714](https://github.com/apache/arrow-rs/pull/7714) Thank you @alamb, I can try to write a blog about this!
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395: URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159136194 ## datafusion-examples/examples/embedding_parquet_indexes.rs: ## @@ -0,0 +1,380 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Example: embedding and using a custom "distinct values" index in Parquet files +//! +//! This example shows how to build and leverage a file-level distinct-values index +//! for pruning in DataFusion's Parquet scans. Review Comment: Good suggestion, thank you @alamb !
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159119104
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom "distinct values" index in Parquet files
+//!
+//! This example shows how to build and leverage a file-level distinct-values index
+//! for pruning in DataFusion's Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little-endian length, to identify our custom index format
+//!    - the serialized distinct-values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file's footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename -> `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn't match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application-specific index directly in Parquet
+//! metadata to achieve efficient file-level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: nothing extra is added to the footer metadata beyond the
+//! offset/length entries, the custom index is written after the data pages, and it is
+//! only read when needed.
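The framing in step 2 can be sketched on its own. This is a minimal, illustrative round-trip (the helper names and the newline-joined payload are assumptions of this sketch, not the example's exact code): the blob is the `IDX1` magic, a little-endian `u64` payload length, then the serialized distinct values.

```rust
use std::collections::HashSet;

// Magic marker identifying the custom index format (from the example's docs).
const INDEX_MAGIC: &[u8] = b"IDX1";

/// Frame a set of distinct values as: magic | little-endian u64 length | payload.
/// The newline-joined payload is an assumption of this sketch.
fn serialize_distinct_index(values: &HashSet<String>) -> Vec<u8> {
    let mut sorted: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
    sorted.sort_unstable(); // deterministic ordering for the demo
    let payload = sorted.join("\n").into_bytes();

    let mut blob = Vec::with_capacity(INDEX_MAGIC.len() + 8 + payload.len());
    blob.extend_from_slice(INDEX_MAGIC);
    blob.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    blob.extend_from_slice(&payload);
    blob
}

/// Inverse: check the magic, read the length, split the payload back out.
fn deserialize_distinct_index(blob: &[u8]) -> Option<HashSet<String>> {
    let rest = blob.strip_prefix(INDEX_MAGIC)?;
    if rest.len() < 8 {
        return None;
    }
    let (len_bytes, payload) = rest.split_at(8);
    let len = u64::from_le_bytes(len_bytes.try_into().ok()?) as usize;
    if payload.len() != len {
        return None;
    }
    let text = std::str::from_utf8(payload).ok()?;
    Some(text.split('\n').map(str::to_string).collect())
}

fn main() {
    let distinct: HashSet<String> =
        ["foo", "bar"].iter().map(|s| s.to_string()).collect();
    let blob = serialize_distinct_index(&distinct);
    assert!(blob.starts_with(b"IDX1"));
    assert_eq!(deserialize_distinct_index(&blob), Some(distinct));
}
```

On the write side, the example then records where this blob landed via the footer entries `distinct_index_offset` / `distinct_index_length`, so readers that don't know the format simply skip it.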
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowSchemaConverter;
+use datafusion::parquet::data_type::{ByteArray, ByteArrayType};
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::parquet::file::writer::SerializedFileWriter;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
+/// contains specialized indexes and a pageβindex offset
+///
+/// Note: the page index offset comes after the custom index, which
+/// is itself written after the data pages.
+///
+/// ```text
+/// ┌──────────────────────────┐
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │  Standard Parquet
+/// │ └──────────────────────┘ │  data pages
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │           ...            │
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │ ┌──────────────────────┐ │
+/// │ │    Special Index     │ │◀── key/value metadata that
+/// │ └──────────────────────┘ │    points to the custom index blob
+/// │ ┌──────────────────────┐ │
+/// │ │  Page Index Offset   │ │◀── little-endian u64 written
+/// │ └──────────────────────┘ │    after the custom index
+/// └──────────────────────────┘
+///
+///         Parquet File
+/// ```
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2159067305
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom "distinct values" index in Parquet files
+//!
+//! This example shows how to build and leverage a file-level distinct-values index
+//! for pruning in DataFusion's Parquet scans.
+//!
+//! Steps:
+//! 1. Compute the distinct values for a target column and serialize them into bytes.
+//! 2. Write each Parquet file with:
+//!    - regular data pages for your column
+//!    - the magic marker `IDX1` and a little-endian length, to identify our custom index format
+//!    - the serialized distinct-values bytes
+//!    - footer key/value metadata entries (`distinct_index_offset` and `distinct_index_length`)
+//! 3. Read back each file's footer metadata to locate and deserialize the index.
+//! 4. Build a `DistinctIndexTable` (a custom `TableProvider`) that scans footers
+//!    into a map of filename -> `HashSet` of distinct values.
+//! 5. In `scan()`, prune out any Parquet files whose distinct set doesn't match the
+//!    `category = 'X'` filter, then only read data from the remaining files.
+//!
+//! This technique embeds a lightweight, application-specific index directly in Parquet
+//! metadata to achieve efficient file-level pruning without modifying the Parquet format.
+//!
+//! It is also very efficient: nothing extra is added to the footer metadata beyond the
+//! offset/length entries, the custom index is written after the data pages, and it is
+//! only read when needed.
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowSchemaConverter;
+use datafusion::parquet::data_type::{ByteArray, ByteArrayType};
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::parquet::file::writer::SerializedFileWriter;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+///
+/// Example creating the Parquet file that
Review Comment:
This is amazing -- thank you
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,380 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding and using a custom "distinct values" index in Parquet files
+//!
+//! This example shows how to build and leverage a file-level distinct-values index
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on PR #16395: URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2991749429 > The example print logs, it's good, thanks! this is so cool!
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2987414899
Thank you @alamb,
I am excited to report today that I resolved the page index conflicts by
adding a new API in arrow-rs that can write bytes to the buffer. This keeps
the buffer-written metrics consistent, and the same buffer is used by the
page index as well, so it is safe now. I have enabled the page index for
the example, and the test results are good!
I am currently using this arrow-rs branch until the code is merged:
https://github.com/apache/arrow-rs/pull/7714
The example prints logs; it's good, thanks!
```text
Writing values: [ByteArray { data: "foo" }, ByteArray { data: "bar" },
ByteArray { data: "foo" }]
Writing custom index at offset: 68, length: 7
Finished writing file to
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/a.parquet
Writing values: [ByteArray { data: "baz" }, ByteArray { data: "qux" }]
Writing custom index at offset: 68, length: 7
Finished writing file to
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/b.parquet
Writing values: [ByteArray { data: "foo" }, ByteArray { data: "quux" },
ByteArray { data: "quux" }]
Writing custom index at offset: 70, length: 8
Finished writing file to
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/c.parquet
Reading index from
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/a.parquet (size:
363)
Reading index at offset: 68, length: 7
Read distinct index for a.parquet: "a.parquet"
Reading index from
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/b.parquet (size:
363)
Reading index at offset: 68, length: 7
Read distinct index for b.parquet: "b.parquet"
Reading index from
/var/folders/q7/zjtv8rvx2hz0_t_rjjq8p9k0gp/T/.tmp9zCIJt/c.parquet (size:
368)
Reading index at offset: 70, length: 8
Read distinct index for c.parquet: "c.parquet"
Filtering for category: foo
Pruned files: ["c.parquet", "a.parquet"]
+----------+
| category |
+----------+
| foo      |
| foo      |
| foo      |
+----------+
```
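The log above shows three files being indexed and then only `a.parquet` and `c.parquet` surviving the `category = 'foo'` filter. The pruning decision itself reduces to a set-membership check per file. A sketch, assuming the provider keeps a filename-to-distinct-set map as step 4 of the example describes (`prune_files` is an illustrative name, and the real `scan()` additionally handles the no-filter case and builds a `FileScanConfig`):

```rust
use std::collections::{HashMap, HashSet};

/// Keep only the files whose distinct-value set contains the literal from a
/// `category = 'X'` filter. Files without an index entry are kept
/// conservatively, since we cannot prove they lack the value.
fn prune_files(
    index: &HashMap<String, HashSet<String>>,
    filter_value: &str,
    all_files: &[String],
) -> Vec<String> {
    all_files
        .iter()
        .filter(|f| index.get(*f).map_or(true, |set| set.contains(filter_value)))
        .cloned()
        .collect()
}

fn main() {
    // Mirror the three files from the example's log output.
    let mut index = HashMap::new();
    index.insert(
        "a.parquet".to_string(),
        HashSet::from(["foo".to_string(), "bar".to_string()]),
    );
    index.insert(
        "b.parquet".to_string(),
        HashSet::from(["baz".to_string(), "qux".to_string()]),
    );
    index.insert(
        "c.parquet".to_string(),
        HashSet::from(["foo".to_string(), "quux".to_string()]),
    );

    let files: Vec<String> =
        ["a.parquet", "b.parquet", "c.parquet"].map(String::from).to_vec();
    let kept = prune_files(&index, "foo", &files);
    assert_eq!(kept, vec!["a.parquet", "c.parquet"]); // b.parquet is pruned
}
```

This is why the query over `category = 'foo'` only reads two of the three files: `b.parquet`'s distinct set rules it out before any data pages are touched.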
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2154205143
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,363 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new Parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowSchemaConverter;
+use datafusion::parquet::data_type::{ByteArray, ByteArrayType};
+use datafusion::parquet::errors::ParquetError;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::parquet::file::writer::SerializedFileWriter;
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Page index support should be disabled in the Parquet reader
+/// when this feature is enabled, since we are using a custom index.
+///
+/// Example creating the parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────────┐
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │  Standard Parquet
+/// │ └──────────────────────┘ │  Data / pages
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │           ...            │
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │ ┌──────────────────────┐ │
+/// │ │    Special Index     │ │◀── key/value metadata
+/// │ └──────────────────────┘ │    that points at the
+/// │ ┌──────────────────────┐ │    special index
+/// │ │    Parquet Footer    │ │◀── Footer includes
+/// │ └──────────────────────┘ │    thrift-encoded
+/// └──────────────────────────┘    ParquetMetadata
+///
+///         Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+
+        for entry in read_dir(&dir)? {
+            let path = entry?.path();
+            if path.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let file_name = path.file_name().unwrap().to_string_lossy().to_string
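The read path quoted above (truncated) first pulls `distinct_index_offset` and `distinct_index_length` out of the footer key/value metadata, then reads exactly that byte range. The seek-and-read step itself can be isolated over a plain file; the file layout and helper name below are illustrative, not the example's exact code:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Read `length` bytes starting at `offset`: the access pattern used to pull
/// the custom index blob out of a Parquet file once the footer metadata has
/// said where it lives.
fn read_index_blob(path: &Path, offset: u64, length: usize) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    file.read_exact(&mut buf)?; // errors if the range runs past EOF
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Stand-in file: 68 bytes of pretend data pages, then a 7-byte index
    // blob, mirroring the "offset: 68, length: 7" printed in the log above.
    let path = std::env::temp_dir().join("embedding_index_demo.bin");
    let mut f = File::create(&path)?;
    f.write_all(&[0u8; 68])?;
    f.write_all(b"foo\nbar")?;
    drop(f);

    let blob = read_index_blob(&path, 68, 7)?;
    assert_eq!(blob, b"foo\nbar");
    Ok(())
}
```

Because the blob sits between the data pages and the footer, readers that ignore the metadata keys never even touch these bytes.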
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153573700
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new Parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────────┐
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │  Standard Parquet
+/// │ └──────────────────────┘ │  Data / pages
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │           ...            │
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │ ┌──────────────────────┐ │
+/// │ │    Special Index     │ │◀── key/value metadata
+/// │ └──────────────────────┘ │    that points at the
+/// │ ┌──────────────────────┐ │    special index
+/// │ │    Parquet Footer    │ │◀── Footer includes
+/// │ └──────────────────────┘ │    thrift-encoded
+/// └──────────────────────────┘    ParquetMetadata
+///
+///         Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD
+
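The earlier revision quoted here took a different route: it stored the whole serialized set as a footer key/value entry (`distinct_index_data`), base64-encoded because footer metadata values must be strings. The store-in-a-string idea can be sketched without the `base64` crate by using hex as a stand-in text encoding; the helper names, the hex encoding, and the newline-joined payload are all assumptions of this sketch:

```rust
use std::collections::HashSet;

/// Footer key/value metadata values are strings, so a binary index blob must
/// be text-encoded before embedding. The PR's early revision used base64
/// under the key `distinct_index_data`; hex is used here only to keep the
/// sketch dependency-free.
fn encode_for_footer(values: &HashSet<String>) -> String {
    let mut sorted: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
    sorted.sort_unstable(); // deterministic output
    sorted
        .join("\n")
        .bytes()
        .map(|b| format!("{b:02x}"))
        .collect()
}

/// Decode the footer string back into the distinct-value set.
fn decode_from_footer(encoded: &str) -> Option<HashSet<String>> {
    let bytes: Option<Vec<u8>> = (0..encoded.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(encoded.get(i..i + 2)?, 16).ok())
        .collect();
    let text = String::from_utf8(bytes?).ok()?;
    Some(text.split('\n').map(str::to_string).collect())
}

fn main() {
    let distinct: HashSet<String> =
        ["foo", "bar"].iter().map(|s| s.to_string()).collect();
    let kv_value = encode_for_footer(&distinct);
    assert_eq!(decode_from_footer(&kv_value), Some(distinct));
}
```

The later revision in this PR moved the blob out of the footer and into the file body, keeping only offset/length entries in metadata, which avoids inflating the footer when the index is large.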
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153064233
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new Parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────────┐
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │  Standard Parquet
+/// │ └──────────────────────┘ │  Data / pages
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │           ...            │
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │ ┌──────────────────────┐ │
+/// │ │    Special Index     │ │◀── key/value metadata
+/// │ └──────────────────────┘ │    that points at the
+/// │ ┌──────────────────────┐ │    special index
+/// │ │    Parquet Footer    │ │◀── Footer includes
+/// │ └──────────────────────┘ │    thrift-encoded
+/// └──────────────────────────┘    ParquetMetadata
+///
+///         Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD
+
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2148133565
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new Parquet file, extract and deserialize the index from the footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────────┐
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │  Standard Parquet
+/// │ └──────────────────────┘ │  Data / pages
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │           ...            │
+/// │ ┌──────────────────────┐ │
+/// │ │       DataPage       │ │
+/// │ └──────────────────────┘ │
+/// │ ┌──────────────────────┐ │
+/// │ │    Special Index     │ │◀── key/value metadata
+/// │ └──────────────────────┘ │    that points at the
+/// │ ┌──────────────────────┐ │    special index
+/// │ │    Parquet Footer    │ │◀── Footer includes
+/// │ └──────────────────────┘ │    thrift-encoded
+/// └──────────────────────────┘    ParquetMetadata
+///
+///         Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD
+
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2148133565
##
datafusion-examples/examples/embedding_parquet_indexes.rs:
##
@@ -0,0 +1,243 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Example: embedding a "distinct values" index in a Parquet file's metadata
+//!
+//! 1. Read existing Parquet files
+//! 2. Compute distinct values for a target column using DataFusion
+//! 3. Serialize the distinct index to bytes and write to the new Parquet file
+//!    with these encoded bytes appended as a custom metadata entry
+//! 4. Read each new Parquet file, extract and deserialize the index from the
+//!    footer
+//! 5. Use the distinct index to prune files when querying
+
+use arrow::array::{ArrayRef, StringArray};
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use async_trait::async_trait;
+use base64::engine::general_purpose;
+use base64::Engine;
+use datafusion::catalog::{Session, TableProvider};
+use datafusion::common::{HashMap, HashSet, Result};
+use datafusion::datasource::listing::PartitionedFile;
+use datafusion::datasource::memory::DataSourceExec;
+use datafusion::datasource::physical_plan::{FileScanConfigBuilder, ParquetSource};
+use datafusion::datasource::TableType;
+use datafusion::execution::object_store::ObjectStoreUrl;
+use datafusion::logical_expr::{Operator, TableProviderFilterPushDown};
+use datafusion::parquet::arrow::ArrowWriter;
+use datafusion::parquet::file::metadata::KeyValue;
+use datafusion::parquet::file::properties::WriterProperties;
+use datafusion::parquet::file::reader::{FileReader, SerializedFileReader};
+use datafusion::physical_plan::ExecutionPlan;
+use datafusion::prelude::*;
+use datafusion::scalar::ScalarValue;
+use std::fs::{create_dir_all, read_dir, File};
+use std::path::{Path, PathBuf};
+use std::sync::Arc;
+use tempfile::TempDir;
+
+/// Example creating parquet file that
+/// contains specialized indexes that
+/// are ignored by other readers
+///
+/// ```text
+/// ┌──────────────────────┐
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │      Standard Parquet
+/// │└───────────────────┘ │      Data / pages
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │        ...           │
+/// │                      │
+/// │┌───────────────────┐ │
+/// ││     DataPage      │ │
+/// │└───────────────────┘ │
+/// │┏━━━━━━━━━━━━━━━━━━━┓ │
+/// │┃                   ┃ │     key/value metadata
+/// │┃   Special Index   ┃◀┼───  that points at the
+/// │┃                   ┃ │     special index
+/// │┗━━━━━━━━━━━━━━━━━━━┛ │
+/// │┏━━━━━━━━━━━━━━━━━━━┓ │
+/// │┃                   ┃ │
+/// │┃  Parquet Footer   ┃ │     Footer includes
+/// │┃                   ┃◀┼───  thrift-encoded
+/// │┃                   ┃ │     ParquetMetadata
+/// │┗━━━━━━━━━━━━━━━━━━━┛ │
+/// └──────────────────────┘
+///
+///       Parquet File
+/// ```
+/// DistinctIndexTable is a custom TableProvider that reads Parquet files
+#[derive(Debug)]
+struct DistinctIndexTable {
+    schema: SchemaRef,
+    index: HashMap<String, HashSet<String>>,
+    dir: PathBuf,
+}
+
+impl DistinctIndexTable {
+    /// Scan a directory, read each file's footer metadata into a map
+    fn try_new(dir: impl Into<PathBuf>, schema: SchemaRef) -> Result<Self> {
+        let dir = dir.into();
+        let mut index = HashMap::new();
+        for entry in read_dir(&dir)? {
+            let p = entry?.path();
+            if p.extension().and_then(|s| s.to_str()) != Some("parquet") {
+                continue;
+            }
+            let name = p.file_name().unwrap().to_string_lossy().into_owned();
+            let reader = SerializedFileReader::new(File::open(&p)?)?;
+            if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
+                if let Some(e) = kv.iter().find(|kv| kv.key == "distinct_index_data") {
+                    let raw = general_purpose::STANDARD_NO_PAD
+
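Step 5 of the example's workflow prunes whole files using the distinct index before any data pages are read. A stdlib-only sketch of that pruning decision for an equality predicate (the `DistinctIndex` alias and `prune_files` helper are illustrative names, not APIs from the PR):

```rust
use std::collections::{HashMap, HashSet};

/// Per-file distinct-values index: file name -> distinct values of the
/// indexed column in that file (what the example stores in each footer).
type DistinctIndex = HashMap<String, HashSet<String>>;

/// For an equality predicate `col = value`, keep only files that could
/// match. A file is pruned when its distinct set provably lacks the value;
/// a file with no index entry must be kept, since nothing is known about it.
fn prune_files<'a>(
    index: &DistinctIndex,
    files: &'a [String],
    value: &str,
) -> Vec<&'a String> {
    files
        .iter()
        .filter(|f| index.get(*f).map_or(true, |d| d.contains(value)))
        .collect()
}

fn main() {
    let mut index = DistinctIndex::new();
    index.insert("a.parquet".into(), HashSet::from(["foo".into(), "bar".into()]));
    index.insert("b.parquet".into(), HashSet::from(["baz".into()]));
    let files: Vec<String> =
        vec!["a.parquet".into(), "b.parquet".into(), "c.parquet".into()];

    // Only a.parquet lists "foo"; unindexed c.parquet must also be scanned.
    let kept = prune_files(&index, &files, "foo");
    assert_eq!(kept, vec![&files[0], &files[2]]);
}
```

Note the conservative default: pruning is only sound when the index proves a file cannot match, so files without an index entry are always kept.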
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2148132009
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r214877
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2146810315
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
zhuqi-lucas commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2144937636
Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]
alamb commented on code in PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#discussion_r2144899471
