Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-06 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3041327170

   Thank you @alamb. A minor related topic: I may pick up this PR:
   
   http://github.com/apache/datafusion/pull/13933
   
   The idea is to use this user-defined index, or the Parquet SortColumn 
metadata, so that we can restore the sort info in a better way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-06 Thread via GitHub


alamb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3041322628

   > User-Defined Index.
   
   I think this is a really good term -- I will update the blog post in 
https://github.com/apache/datafusion-site/pull/79 to use that





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-06 Thread via GitHub


alamb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3041320800

   > Thank you [@alamb](https://github.com/alamb) 
[@JigaoLuo](https://github.com/JigaoLuo) 
[@adriangb](https://github.com/adriangb), I agree the current example is a 
start; we can add more advanced examples later!
   
   I also made a PR to clarify the comments in the example
   - https://github.com/apache/datafusion/pull/16692





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3040702910

   Thank you @alamb @JigaoLuo @adriangb, I agree the current example is a 
start; we can add more advanced examples later!
   
   





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039853245

   > > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas),
   > > While proofreading the blog, I had one major general question: **What 
are the limitations of such an embedded index?**
   > > 
   > > * Is it limited to just one embedded index per file?
   > 
   > No -- you could put as many indexes as you want (of course each new index 
will consume space in the file and add something to the metadata)
   > 
   > > * Is it only possible to have a file-level index? (From the example, it 
seems like the hashset index is only applied at the file level.)
   > 
   > No, it is possible to have indexes with whatever granularity you want (
   > 
   > > I imagine other blog readers might have similar questions about the 
limitations—or the potential—of this embedded_index approach.
   > 
   > Yes it is a good point -- we should make sure to point this out on the blog
   > 
   > > If there are no strict limitations, then my follow-up discussion is: 
Could we potentially **supercharge** Parquet with techniques inspired by 
proprietary file formats? For example:
   > > * A true HyperLogLog
   > > * Small materialized aggregates (like precomputed sums at the column 
chunk or data page level) [For example with Clickbench Q3: a global AVG just 
needs the metadata, once the precomputed sum and the total rowcount are there.]
   > > * Even histograms or hashsets at the row group level (which would be a 
much more powerful version of min-max indexing for pruning)
   > 
   > Absolutely! Maybe just for fun someone could cook up special indexes for 
each clickbench query and put it in the files -- and we could show some truly 
crazy speed
   > 
   > The point would not be that any of those indexes is actually general 
purpose, but that parquet lets you put whatever you like in it
   
   Thanks! This gave me the impression of a kind of **User-Defined Index**. I 
can now imagine that users could embed arbitrary binary data into this section 
of Parquet. As long as the Parquet reader knows how to interpret that binary 
using a corresponding **User-Defined Index Function**, it could enable powerful 
capabilities, such as pruning & precomputed results for query processing, or 
even query optimization.
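
To make the "User-Defined Index Function" idea concrete, here is a minimal sketch in plain Python. It is not DataFusion or parquet code: the byte encoding, the index kind name `distinct_values_v1`, and every function name are assumptions made up for illustration. The point is only that the file stores opaque bytes and the reader needs a matching function to interpret them.

```python
import struct
from typing import Callable, Dict, Set

# Hypothetical encoding (an assumption, not a Parquet standard):
# a 4-byte little-endian count, then for each value a 4-byte length
# followed by the UTF-8 bytes of the value.
def encode_distinct_values(values: Set[str]) -> bytes:
    parts = [struct.pack("<I", len(values))]
    for value in sorted(values):
        raw = value.encode("utf-8")
        parts.append(struct.pack("<I", len(raw)))
        parts.append(raw)
    return b"".join(parts)


def decode_distinct_values(blob: bytes) -> Set[str]:
    (count,) = struct.unpack_from("<I", blob, 0)
    pos = 4
    values = set()
    for _ in range(count):
        (length,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        values.add(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return values


# Registry of user-defined index functions: index kind -> decoder.
# The kind name is hypothetical.
INDEX_FUNCTIONS: Dict[str, Callable[[bytes], Set[str]]] = {
    "distinct_values_v1": decode_distinct_values,
}


def file_may_contain(index_kind: str, blob: bytes, needle: str) -> bool:
    """Return False only when the index proves the value is absent."""
    decoder = INDEX_FUNCTIONS.get(index_kind)
    if decoder is None:
        # Unknown index kind: the reader cannot prune and must scan.
        return True
    return needle in decoder(blob)
```

A reader that does not recognise the index kind falls back to scanning, so an unknown index never changes query results.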





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub


adriangb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039806254

   Index suggestion: a tablesample index.
   
   And a general thought: exploring these sorts of indexes could do very cool 
stuff for DataFusion in general in terms of pushing us to develop good APIs to 
make use of things like stats in join optimization.





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub


alamb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047

   > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas),
   > 
   > While proofreading the blog, I had one major general question: **What are 
the limitations of such an embedded index?**
   > 
   > * Is it limited to just one embedded index per file?
   
   
   No -- you could put as many indexes as you want (of course each new index 
will consume space in the file and add something to the metadata)
   
   > * Is it only possible to have a file-level index? (From the example, it 
seems like the hashset index is only applied at the file level.)
   
   No, it is possible to have indexes with whatever granularity you want (
   
   
   > I imagine other blog readers might have similar questions about the 
limitations—or the potential—of this embedded_index approach.
   
   
   Yes it is a good point -- we should make sure to point this out on the blog
   
   > If there are no strict limitations, then my follow-up discussion is: Could 
we potentially **supercharge** Parquet with techniques inspired by proprietary 
file formats? For example:
   > 
   > * A true HyperLogLog
   > * Small materialized aggregates (like precomputed sums at the column chunk 
or data page level) [For example with Clickbench Q3: a global AVG just needs 
the metadata, once the precomputed sum and the total rowcount are there.]
   > * Even histograms or hashsets at the row group level (which would be a 
much more powerful version of min-max indexing for pruning)
   
   
   Absolutely! Maybe just for fun someone could cook up special indexes for 
each clickbench query and put it in the files -- and we could show some truly 
crazy speed
   
   The point would not be that any of those indexes is actually general 
purpose, but that parquet lets you put whatever you like in it
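
As a rough illustration of "as many indexes as you want, at whatever granularity", the sketch below keeps a small catalog of index entries, some covering the whole file and some covering a single row group. The `IndexEntry` type, the `distinct_values` kind, and the function name are hypothetical. A real reader would decode the payloads from bytes stored in the file; here they are kept in memory.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class IndexEntry:
    kind: str                  # hypothetical index kind, e.g. "distinct_values"
    column: str                # column the index describes
    row_group: Optional[int]   # None means the index covers the whole file
    values: FrozenSet[str]     # decoded payload, kept in memory in this sketch


def row_groups_to_scan(entries: List[IndexEntry], column: str,
                       needle: str, num_row_groups: int) -> List[int]:
    """Row groups that may contain `needle` for an equality predicate."""
    # Start by assuming every row group has to be read.
    candidates = set(range(num_row_groups))
    for entry in entries:
        if entry.kind != "distinct_values" or entry.column != column:
            continue
        if needle in entry.values:
            continue  # this index cannot rule anything out
        if entry.row_group is None:
            return []  # the file-level index rules out the whole file
        candidates.discard(entry.row_group)
    return sorted(candidates)
```

Row groups without an index stay in the candidate list, so adding or omitting an index only affects how much is read, never the answer.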





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039204719

   Hi @zhuqi-lucas,
   
   While proofreading the blog, I had one major general question: **What are 
the limitations of such an embedded index?**
   - Is it limited to just one embedded index per file?
   - Is it only possible to have a file-level index? (From the example, it 
seems like the hashset index is only applied at the file level.)
   
   I imagine other blog readers might have similar questions about the 
limitations—or the potential—of this embedded_index approach.
   
   If there are no strict limitations, then my follow-up discussion is: Could 
we potentially **supercharge** Parquet with techniques inspired by proprietary 
file formats? For example:
   - A true HyperLogLog
   - Small materialized aggregates (like precomputed sums at the column chunk 
or data page level) [For example with Clickbench Q3: a global AVG just needs 
the metadata, once the precomputed sum and the total rowcount are there.]
   - Even histograms or hashsets at the row group level (which would be a much 
more powerful version of min-max indexing for pruning)
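
The small-materialized-aggregate idea can be sketched as follows, assuming each row group stores a precomputed `(sum, non_null_count)` pair for the column. The function name and the input shape are assumptions for illustration. Because SQL `AVG` ignores NULLs, the stored count has to be the non-null count rather than the raw row count whenever the column is nullable.

```python
from typing import List, Tuple


def global_avg(aggregates: List[Tuple[float, int]]) -> float:
    """Combine per-row-group (sum, non_null_count) pairs into a global AVG.

    No data pages are read; only the precomputed aggregates are needed.
    """
    total = sum(s for s, _ in aggregates)
    count = sum(n for _, n in aggregates)
    if count == 0:
        raise ValueError("AVG is undefined when there are no non-null values")
    return total / count
```

Sums and counts combine exactly across row groups, whereas averaging the per-row-group averages would weight small and large row groups equally and give a different result.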





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-03 Thread via GitHub


alamb closed issue #16374: Add an example of embedding indexes *inside* a 
parquet file
URL: https://github.com/apache/datafusion/issues/16374





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-21 Thread via GitHub


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391

   @alamb @zhuqi-lucas Thank you for this issue and the PR. This could 
significantly aid query processing on Parquet. 
   
   I was previously **never** aware of `key_value_metadata` and am grateful for 
the insight: today marks my first discovery of its presence in both 
[ColumnMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L900)
 and 
[FileMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1267).
 @alamb's argument also reminded me of a paper from the German DB Conference 
(https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content), 
which says at the end of Section 2.3:
   
   > The only statistics available in Parquet files are the cardinality of the 
contained dataset and each page’s minimum and maximum values. Unfortunately, 
the minimum and maximum values are optional fields, so Parquet writers are not 
forced to use them. ...  These minimum and maximum values, as well as the 
cardinality of the datasets, are the only sources available for performing 
cardinality estimates. Therefore, we get imprecise results since we do not know 
how the data is distributed within the given boundaries. As a consequence, we 
get erroneous cardinality estimates and suboptimal query plans.
   
   > ...  This shows how crucial a good cardinality estimate is for a Parquet 
scan to be an acceptable alternative to database relations. The Parquet scan 
cannot get close to the execution times of database relations as long as the 
query optimizer cannot choose the same query plans for the Parquet files
   
   
   
   In my experience, there’s a **widespread underappreciation for the 
configurability of Parquet files**. Many practitioners default to blaming 
Parquet for performance or feature limitations, such as the lack of HLL 
sketches. This often leads to unfair comparisons with proprietary formats, 
which are fine-tuned and cherry-picked.
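
A small worked example of the estimation problem the paper describes. With only the row count, minimum and maximum, an optimizer typically has to assume values are spread evenly between the bounds. The sketch below compares that estimate with one from a hypothetical embedded histogram, for the predicate `x < upper`. The function names and bucket layout are assumptions for illustration.

```python
from typing import List


def estimate_uniform(row_count: int, lo: float, hi: float, upper: float) -> float:
    """Estimate rows with value < upper from row count and min/max only."""
    if hi <= lo:
        # Single-valued column: either every row matches or none does.
        return float(row_count) if lo < upper else 0.0
    clamped = min(max(upper, lo), hi)
    return row_count * (clamped - lo) / (hi - lo)


def estimate_histogram(edges: List[float], counts: List[int], upper: float) -> float:
    """Estimate rows with value < upper from bucket counts.

    Bucket i covers [edges[i], edges[i + 1]); values are assumed to be
    spread evenly within a bucket.
    """
    total = 0.0
    for i, count in enumerate(counts):
        b_lo, b_hi = edges[i], edges[i + 1]
        if upper >= b_hi:
            total += count  # the whole bucket lies below the bound
        elif upper > b_lo:
            total += count * (upper - b_lo) / (b_hi - b_lo)
    return total


# Skewed data: 900 rows in 0..9 and 100 rows equal to 100,
# so min = 0, max = 100 and the row count is 1000.
values = [i % 10 for i in range(900)] + [100] * 100
actual = sum(1 for v in values if v < 10)
uniform = estimate_uniform(len(values), min(values), max(values), 10)
histogram = estimate_histogram([0, 10, 50, 101], [900, 0, 100], 10)
```

On this skewed data the min/max estimate is off by a factor of nine, while three bucket counts are enough to recover the true cardinality.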
   
   





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-13 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2969754548

   I am also preparing to cook up an advanced_embedding_indexes example later, 
after the simple one is merged.





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-13 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2969605248

   Thank you @alamb @adriangb, I have submitted a simple example PR for review; 
I can add more examples as a follow-up:
   
   https://github.com/apache/datafusion/pull/16395





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-12 Thread via GitHub


adriangb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2967987004

   Very excited about this!





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-12 Thread via GitHub


alamb commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2966448392

   Nice @zhuqi-lucas  -- BTW I am not sure how easy it will be to use the 
parquet APIs to do this (specifically write arbitrary bytes to the inner 
writer) so it may take some fiddling / using the lower level API / adding a new 
API
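
To show the shape of the problem, here is a toy sketch of the layout being discussed: data first, then arbitrary index bytes, then a footer whose key-value metadata records where the index starts. This is deliberately not the Parquet format and does not use the parquet crate's API. The magic values, the `my_index_offset` key, the JSON footer, and both function names are assumptions for illustration only.

```python
import io
import json
import struct

INDEX_MAGIC = b"IDX1"  # hypothetical marker in front of the index bytes
END_MAGIC = b"TOY1"    # toy trailer, intentionally not Parquet's "PAR1"


def write_file(data_pages: bytes, index_payload: bytes) -> bytes:
    buf = io.BytesIO()
    buf.write(data_pages)

    # Remember where the custom bytes start before writing them.
    index_offset = buf.tell()
    buf.write(INDEX_MAGIC)
    buf.write(struct.pack("<Q", len(index_payload)))
    buf.write(index_payload)

    # The footer carries the offset in its key-value metadata.
    footer = json.dumps(
        {"key_value_metadata": {"my_index_offset": str(index_offset)}}
    ).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))
    buf.write(END_MAGIC)
    return buf.getvalue()


def read_index(file_bytes: bytes) -> bytes:
    if file_bytes[-4:] != END_MAGIC:
        raise ValueError("not a toy file")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_start = len(file_bytes) - 8 - footer_len
    footer = json.loads(file_bytes[footer_start:footer_start + footer_len])

    offset = int(footer["key_value_metadata"]["my_index_offset"])
    if file_bytes[offset:offset + 4] != INDEX_MAGIC:
        raise ValueError("index magic not found at recorded offset")
    (length,) = struct.unpack_from("<Q", file_bytes, offset + 4)
    start = offset + 12  # 4 bytes of magic plus 8 bytes of length
    return file_bytes[start:start + length]
```

The writer must know the current byte position so it can record the offset, which is why access to the inner writer (or an API that exposes the position) matters. A reader that only follows offsets from the footer never touches the extra bytes.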





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-12 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2966416266

   take





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-12 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2966419013

   I am interested in this, and I want to become familiar with embedding indexes.





Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-06-12 Thread via GitHub


zhuqi-lucas commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2966460212

   Thank you @alamb, I will investigate the APIs and see what’s possible.

