Re: [I] Add an example of embedding indexes inside a parquet file [datafusion]

via GitHub Sat, 05 Jul 2025 08:43:06 -0700


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039204719


   Hi @zhuqi-lucas,
   
   While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index?**
   - Is it limited to just one embedded index per file?
   - Is it only possible to have a file-level index? (From the example, it seems like the hashset index is only applied at the file level.)
   
   I imagine other blog readers might have similar questions about the limitations, or the potential, of this embedded_index approach.
   
   If there are no strict limitations, then my follow-up question is: Could we potentially **supercharge** Parquet with techniques inspired by proprietary file formats? For example:
   - A true HyperLogLog
   - Small materialized aggregates, such as precomputed sums at the column chunk or data page level. (For example, with ClickBench Q3, a global AVG needs only the metadata once the precomputed sum and the total row count are available.)
   - Even histograms or hashsets at the row group level (which would be a much more powerful version of min-max indexing for pruning)
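
To make the last two ideas concrete, here is a minimal Python sketch. It is not tied to any Parquet or DataFusion API: `RowGroupSummary`, `global_avg`, and `row_groups_to_scan` are hypothetical names, and the summaries are assumed to have been read from some embedded index in the file. It only illustrates how a global AVG could be answered from precomputed sums and counts, and how a per-row-group hashset could prune more precisely than min-max.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupSummary:
    """Hypothetical per-row-group summary for one column (not a real Parquet structure)."""
    non_null_count: int          # rows with a non-NULL value in this row group
    column_sum: float            # precomputed SUM over those rows
    distinct_values: frozenset   # hashset of the values present in this row group


def global_avg(summaries):
    """Answer AVG(column) from the summaries alone, without reading any data pages."""
    total_count = sum(s.non_null_count for s in summaries)
    if total_count == 0:
        return None  # AVG over zero non-NULL rows is NULL
    return sum(s.column_sum for s in summaries) / total_count


def row_groups_to_scan(summaries, wanted):
    """Return indexes of row groups that can contain `wanted` (for `column = wanted`)."""
    return [i for i, s in enumerate(summaries) if wanted in s.distinct_values]


# Two illustrative row groups with made-up values.
summaries = [
    RowGroupSummary(non_null_count=3, column_sum=30.0,
                    distinct_values=frozenset({5, 10, 15})),
    RowGroupSummary(non_null_count=2, column_sum=50.0,
                    distinct_values=frozenset({20, 30})),
]

avg = global_avg(summaries)                   # (30 + 50) / (3 + 2)
hit = row_groups_to_scan(summaries, 20)       # only the second row group
miss = row_groups_to_scan(summaries, 12)      # no row group contains 12
```

Note that the value 12 falls inside the min-max range of the first row group (5 to 15), so min-max statistics alone could not prune it, while the hashset can skip every row group.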


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

