JigaoLuo commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039204719
Hi @zhuqi-lucas,

While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index?**

- Is it limited to just one embedded index per file?
- Is it only possible to have a file-level index? (From the example, it seems like the hashset index is only applied at the file level.)

I imagine other blog readers might have similar questions about the limitations, or the potential, of this embedded_index approach.

If there are no strict limitations, then my follow-up question is: could we potentially **supercharge** Parquet with techniques inspired by proprietary file formats? For example:

- A true HyperLogLog
- Small materialized aggregates, such as precomputed sums at the column chunk or data page level. For example, with ClickBench Q3, a global AVG needs only the metadata once the precomputed sum and the total row count are there.
- Even histograms or hashsets at the row group level, which would be a much more powerful version of min-max indexing for pruning
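
To make the last two ideas concrete, here is a minimal Python sketch of how a reader could answer a global AVG and prune row groups from metadata alone. Everything in it is an assumption for illustration: `RowGroupStats`, `global_avg`, and `row_groups_to_scan` are hypothetical names, not part of Parquet or DataFusion, and the statistics are hand-written rather than read from a real file.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group metadata (not a real Parquet structure)."""
    non_null_count: int                        # non-null values in the column
    column_sum: int                            # precomputed SUM over the column
    distinct_values: frozenset = frozenset()   # hash set of the values present


def global_avg(stats):
    """Answer AVG(col) from metadata only; returns None for empty input."""
    total = sum(s.non_null_count for s in stats)
    if total == 0:
        return None
    return sum(s.column_sum for s in stats) / total


def row_groups_to_scan(stats, wanted):
    """Indices of row groups whose hash set contains `wanted` (col = wanted)."""
    return [i for i, s in enumerate(stats) if wanted in s.distinct_values]


# Hand-written example statistics for three row groups.
stats = [
    RowGroupStats(3, 30, frozenset({5, 10, 15})),
    RowGroupStats(2, 50, frozenset({20, 30})),
    RowGroupStats(5, 20, frozenset({1, 2, 3, 4, 10})),
]

print(global_avg(stats))             # no data pages are read
print(row_groups_to_scan(stats, 7))  # hash-set pruning for col = 7
```

With these example statistics, a lookup for the value 7 falls inside the min-max range of the first row group (5 to 15) and the third (1 to 10), so min-max pruning would scan both, while the hash sets rule out every row group. The cost is metadata size: an exact set grows with the number of distinct values, which is where approximate structures such as Bloom filters or HyperLogLog sketches become attractive.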