This was very interesting, Andrew, thanks for sharing. We've done something quite similar at G-Research in the past, but embedded the index directly in the key-value metadata. That has the advantage of not needing an extra IO operation to read the index after you've read the footer, and it was simple to implement. The downside is that the index needs to be stored as a UTF-8 string, so it will usually be less compact than a binary representation and have more deserialization overhead.
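A minimal sketch of that size trade-off, using only the Python standard library. The index layout (per-row-group min/max of an int64 column), the function names, and the choice of base64 as the string encoding are all assumptions for illustration; the encoding actually used at G-Research is not stated above.

```python
import base64
import struct

# Hypothetical index: (min, max) of an int64 column for each row group.
row_group_ranges = [(0, 999), (1000, 1999), (2000, 2999)]


def encode_binary(ranges):
    """Pack a uint32 count followed by little-endian int64 (min, max) pairs."""
    out = struct.pack("<I", len(ranges))
    for lo, hi in ranges:
        out += struct.pack("<qq", lo, hi)
    return out


def to_metadata_value(ranges):
    """Base64-encode the binary index so it can be stored as a string value."""
    return base64.b64encode(encode_binary(ranges)).decode("ascii")


def from_metadata_value(value):
    """Decode the string value back into a list of (min, max) tuples."""
    raw = base64.b64decode(value)
    (count,) = struct.unpack_from("<I", raw, 0)
    # Each pair occupies 16 bytes after the 4-byte count header.
    return [struct.unpack_from("<qq", raw, 4 + 16 * i) for i in range(count)]


binary = encode_binary(row_group_ranges)
as_string = to_metadata_value(row_group_ranges)

print(len(binary), "bytes as binary")
print(len(as_string), "bytes as a base64 string")
```

The string form is what would go into a key-value metadata entry. Base64 costs roughly a third more space than the raw bytes, plus a decode pass on every read, which is the overhead described above.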
Cheers,
Adam

On Wed, 16 Jul 2025 at 23:16, wish maple <maplewish...@gmail.com> wrote:

> Seems good. Personally I think
>
> 1. Parquet file format seems to have an index page [1], but I don't know
>    who's using it.
> 2. Currently, Parquet only has single-column bloom filters and column
>    indexes. Maybe some kind of multi-column or other filter might work.
> 3. Indexes can have different "levels": the Page Index is designed for
>    "Page", and bloom filters / statistics for RowGroup. We can even define
>    an index for "file".
>
> Currently I don't know whether we can have some "official" sample index.
> Personally I might be interested in some "sketches".
>
> Best,
> Xuwei Fu
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655
>
> Andrew Lamb <andrewlam...@gmail.com> wrote on Wed, 16 Jul 2025 at 19:08:
>
> > I wrote a blog with Qi Zhu, Jigao Luo explaining how to embed user
> > defined indexes into Parquet files without needing any changes to the
> > format[1].
> >
> > I am sorry for the somewhat shameless self promotion, but I think this
> > topic may be of general interest to the community in the context of other
> > extensions to the format we have discussed recently. Techniques such as
> > this widen potential use cases of Parquet without any need for consensus
> > or timeline for ecosystem adoption.
> >
> > Andrew
> >
> > [1]:
> > https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/