This was very interesting, Andrew, thanks for sharing. We've done something
quite similar at G-Research in the past, but embedded the index directly in
the key-value metadata. That has the advantage of not needing an extra IO
operation to read the index after you've read the footer, and it was simple
to implement, but the index needs to be stored as a UTF-8 string, so it will
usually be less compact than a binary representation and have more
deserialization overhead.
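
To illustrate that trade-off, here is a minimal sketch using only the Python
standard library. It is not the actual G-Research implementation: the index
layout (one int64 min/max pair per row group), the base64 encoding, and the
key name "example.row_group_index" are all assumptions made for the example,
and the key-value metadata is modelled as a plain dict of strings rather
than a real Parquet footer.

```python
import base64
import struct

# Hypothetical index: one (min, max) pair of int64 values per row group.
index = [(0, 999), (1000, 1999), (2000, 2999)]


def encode_binary(entries):
    """Pack each entry as two little-endian int64 values (16 bytes per entry)."""
    return b"".join(struct.pack("<qq", lo, hi) for lo, hi in entries)


def decode_binary(blob):
    """Inverse of encode_binary: unpack 16-byte records into (min, max) tuples."""
    return list(struct.iter_unpack("<qq", blob))


def to_metadata_value(entries):
    """Base64-encode the binary index so it can be stored as a UTF-8 string."""
    return base64.b64encode(encode_binary(entries)).decode("ascii")


def from_metadata_value(value):
    """Extra deserialization step: base64-decode before unpacking the binary."""
    return decode_binary(base64.b64decode(value))


# Stand-in for the footer's key-value metadata (string keys, string values).
# The key name is made up for this example.
metadata = {"example.row_group_index": to_metadata_value(index)}

restored = from_metadata_value(metadata["example.row_group_index"])
binary_size = len(encode_binary(index))
string_size = len(metadata["example.row_group_index"].encode("utf-8"))

print(f"binary: {binary_size} bytes, as UTF-8 string: {string_size} bytes")
```

With base64 the string form is about a third larger than the raw bytes, and
reading it back needs a decode pass before the index can be unpacked. Other
string encodings (JSON, for example) would have different overheads.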

Cheers,
Adam


On Wed, 16 Jul 2025 at 23:16, wish maple <maplewish...@gmail.com> wrote:

> Seems good. Personally, I think:
>
> 1. The Parquet file format seems to have an index page [1], but I don't
>    know who is using it.
> 2. Currently, Parquet only has single-column bloom filters and the column
>    index. Maybe some kind of multi-column or other filter might work.
> 3. An index can have different "levels": the Page Index is designed for
>    a "Page", and bloom filters / statistics for a RowGroup. We can even
>    define an index for a "file".
>
> Currently, I don't know whether we can have some "official" sample index.
> Personally, I might be interested in some "sketches".
>
> Best,
> Xuwei Fu
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L655
>
> On Wed, 16 Jul 2025 at 19:08, Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > I wrote a blog with Qi Zhu and Jigao Luo explaining how to embed
> > user-defined indexes into Parquet files without needing any changes to
> > the format[1].
> >
> > I am sorry for the somewhat shameless self promotion, but I think this
> > topic may be of general interest to the community in the context of other
> > extensions to the format we have discussed recently. Techniques such as
> > this widen potential use cases of Parquet without any need for consensus
> > or timeline for ecosystem adoption.
> >
> > Andrew
> >
> > [1]:
> > https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
> >
>
