Hi, This looks very interesting and similar to a pattern we use in our custom file format (that we are gradually moving to parquet).
We use a skip list in a random accessible footer but this works by referencing column name (or name prefix) not IDs. The reason is we can have very wide files (10^6 to 10^9 columns) where a hash-table mapping would add too much overhead. https://www.datadoghq.com/blog/engineering/husky-storage-compaction/#the-fragment-file-storage-format I understand we dont want to generally waste space by mandating column name (even truncated) in footer, but would it make sense to support a custom metadata map in each skip list node to not block this kind of optimisation ? Thanks On 2026/03/27 10:32:44 Will Edwards via dev wrote: > Thx for the interest :D > > I ran some quick micro-benchmarks on my Java footer parser to see if the > idea might work. Sorry I can't share code, but I can share recipes etc. > > This is running on a weak cloud VM. The test files were created using an > obvious Python pyarrow script. The test runs a few hundred iterations for > warmup then runs a few thousand iterations of each test to get a timing. > > The *parquet-format* column is the > standard org.apache.parquet.format.Util.readFileMetaData(new > ByteArrayInputStream(footerBytes)). This excludes IO (the Hadoop libraries > are so slow, that's actually the number one thing to avoid). > > The *full* column shows my footer parser performing a full footer decode. > It spends about 5-10% of the time in the schema making the name tree and so > on; the rest of the time is spent on column metadata. It creates a long[] > of num_row_groups * num_columns * num_interesting_fields to decode into. > Strings/bytes are stored in the long[] as offsets into the Thrift footer. > This is about 8x+ faster than the standard library, and more importantly > for my use-cases it generates significantly less garbage. > > The *by ordinal* column shows my footer parser skipping all columns except > one. The skipping is barely faster than the full decode (you can't early > out after the interesting column in the last row-group if you are still > waiting for metadata etc) but, importantly for my use-cases, it uses > significantly less memory. > > The *single col* column is O(1) random access using a jump table into the > Thrift footer. This is the speed of the proposal. > > | Cols | RGs | Footer | parquet-format | full | by ordinal | single col | > |------|-----|--------|---------------|------|-----------|------------| > | 100 | 1 | 19 KB | 0.27 | 0.040 | 0.017 | 0.001 | > | 100 | 2 | 30 KB | 0.35 | 0.040 | 0.031 | 0.001 | > | 100 | 4 | 52 KB | 0.57 | 0.077 | 0.061 | 0.001 | > | 1K | 1 | 195 KB | 2.58 | 0.21 | 0.17 | 0.000 | > | 1K | 2 | 304 KB | 2.88 | 0.35 | 0.32 | 0.001 | > | 1K | 4 | 528 KB | 5.39 | 0.69 | 0.63 | 0.001 | > | 10K | 1 | 2.0 MB | 16.4 | 1.89 | 1.77 | 0.000 | > | 10K | 2 | 3.1 MB | 28.9 | 3.58 | 3.25 | 0.001 | > | 10K | 4 | 5.4 MB | 54.2 | 6.99 | 6.33 | 0.001 | > | 50K | 1 | 10.3 MB | 85.1 | 9.94 | 9.54 | 0.000 | > | 50K | 2 | 15.9 MB | 148.9 | 18.8 | 17.6 | 0.001 | > | 50K | 4 | 27.4 MB | 280.7 | 36.7 | 34.1 | 0.001 | > | 100K | 1 | 20.6 MB | 174.1 | 20.0 | 19.2 | 0.000 | > | 100K | 2 | 31.9 MB | 305.7 | 37.8 | 35.3 | 0.001 | > | 100K | 4 | 55.1 MB | 565.3 | 73.6 | 68.5 | 0.002 | > > Performance is measured in milliseconds. My fast footer parser is very > consistent, but the parquet-format has higher variability because of > garbage collections, etc. > > The O(1) access is very stable: approximately 100 ns to create the arrays > etc for the result and then approximately 270 ns per decoded ColumnChunk. > > The jump table extension might work like this: we have a normal Thrift > footer, and immediately after the normal version field, we add a new field > that signals the presence of a footer index. This is a binary field with a > new high field-id. It doesn't have to contain the index; that can be > sandwiched between the end of the Thrift footer and the end of the file. > It contains an uncompressed long that is the offset to the footer index's > start. Because the field is not a varint, a writer can patch it after > writing the footer etc. Readers who don't know about the index will just > skip it. > > The footer index has a little header indicating its contents: whether it > includes a name hash table, a column ordinals jump table etc. These are > all flat, fixed-size arrays. There is also lists the positions for any > fields not in the index e.g. the position of the metadata and items the > reader may want to access by skipping the parts that the index allows them > to. > > Would love to know how Java performance compares to rust etc :D > > Yours hopefully, > Will > > On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote: > > > Thank you Will, > > > > I think this idea is very interesting. > > > > > it’s entirely possible this index-only approach was evaluated and > > discarded early on. > > > > I don't know of any such discussion or evaluation. One challenge of this > > approach might be getting thrift-compiler generated parsers to cooperate. > > > > > would love to hear your thoughts on whether a lightweight index could > > give us the O(1) read performance we want without the file bloat we are > > trying to avoid. > > > > I personally think it sounds like a great idea -- and I suggest as a next > > step a prototype in one of the open source parquet readers so we can > > measure its performance. The arrow-rs implementation (now) has its own > > custom thrift decoder, so it might be relatively easy to test out such a > > scheme. > > > > Andrew > > >
