zhuqi-lucas commented on PR #16395:
URL: https://github.com/apache/datafusion/pull/16395#issuecomment-2993484604
> wow this is so cool!
>
> I have a question (and I think it's worth adding to the comment for people
> like me that's not familiar with parquet internals): How does it ensure that
> this extra index can be safely ignored by other readers? If another parquet
> reader implementation decides to do a sequential whole file scan, will it read
> into the extra custom data?
Thank you for the review and great question, @2010YOUY01!
**Short answer:**
Because we append our custom index *before* the Parquet footer and never
modify the existing metadata schema, Parquet readers will still:
1. Seek to the **end of file** and read the last 8 bytes, which consist of:
   - A 4‑byte little‑endian footer length
   - The magic marker `PAR1`
2. Jump back by that length to parse the Thrift‑encoded footer (and its
   key‑value list).
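Any bytes we write *before* the footer (i.e. after the data pages but
before the footer and trailing magic) are simply skipped by steps (1) and (2).
Readers do not scan forward from the start of the file; they locate the footer
via the trailing length and magic, then read row groups and column chunks at
the offsets recorded in the footer. Nothing in the footer points at our blob,
so it is never read.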
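To make steps (1) and (2) concrete, here is a minimal sketch of how a reader could locate the footer using only the 8-byte trailer. The function name `locate_footer` and the error messages are illustrative assumptions, not the API of the `parquet` crate, and the footer itself is treated as opaque bytes rather than decoded Thrift.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Trailer layout: 4-byte little-endian footer length, then the 4-byte magic.
const TRAILER_LEN: u64 = 8;
const MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper: returns `(footer_start, footer_len)` in bytes.
/// It only ever looks at the last 8 bytes of the stream.
fn locate_footer<R: Read + Seek>(reader: &mut R) -> io::Result<(u64, u64)> {
    let file_len = reader.seek(SeekFrom::End(0))?;
    if file_len < TRAILER_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too short"));
    }

    // Step 1: read the trailer at the very end of the file.
    reader.seek(SeekFrom::Start(file_len - TRAILER_LEN))?;
    let mut trailer = [0u8; 8];
    reader.read_exact(&mut trailer)?;
    if trailer[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    let footer_len =
        u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]) as u64;
    if footer_len > file_len - TRAILER_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "footer length too large"));
    }

    // Step 2: jump back by the footer length; bytes before this point
    // (data pages and any custom blob) are not touched here.
    Ok((file_len - TRAILER_LEN - footer_len, footer_len))
}
```

With a layout such as `PAR1 | data pages | custom blob | footer | footer_len | PAR1`, the returned start offset lands exactly on the first footer byte, regardless of how large the blob is.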
**Why key/value metadata is safe:**
- We only **add** two new keys (`distinct_index_offset` and
`distinct_index_length`) into the existing footer metadata map.
- All standard readers will see unknown keys and either ignore them or
surface them as “extra metadata,” but they will not attempt to deserialize our
custom binary blob.
- On our side, we:
  1. Read the Parquet footer as usual.
  2. Extract our two key/value entries for offset+length.
  3. `seek(offset)` + `read_exact(length)` to load the custom index and
     deserialize it.
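The reader-side steps above can be sketched roughly as follows. This assumes the footer's key/value metadata has already been parsed into a `HashMap<String, String>` and that the offset and length are stored as decimal strings; both are assumptions made for illustration, and `read_custom_index` is a hypothetical helper rather than code from the PR.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: loads the raw index bytes, or returns `Ok(None)`
/// when the file carries no custom index (e.g. written by another tool).
fn read_custom_index<R: Read + Seek>(
    reader: &mut R,
    kv: &HashMap<String, String>,
) -> io::Result<Option<Vec<u8>>> {
    // Both keys must be present; otherwise treat the file as plain Parquet.
    let (off, len) = match (
        kv.get("distinct_index_offset"),
        kv.get("distinct_index_length"),
    ) {
        (Some(off), Some(len)) => (off, len),
        _ => return Ok(None),
    };

    // Assumption: values are decimal strings.
    let offset: u64 = off
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    let length: usize = len
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    // seek(offset) + read_exact(length), as described in step 3.
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    reader.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```

Returning `Ok(None)` when the keys are absent keeps the reader usable on ordinary Parquet files; deserializing the returned bytes is left to whatever format the index uses.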
Because every compliant Parquet reader must interpret the `PAR1` magic and
footer length, none of them will ever “spill over” into our blob or treat it as
data pages.
I’ll add these details to the code comments. We’re also planning a blog
post on Parquet indexing internals, as suggested by @alamb. Thanks!