Hey folks,

we have found that DuckDB produces "too long" bitpacks at the end of an
RLE-encoded page. That is, it creates a bitpack of length 256 even though
there are just 14 entries in the page. See here for details (GitHub login
required): https://github.com/duckdb/duckdb/discussions/22089

Our engine currently rejects reading this and flags it as corrupted. But
the question is now: Who's in the right here? Is this legal? E.g., is it
okay for a page with 5 entries to have an RLE bitpack with 256 entries?

We read the spec multiple times, but it seems to be underspecified. It is
somewhat implicitly clear that you are at least allowed to pad the run at
the end of a page to the next multiple of 8, as a bitpack run's length
*can* only be a multiple of 8, so a page with 5 entries has no way of
encoding these in a bitpack of the correct size (but that is also implicit
and not mentioned). But is it okay to go larger than this? E.g., 256 as in
the case of DuckDB.
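
To make the multiple-of-8 constraint concrete, here is a minimal Python
sketch of how we read the header rule for bit-packed runs in the
RLE/bit-packing hybrid encoding: the header is the ULEB128 varint of
(number of 8-value groups << 1) | 1, so only whole groups of 8 can be
expressed. The function names are our own and purely illustrative; they are
not taken from any Parquet implementation.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def bit_packed_header(num_values):
    """Return (header bytes, number of values the run actually encodes).

    Assumption: the writer uses the smallest run that can hold num_values,
    i.e. it rounds up to the next multiple of 8.
    """
    groups = (num_values + 7) // 8          # groups of 8 values, rounded up
    header = encode_uleb128((groups << 1) | 1)  # low bit 1 marks a bit-packed run
    return header, groups * 8


if __name__ == "__main__":
    for n in (5, 14, 256):
        header, encoded = bit_packed_header(n)
        print(n, header.hex(), encoded)
```

Under this reading, 5 values need a run of 8 (header byte 0x03) and 14
values need a run of 16 (header byte 0x05), while a run of 256 corresponds
to 32 groups (header byte 0x41).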

The spec mandates that pages write their contents (R-levels, D-levels,
Data) one after another without padding. But nothing says whether the
number of encoded entries may be larger than the actual number of entries
in the page or whether runs may be larger than required at the end of a
page.
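
As a rough illustration of what a reader would have to decide, the
following Python sketch walks the run headers of a hybrid-encoded buffer,
counts how many values the runs claim to encode, and reports the surplus
over the page's value count. It assumes the buffer starts directly at the
first run header (any length prefix already stripped) and does no bounds
checking. The function names are hypothetical, and whether a non-zero
surplus should be accepted is exactly the open question, so the sketch
only measures it.

```python
def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 varint at buf[pos]; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def count_encoded_values(buf, bit_width):
    """Sum the run lengths announced by the run headers in buf."""
    pos = 0
    total = 0
    rle_value_bytes = (bit_width + 7) // 8  # repeated value is byte-aligned
    while pos < len(buf):
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: header >> 1 is the number of 8-value groups,
            # and each group occupies exactly bit_width bytes.
            groups = header >> 1
            total += groups * 8
            pos += groups * bit_width
        else:
            # RLE run: header >> 1 is the repeat count.
            total += header >> 1
            pos += rle_value_bytes
    return total


def surplus_values(buf, bit_width, num_values):
    """Encoded values beyond what the page header declares (negative if short)."""
    return count_encoded_values(buf, bit_width) - num_values


if __name__ == "__main__":
    # Synthetic buffer: one bit-packed run of 32 groups at bit width 1.
    oversized = bytes([0x41]) + bytes(32)
    print(surplus_values(oversized, 1, 14))
```

For the synthetic buffer above (a single 256-value bitpack in a page
declared to hold 14 values), the surplus is 242, whereas padding only to
the next multiple of 8 would leave a surplus of 2.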

There are some sentences on the delta encoding, but no similar sentences
for RLE:

> If there are not enough values to fill the last miniblock, we pad the
> miniblock so that its length is always the number of values in a full
> miniblock multiplied by the bit width. The values of the padding bits
> should be zero, but readers must accept paddings consisting of arbitrary
> bits as well.


So, what's the spec-lawyer answer here? And depending on what that answer
is, should we update the Parquet spec to clarify this case? It seems to be
truly underspecified.

Cheers,
Jan
