Hi,

We ran into a related issue in arrow-rs when implementing late materialization [1]. As you point out, the encoding stores the length of a bit-packed run divided by 8, so decoders already need additional information from the page metadata to interpret RLE data correctly. Additional padding doesn't introduce any ambiguity beyond this.

I definitely think this quirk of the encoding should be called out more clearly in the specification. With regard to writing additional padding, perhaps we could appeal to the robustness principle: discourage writers from producing unnecessary padding, but state that readers should tolerate it, given that there are writers in the wild producing such data?
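
As an illustration of what a tolerant reader could look like, here is a minimal Python sketch of a decoder for the RLE/bit-packed hybrid encoding. It is not the arrow-rs implementation; the names `read_uleb128` and `decode_hybrid` are made up for this example, and it assumes the caller has already stripped any length prefix and knows `num_values` from the page header. The decoder reads whole runs and then truncates to `num_values`, so a trailing bit-packed run of any padded size is accepted.

```python
def read_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_hybrid(buf: bytes, bit_width: int, num_values: int) -> list[int]:
    """Decode RLE/bit-packed hybrid data, ignoring padding past num_values."""
    out: list[int] = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(out) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: the header stores the number of 8-value groups.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: a repeat count followed by one value.
            run_len = header >> 1
            width_bytes = (bit_width + 7) // 8
            value = int.from_bytes(buf[pos:pos + width_bytes], "little")
            pos += width_bytes
            out.extend([value] * run_len)
    # Tolerate padding: keep only the values the page header promised.
    return out[:num_values]


values = [1, 2, 3, 4, 5]
packed = sum(v << (i * 3) for i, v in enumerate(values))
minimal = bytes([3]) + packed.to_bytes(3, "little")      # 1 group = 8 slots
oversized = bytes([65]) + packed.to_bytes(96, "little")  # 32 groups = 256 slots
print(decode_hybrid(minimal, 3, 5))
print(decode_hybrid(oversized, 3, 5))
```

Both inputs decode to the same five values; the only difference is how many padding slots the reader discards. A stricter reader could additionally reject a run that starts at or after `num_values`, which this sketch never reaches.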

Kind Regards,

Raphael Taylor-Davies

[1]: https://github.com/apache/arrow-rs/pull/4319

On 15/04/2026 15:53, Jan Finis wrote:
Hey folks,

we have found that DuckDB produces "too long" bitpacks at the end of an RLE
encoded page. That is, it creates a bitpack of length 256 even though there
are just 14 entries in the page. See here for details (github login
required): https://github.com/duckdb/duckdb/discussions/22089

Our engine currently rejects reading this and flags it as corrupted. But
the question is now: Who's in the right here? Is this legal? E.g., is it
okay for a page with 5 entries to have an RLE bitpack with 256 entries?

We read the spec multiple times, but it seems to be underspecified. It
is somewhat implicitly clear that you are at least allowed to pad the run
at the end of a page to the next multiple of 8, as the length of a bitpack
run *can* only be a multiple of 8, so a page with 5 entries has no way of
encoding these in a bitpack of the correct size (but that is also implicit
and not mentioned). But is it okay to go larger than this? E.g., 256 as in
the case of DuckDB.
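
To make the multiple-of-8 constraint concrete, here is a small Python sketch of the header arithmetic for a bit-packed run. The function name `bit_packed_header` is hypothetical, and the returned header is the integer value before it is written as a varint.

```python
def bit_packed_header(num_values: int) -> tuple[int, int]:
    """Return (header value, padded value count) for a bit-packed run."""
    groups = (num_values + 7) // 8  # length is stored as a count of 8-value groups
    header = (groups << 1) | 1      # low bit 1 marks a bit-packed run
    return header, groups * 8


print(bit_packed_header(5))    # (3, 8)
print(bit_packed_header(14))   # (5, 16)
print(bit_packed_header(256))  # (65, 256)
```

A page with 5 entries cannot be described by anything smaller than an 8-slot run, and a page with 14 entries needs at least 16 slots. A header of 65 describes 256 slots, which is the larger padding in question.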

The spec mandates that pages write their contents (R-levels, D-levels,
Data) behind each other without padding. But nothing says whether the
number of encoded entries may be larger than the actual number of entries
in the page or whether runs may be larger than required at the end of a
page.

There are some sentences on the delta encoding, but no similar sentences
for RLE:

If there are not enough values to fill the last miniblock, we pad the
miniblock so that its length is always the number of values in a full
miniblock multiplied by the bit width. The values of the padding bits
should be zero, but readers must accept paddings consisting of arbitrary
bits as well.

So, what's the spec-lawyer answer here? And depending on what that answer
is, should we update the Parquet spec to clarify this case? It seems to be
truly underspecified.

Cheers,
Jan
