>
> I definitely think this quirk of the encoding should be called out more
> clearly in the specification. With regards to writing additional
> padding, perhaps we could appeal to the robustness principle and
> discourage writers from producing unnecessary padding, but state that
> readers should be tolerant of it given there are writers in the wild
> producing such data?

+1
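
To make the suggestion concrete, here is a minimal Python sketch of a tolerant reader. It is not taken from arrow-rs, DuckDB, or any other implementation, and all function names are hypothetical. It assumes the hybrid layout from the Parquet spec (a ULEB128 header whose low bit selects bit-packed vs. RLE, with bit-packed runs counted in groups of 8 values) and assumes the true value count comes from the page header. Surplus values at the end of the last run are treated as padding and dropped.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def write_uleb128(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_bit_packed_run(values, bit_width, num_groups=None):
    """Encode values as one bit-packed run, zero-padded to num_groups * 8.

    By default the run is padded only to the next multiple of 8. Passing a
    larger num_groups mimics a writer that emits extra padding.
    """
    min_groups = (len(values) + 7) // 8
    if num_groups is None:
        num_groups = min_groups
    if num_groups < min_groups:
        raise ValueError("num_groups too small for the given values")
    padded = list(values) + [0] * (num_groups * 8 - len(values))
    bits = 0
    for i, v in enumerate(padded):
        bits |= v << (i * bit_width)  # values are packed LSB first
    header = write_uleb128((num_groups << 1) | 1)
    return header + bits.to_bytes(num_groups * bit_width, "little")


def decode_hybrid(buf, bit_width, num_values):
    """Decode num_values values, ignoring padding at the end of the last run."""
    out = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(out) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: the header stores the run length divided by 8.
            num_groups = header >> 1
            nbytes = num_groups * bit_width
            if pos + nbytes > len(buf):
                raise ValueError("bit-packed run extends past end of buffer")
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(num_groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: repeated value stored in ceil(bit_width / 8) bytes.
            run_len = header >> 1
            nbytes = (bit_width + 7) // 8
            if pos + nbytes > len(buf):
                raise ValueError("RLE value extends past end of buffer")
            value = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            out.extend([value] * run_len)
    # Anything beyond num_values is padding; a strict reader would reject here.
    return out[:num_values]
```

With this sketch, a page of 14 values written as a 256-entry bitpack (32 groups) decodes to the same 14 values as one padded only to 16 entries. A strict reader would instead raise an error where the final truncation happens.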

On Wed, Apr 15, 2026 at 9:09 AM Raphael Taylor-Davies <
[email protected]> wrote:

> Hi,
>
> We ran into a related issue in arrow-rs when implementing late
> materialization [1]. As you point out, the encoding stores the length of
> a run divided by 8, so decoders already need additional information from
> the page metadata to correctly interpret RLE data. Additional padding
> therefore doesn't introduce any ambiguity beyond this.
>
> I definitely think this quirk of the encoding should be called out more
> clearly in the specification. With regards to writing additional
> padding, perhaps we could appeal to the robustness principle and
> discourage writers from producing unnecessary padding, but state that
> readers should be tolerant of it given there are writers in the wild
> producing such data?
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow-rs/pull/4319
>
> On 15/04/2026 15:53, Jan Finis wrote:
> > Hey folks,
> >
> > we have found that DuckDB produces "too long" bitpacks at the end of an
> > RLE encoded page. That is, it creates a bitpack of length 256 even though
> > there are just 14 entries in the page. See here for details (github login
> > required): https://github.com/duckdb/duckdb/discussions/22089
> >
> > Our engine currently rejects reading this and flags it as corrupted. But
> > the question is now: Who's in the right here? Is this legal? E.g., is it
> > okay for a page with 5 entries to have an RLE bitpack with 256 entries?
> >
> > We read the spec multiple times but it seems that it's underspecified. It
> > is somewhat implicitly clear that you are at least allowed to pad the run
> > at the end of a page to the next multiple of 8, as a bitpack run's length
> > *can* only be a multiple of 8, so a page with 5 entries has no way of
> > encoding these in a bitpack of the correct size (but that is also
> > implicit and not mentioned). But is it okay to go larger than this? E.g.,
> > 256 as in the case of DuckDB.
> >
> > The spec mandates that pages write their contents (R-levels, D-levels,
> > Data) behind each other without padding. But nothing says whether the
> > number of encoded entries may be larger than the actual number of entries
> > in the page or whether runs may be larger than required at the end of a
> > page.
> >
> > There are some sentences on the delta encoding, but no similar sentences
> > for RLE:
> >
> >> If there are not enough values to fill the last miniblock, we pad the
> >> miniblock so that its length is always the number of values in a full
> >> miniblock multiplied by the bit width. The values of the padding bits
> >> should be zero, but readers must accept paddings consisting of arbitrary
> >> bits as well.
> >
> > So, what's the spec-lawyer answer here? And depending on what that answer
> > is, should we update the Parquet spec to clarify this case? It seems to be
> > truly underspecified.
> >
> > Cheers,
> > Jan
> >
>
