> I definitely think this quirk of the encoding should be called out more
> clearly in the specification. With regards to writing additional
> padding, perhaps we could appeal to the robustness principle and
> discourage writers from producing unnecessary padding, but state that
> readers should be tolerant of it given there are writers in the wild
> producing such data?

+1

On Wed, Apr 15, 2026 at 9:09 AM Raphael Taylor-Davies <[email protected]> wrote:

> Hi,
>
> We ran into a related issue to this in arrow-rs when implementing late
> materialization [1]. As you point out, the fact that the encoding stores
> the length of a run / 8 means decoders already need additional
> information from the page metadata to correctly interpret RLE data
> anyway; having additional padding doesn't introduce any additional
> ambiguity beyond this.
>
> I definitely think this quirk of the encoding should be called out more
> clearly in the specification. With regards to writing additional
> padding, perhaps we could appeal to the robustness principle and
> discourage writers from producing unnecessary padding, but state that
> readers should be tolerant of it given there are writers in the wild
> producing such data?
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow-rs/pull/4319
>
> On 15/04/2026 15:53, Jan Finis wrote:
> > Hey folks,
> >
> > we have found that DuckDB produces "too long" bitpacks at the end of an
> > RLE encoded page. That is, it creates a bitpack of length 256 even though
> > there are just 14 entries in the page. See here for details (github login
> > required): https://github.com/duckdb/duckdb/discussions/22089
> >
> > Our engine currently rejects reading this and flags it as corrupted. But
> > the question is now: Who's in the right here? Is this legal? E.g., is it
> > okay for a page with 5 entries to have an RLE bitpack with 256 entries?
> >
> > We read the spec multiple times but it seems that it's underspecified. It
> > is somewhat implicitly clear that you are at least allowed to pad the run
> > at the end of a page to the next multiple of 8, as a bitpack run's length
> > *can* only be a multiple of 8, so a page with 5 entries has no way of
> > encoding these in a bit pack of the correct size (but that is also
> > implicit and not mentioned). But is it okay to go larger than this? E.g.,
> > 256 as in the case of DuckDB.
> >
> > The spec mandates that pages write their contents (R-levels, D-levels,
> > Data) behind each other without padding. But nothing says whether the
> > number of encoded entries may be larger than the actual number of entries
> > in the page or whether runs may be larger than required at the end of a
> > page.
> >
> > There are some sentences on the delta encoding, but no similar sentences
> > for RLE:
> >
> >> If there are not enough values to fill the last miniblock, we pad the
> >> miniblock so that its length is always the number of values in a full
> >> miniblock multiplied by the bit width. The values of the padding bits
> >> should be zero, but readers must accept paddings consisting of arbitrary
> >> bits as well.
> >
> > So, what's the spec-lawyer answer here? And depending on what that answer
> > is, should we update the Parquet spec to clarify this case? It seems to
> > be truly underspecified.
> >
> > Cheers,
> > Jan
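
For readers following the thread, here is a minimal sketch of what a "tolerant" reader might look like. It is not taken from arrow-rs, DuckDB, or any other implementation mentioned above; the function names (read_uleb128, decode_hybrid) and the num_values parameter are hypothetical. It assumes the buffer starts directly at the first run header (no 4-byte length prefix), that the header is a ULEB128 varint whose low bit selects bit-packed vs. RLE, and that bit-packed values are laid out from the least significant bit upwards. The value count comes from the page metadata, and anything a run encodes beyond that count is treated as padding and dropped.

```python
def read_uleb128(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("buffer ended inside a run header")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_hybrid(buf: bytes, bit_width: int, num_values: int) -> list[int]:
    """Decode RLE/bit-packed hybrid data, keeping only num_values entries.

    num_values is assumed to come from the page metadata. Entries encoded
    beyond it (padding in the final run) are silently ignored.
    """
    values: list[int] = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_uleb128(buf, pos)
        remaining = num_values - len(values)
        if header & 1:
            # Bit-packed run: the header stores the number of 8-value groups.
            groups = header >> 1
            nbytes = groups * bit_width
            if pos + nbytes > len(buf):
                raise ValueError("bit-packed run extends past the buffer")
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            # Take at most `remaining` values; the rest of the run is padding.
            for i in range(min(groups * 8, remaining)):
                values.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: the header stores the repeat count, followed by the
            # value in ceil(bit_width / 8) little-endian bytes.
            run_len = header >> 1
            nbytes = (bit_width + 7) // 8
            if pos + nbytes > len(buf):
                raise ValueError("RLE run value extends past the buffer")
            value = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            values.extend([value] * min(run_len, remaining))
    return values


# Five 1-bit values [1, 0, 1, 1, 0], first as one group of 8 (minimal
# padding), then as 32 groups (256 encoded entries, as in the DuckDB case).
minimal = bytes([0x03, 0x0D])
padded = bytes([0x41, 0x0D]) + bytes(31)
print(decode_hybrid(minimal, 1, 5))
print(decode_hybrid(padded, 1, 5))
```

Under these assumptions both buffers decode to the same five values, which is the behaviour the robustness-principle suggestion would ask of readers. A strict reader would instead raise an error when the last run encodes more than the next multiple of 8 above the remaining count.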
