Hey folks,

we have found that DuckDB produces "too long" bit-packed runs at the end of an RLE-encoded page. That is, it creates a bit-packed run of length 256 even though there are just 14 entries in the page. See here for details (GitHub login required): https://github.com/duckdb/duckdb/discussions/22089
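
To make the numbers concrete, here is a minimal sketch of the arithmetic involved. It assumes the bit-packed run header is the varint of `(groups << 1) | 1`, where `groups` is the run length divided by 8; that is my reading of the RLE/bit-packing hybrid grammar in the encoding spec. The function names are made up for illustration and are not taken from DuckDB or any Parquet library.

```python
def min_bitpacked_groups(num_values: int) -> int:
    """Smallest number of 8-value groups that can hold num_values."""
    return (num_values + 7) // 8


def bitpacked_header(groups: int) -> int:
    """Header value before varint encoding (assumed layout: groups << 1 | 1)."""
    return (groups << 1) | 1


def padding_values(num_values: int, groups: int) -> int:
    """Number of encoded values in the run that go beyond the page's real entries."""
    return groups * 8 - num_values


if __name__ == "__main__":
    entries = 14                              # entries actually present in the page
    minimal = min_bitpacked_groups(entries)   # tightest run a writer could emit
    observed = 256 // 8                       # the 256-value run from the DuckDB file

    print("minimal run: ", minimal * 8, "values, header", bitpacked_header(minimal))
    print("observed run:", observed * 8, "values, header", bitpacked_header(observed))
    print("padding in minimal run: ", padding_values(entries, minimal))
    print("padding in observed run:", padding_values(entries, observed))
```

With 14 entries the tightest possible bit-packed run already has 16 values, so some padding is unavoidable; the open question is only whether padding beyond that is allowed.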

Our engine currently rejects such a page and flags it as corrupted. So the question is now: who is in the right here? Is this legal? E.g., is it okay for a page with 5 entries to contain a bit-packed run with 256 entries?

We read the spec multiple times, but it seems to be underspecified. It is somewhat implicitly clear that you are at least allowed to pad the run at the end of a page to the next multiple of 8: the length of a bit-packed run *can* only be a multiple of 8, so a page with 5 entries has no way of encoding these in a bit-packed run of exactly the right size (but even that is only implicit and not mentioned anywhere). But is it okay to go larger than this? E.g., 256 as in the case of DuckDB.

The spec mandates that pages write their contents (R-levels, D-levels, data) back to back without padding. But nothing says whether the number of encoded entries may be larger than the actual number of entries in the page, or whether the run at the end of a page may be longer than required.

There are some sentences on this for the delta encoding, but no similar sentences for RLE:

> If there are not enough values to fill the last miniblock, we pad the
> miniblock so that its length is always the number of values in a full
> miniblock multiplied by the bit width. The values of the padding bits
> should be zero, but readers must accept paddings consisting of arbitrary
> bits as well.

So, what's the spec-lawyer answer here? And depending on what that answer is, should we update the Parquet spec to clarify this case? It seems to be truly underspecified.

Cheers,
Jan
