Hi All,
Recently it was reported that many of the Arrow Parquet readers,
including arrow-cpp [1], pyarrow and arrow-rs [2], do not support GZIP
compressed pages containing multiple members [3]. It would also appear
that other Parquet implementations, such as DuckDB, have similar
issues [4].
This in turn led to some discussion as to whether such pages are
permissible according to the Parquet specification [5]. The proposed
compromise is to state explicitly that readers should support multiple
members, but to recommend that writers not produce such pages by
default, given the non-trivial install base where they will cause
issues, including silent data corruption. I have tried to encode this
in [6], and welcome any feedback.
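
To illustrate the failure mode, here is a minimal sketch in Python using
only the standard library. It is not the code of any of the readers
mentioned above; the payload and the helper name decompress_all_members
are made up for illustration. It shows how a decoder that handles only
one member can return truncated data without raising an error, and one
possible way to read every member.

```python
import gzip
import zlib

# Two independently compressed gzip members, concatenated back to back.
# RFC 1952 allows a gzip stream to consist of several such members.
first = gzip.compress(b"hello ")
second = gzip.compress(b"world")
page = first + second  # hypothetical stand-in for a compressed page body

# gzip.decompress reads every member and concatenates the results.
all_members = gzip.decompress(page)

# A single zlib decompressor in gzip mode stops at the end of the first
# member. The remaining bytes end up in unused_data and no error is raised,
# which is how a reader can silently lose data.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_only = single.decompress(page)
leftover = single.unused_data


def decompress_all_members(data: bytes) -> bytes:
    """Decompress every gzip member in data (illustrative helper)."""
    chunks = []
    while data:
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(member.decompress(data))
        chunks.append(member.flush())
        if not member.eof:
            raise ValueError("truncated gzip member")
        # Whatever follows the finished member is the next member, if any.
        data = member.unused_data
    return b"".join(chunks)
```

Here first_only holds only the first member's payload, while both
gzip.decompress and the looping helper recover the full payload.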
Kind Regards,
Raphael Taylor-Davies
[1]: https://github.com/apache/arrow/pull/38272
[2]: https://github.com/apache/arrow-rs/pull/4951
[3]: https://datatracker.ietf.org/doc/html/rfc1952
[4]: https://github.com/apache/parquet-testing/pull/41#issuecomment-1770410715
[5]: https://github.com/apache/parquet-testing/pull/41
[6]: https://github.com/apache/parquet-format/pull/218