Hi All,

It was recently reported that many of the Arrow Parquet readers, including arrow-cpp, pyarrow and arrow-rs, do not support GZIP-compressed pages containing multiple members [3]. Other Parquet implementations, such as DuckDB, appear to have similar issues [4]. This led to some discussion as to whether such pages are permissible under the Parquet specification [5]. The proposed compromise is to state explicitly that readers should support multiple members, while recommending that writers not produce such pages by default, given the non-trivial install base where they would cause issues, including silent data corruption. I have tried to encode this in [6], and welcome any feedback.
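
For anyone unfamiliar with the failure mode, here is a minimal Python sketch of how a reader that stops after the first GZIP member can silently return truncated data. It uses only the standard library `zlib` and `gzip` modules; the function names and the sample payload are made up for illustration and are not taken from any of the implementations mentioned above.

```python
import gzip
import zlib

# wbits value that tells zlib to expect a gzip header and trailer.
GZIP_WBITS = zlib.MAX_WBITS | 16


def decompress_first_member_only(data: bytes) -> bytes:
    """Hypothetical naive reader: decodes one gzip member and stops."""
    d = zlib.decompressobj(wbits=GZIP_WBITS)
    # Bytes after the first member end up in d.unused_data and are ignored.
    return d.decompress(data)


def decompress_all_members(data: bytes) -> bytes:
    """Hypothetical tolerant reader: decodes every concatenated member."""
    out = bytearray()
    while data:
        d = zlib.decompressobj(wbits=GZIP_WBITS)
        out += d.decompress(data)
        out += d.flush()
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Whatever follows the member just decoded is the next member.
        data = d.unused_data
    return bytes(out)


# Two independently compressed members concatenated into one buffer,
# which RFC 1952 allows.
page = gzip.compress(b"first half, ") + gzip.compress(b"second half")

print(decompress_first_member_only(page))
print(decompress_all_members(page))
```

The naive function raises no error on the multi-member buffer; it simply returns less data than was written, which is why the problem can go unnoticed.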

Kind Regards,

Raphael Taylor-Davies

[1]: https://github.com/apache/arrow/pull/38272
[2]: https://github.com/apache/arrow-rs/pull/4951
[3]: https://datatracker.ietf.org/doc/html/rfc1952
[4]: https://github.com/apache/parquet-testing/pull/41#issuecomment-1770410715
[5]: https://github.com/apache/parquet-testing/pull/41
[6]: https://github.com/apache/parquet-format/pull/218
