On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou <[email protected]> wrote:
>
>
> On 06/07/2020 at 17:57, Steve Kim wrote:
> > The Parquet format specification is ambiguous about the exact details of
> > LZ4 compression. However, the *de facto* reference implementation in Java
> > (parquet-mr) uses the Hadoop LZ4 codec.
> >
> > I think that it is important for Parquet C++ to have compatibility and
> > feature parity with parquet-mr when possible. I prefer to change the
> > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> > that is used by parquet-mr (
> > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> > change will be quick and easy. I have an intern under my supervision who is
> > available to work on it full time, starting immediately. Please let me know
> > if we ought to proceed.
>
> Would that keep compatibility with existing files produced by Parquet C++?

Given that LZ4 has been consistently broken in C++ (first using the raw format, then the block format, which is apparently still incompatible), I think we should recommend that, in the rare event that people have LZ4-compressed files (these are likely uncommon; FWIW, Snappy is what is mostly used), they rewrite their files with a different codec using e.g. pyarrow 0.17.1.

> Regards
>
> Antoine.
