Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

Wes McKinney Mon, 06 Jul 2020 15:25:06 -0700

On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 06/07/2020 at 17:57, Steve Kim wrote:
> > The Parquet format specification is ambiguous about the exact details of
> > LZ4 compression. However, the *de facto* reference implementation in Java
> > (parquet-mr) uses the Hadoop LZ4 codec.
> >
> > I think that it is important for Parquet C++ to have compatibility and
> > feature parity with parquet-mr when possible. I prefer to change the
> > LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
> > that is used by parquet-mr (
> > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> > change will be quick and easy. I have an intern under my supervision who is
> > available to work on it full time, starting immediately. Please let me know
> > if we ought to proceed.
>
> Would that keep compatibility with existing files produced by Parquet C++?


Given that LZ4 has been constantly broken in C++ (first using the raw
format, then the block format -- still incompatible apparently), I
think we would recommend that, in the rare event that people have
LZ4-compressed files (likely not very common; FWIW, Snappy is what is
mostly used), they rewrite their files with a different codec using
e.g. pyarrow 0.17.1.

> Regards
>
> Antoine.
