Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-13 Thread Antoine Pitrou
Agreed, but even then, if some Parquet files are generated inside of a well-defined system which only needs to be interoperable with itself, it's not necessarily harmful to allow LZ4 compression when writing new files. Regards Antoine. On 13/07/2020 at 17:07, Wes McKinney wrote: > I didn’t s

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-13 Thread Krisztián Szűcs
On Mon, Jul 13, 2020 at 11:15 AM Antoine Pitrou wrote: > > > I'm not sure that's a good idea. There are probably Parquet files that > are only ever used with the Arrow implementation (Arrow C++, Arrow > Python, Arrow R...). I tend to agree with Antoine here. As an alternative to disabling the co

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-13 Thread Wes McKinney
I didn’t say to disable _reading_ them, only writing them. On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou wrote: > > I'm not sure that's a good idea. There are probably Parquet files that > are only ever used with the Arrow implementation (Arrow C++, Arrow > Python, Arrow R...). > > I admit I'm

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-13 Thread Patrick Pai
I'll volunteer to disable writing/reading LZ4. I'll submit a patch in the next few days. On 2020/07/12 22:11:33, Wes McKinney wrote: > Since there hasn't been other movement on this, we need to disable > writing LZ4-compressed files until this can be investigated more > thoroughly. If someone w

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-13 Thread Antoine Pitrou
I'm not sure that's a good idea. There are probably Parquet files that are only ever used with the Arrow implementation (Arrow C++, Arrow Python, Arrow R...). I admit I'm also not terribly bothered about this, since the Parquet community itself doesn't seem to care much about the issue (it has

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-12 Thread Wes McKinney
Since there hasn't been other movement on this, we need to disable writing LZ4-compressed files until this can be investigated more thoroughly. If someone wants to submit a patch, that would be helpful; otherwise I can take a look in the next couple of days. On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitro

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou wrote: > > > On 06/07/2020 at 17:57, Steve Kim wrote: > > The Parquet format specification is ambiguous about the exact details of > > LZ4 compression. However, the *de facto* reference implementation in Java > > (parquet-mr) uses the Hadoop LZ4 cod

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++? Changing the lz4 implementation to be compatible with parquet-mr/hadoop would break compatibility with any existing files that were written by Parquet C++ using lz4 compression. I believe that it is not possible to reliab

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Antoine Pitrou
On 06/07/2020 at 17:57, Steve Kim wrote: > The Parquet format specification is ambiguous about the exact details of > LZ4 compression. However, the *de facto* reference implementation in Java > (parquet-mr) uses the Hadoop LZ4 codec. > > I think that it is important for Parquet c++ to have com

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec. I think that it is important for Parquet c++ to have compatibility and feature parity with parquet-mr when poss
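For readers unfamiliar with the difference: as far as I understand it, the Hadoop LZ4 codec does not store a bare LZ4 block. It prefixes each chunk with two big-endian 32-bit integers (decompressed size, then compressed size) ahead of the raw LZ4 block data. The sketch below only walks that framing; it does not decompress anything, and the function names `iter_hadoop_frames` and `looks_like_hadoop_lz4` are hypothetical, not part of Arrow or parquet-mr.

```python
import struct
from typing import Iterator, Tuple

# Assumed Hadoop framing: two big-endian uint32 values per frame
# (decompressed size, compressed size), followed by the LZ4 block bytes.
HEADER = struct.Struct(">II")


def iter_hadoop_frames(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (decompressed_size, compressed_payload) for each frame in buf."""
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < HEADER.size:
            raise ValueError("truncated frame header")
        decompressed_size, compressed_size = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        end = offset + compressed_size
        if end > len(buf):
            raise ValueError("frame length exceeds buffer")
        yield decompressed_size, buf[offset:end]
        offset = end


def looks_like_hadoop_lz4(buf: bytes) -> bool:
    """Heuristic only: True if buf parses cleanly as one or more frames."""
    try:
        frames = list(iter_hadoop_frames(buf))
    except ValueError:
        return False
    return len(frames) > 0
```

A raw LZ4 block can, by chance, also parse as a valid sequence of such frames, so a check like this is a hint rather than proof of which writer produced the data.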

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Antoine Pitrou
Well, it depends how important speed is, but LZ4 has extremely fast decompression, even compared to Snappy: https://github.com/lz4/lz4#benchmarks Regards Antoine. On 02/07/2020 at 19:47, Christian Hudon wrote: > At least for us, the advantages of Parquet are speed and interoperability > in

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Christian Hudon
At least for us, the advantages of Parquet are speed and interoperability in the context of longer-term data storage, so I would tend to say "reasonably conservative". On Wed, Jul 1, 2020 at 09:32, Antoine Pitrou wrote: > > I don't have a sense of how conservative Parquet users generally

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-01 Thread Antoine Pitrou
I don't have a sense of how conservative Parquet users generally are. Is it worth adding an LZ4_FRAMED compression option in the Parquet format, or would people just not use it? Regards Antoine. On Tue, 30 Jun 2020 14:33:17 +0200 "Uwe L. Korn" wrote: > I'm also in favor of disabling support f
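One practical property of the standard LZ4 frame format, which an option like the proposed LZ4_FRAMED would presumably rely on, is that every frame starts with the magic number 0x184D2204 stored little-endian. That makes framed data self-identifying in a way a raw LZ4 block is not. A minimal sketch, with the helper name `starts_with_lz4_frame_magic` being hypothetical:

```python
# LZ4 frame magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = bytes.fromhex("04224d18")


def starts_with_lz4_frame_magic(buf: bytes) -> bool:
    """Return True if buf begins with the LZ4 frame format magic number."""
    return buf[:4] == LZ4_FRAME_MAGIC
```

This only inspects the first four bytes; it says nothing about whether the rest of the buffer is a well-formed frame.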

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-30 Thread Uwe L. Korn
I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long-term will harm more than not supporting LZ4 for a while. Snappy is generally more used than LZ4 in this category as it has been available since th

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-29 Thread Wes McKinney
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou wrote: > > > On 25/06/2020 at 00:02, Wes McKinney wrote: > > hi folks, > > > > (cross-posting to dev@arrow and dev@parquet since there are > > stakeholders in both places) > > > > It seems there are still problems at least with the C++ implementatio

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-25 Thread Antoine Pitrou
On 25/06/2020 at 00:02, Wes McKinney wrote: > hi folks, > > (cross-posting to dev@arrow and dev@parquet since there are > stakeholders in both places) > > It seems there are still problems at least with the C++ implementation > of LZ4 compression in Parquet files > > https://issues.apache.or

[DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-24 Thread Wes McKinney
hi folks, (cross-posting to dev@arrow and dev@parquet since there are stakeholders in both places) It seems there are still problems at least with the C++ implementation of LZ4 compression in Parquet files https://issues.apache.org/jira/browse/PARQUET-1241 https://issues.apache.org/jira/browse/P