There is some ambiguity in the discussion and proposals here between
deprecating the writing of new LZ4 data and continuing to support reading
of already-written data, and around what it means to deprecate something
in the format specification.

I think it would be a mistake for someone who has written Hadoop-LZ4
files for several years with parquet-mr to suddenly be unable to read
their files. (I believe that parquet-mr with this pattern has been
incorporated into various libraries for several years now--correct me if
I'm wrong.)

On Tue, Feb 16, 2021 at 8:26 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Thank you for the detailed summary of the LZ4 situation, Antoine!
>
> The Parquet file format should be properly specified for every
> implementation. It was the mistake of the parquet-mr developers that we
> thought the Hadoop implementation of LZ4 followed the LZ4
> specification, and the fault of the Parquet community that we extended
> parquet-format with the LZ4 option without checking that there were two
> official formats to choose from (not to mention the third, Hadoop one).
>
> I agree that option 4 is the only way we can step forward from this
> situation. We shall deprecate LZ4 in parquet-format, and in the meantime we
> should agree on which officially specified LZ4 format we want to
> introduce.
>
> Of course, we may try to improve compatibility with the existing LZ4 files
> but we should not encourage our users to use this compression for now.
>
> Because we already have parquet-mr releases that use the under-specified
> Hadoop LZ4 codec, I don't feel it is urgent to block the current
> parquet-mr release because of this. We shall update parquet-format to make
> the LZ4 situation clear, then create a release. Then, we can start working
> on deprecating/blocking the write path of the current implementation and
> implement properly specified LZ4 support in all the implementations.
>
> Regards,
> Gabor
>
> On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello,
> >
> > This is a proposal to deprecate and/or remove LZ4 compression from the
> > Parquet specification.
> >
> > Abstract
> > --------
> >
> > Despite several attempts by the parquet-cpp developers, we were not
> > able to reach the point where LZ4-compressed Parquet files are
> > bidirectionally compatible between parquet-cpp and parquet-mr. Other
> > implementations are having, or have had, similar issues.  My conclusion
> > is that the Parquet spec doesn't allow independent reimplementation of
> > the LZ4 compression format required by parquet-mr. Therefore, LZ4
> > compression should be removed from the spec (possibly replaced with
> > another enum value for a properly-specified, interoperable, LZ4-backed
> > compression scheme).
> >
> > About LZ4
> > ---------
> >
> > LZ4 is an extremely fast compression format and library.  Decompression
> > speeds of 4 GB/s can routinely be achieved in pure software, with
> > compression speeds around 800 MB/s.  Compression ratios are close to
> > those achieved by Snappy and LZO, but at vastly higher speeds.
> >
> > Two formats are officially defined by the LZ4 project: the block format
> > and the frame format.  The frame format, as the name suggests, is
> > higher-level and contains all the information required to decompress
> > arbitrary data buffers.  The block format is to be used when such
> > information is made available separately (for example as part of
> > ancillary metadata, protocol headers, etc.).
> >
> > Core issue
> > ----------
> >
> > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
> > seems to be the point of reference to look to when implementing the
> > Parquet format) uses a home-baked framing around LZ4 block compression
> > that's not specified anywhere. The only way to get information about it
> > is to read the Hadoop source code.
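> > For illustration, here is what the Hadoop framing appears to look like,
> > as inferred from reading the Hadoop source (so an assumption, not a
> > spec): each frame is a 4-byte big-endian uncompressed length, a 4-byte
> > big-endian compressed length, then one raw LZ4 block.  A minimal Python
> > sketch (helper names are mine; a real reader would additionally
> > LZ4-decompress the block):

```python
import struct

def wrap_hadoop_frame(uncompressed_len: int, lz4_block: bytes) -> bytes:
    # Hadoop-style framing, as inferred from the Hadoop source:
    # [4-byte BE uncompressed length][4-byte BE compressed length][LZ4 block]
    return struct.pack(">II", uncompressed_len, len(lz4_block)) + lz4_block

def read_hadoop_frame(buf: bytes):
    # Split one frame: returns (uncompressed_len, lz4_block, remaining_bytes).
    uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
    block = buf[8:8 + compressed_len]
    return uncompressed_len, block, buf[8 + compressed_len:]
```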
> >
> > Since the framing is unspecified, there is also no way to know whether
> > there are additional constraints on its parameters (e.g. frame size).
> > Such constraints are implementation-defined, undocumented, and only
> > discoverable through tedious testing and iteration, with no guarantee
> > of ever achieving 100% compatibility.
> >
> > Note that LZ4 compression itself officially comes in two formats: a
> > low-level block format and a higher-level frame format.  But parquet-mr
> > uses a third, custom framing format that's not part of either official
> > LZ4 format.
> >
> > History of compatibility issues
> > -------------------------------
> >
> > 1)
> >
> > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
> > Parquet spec meant the LZ4 block format.  After all, information
> > about data size is already available in the Parquet metadata, so no
> > specific framing is theoretically required around the compressed data.
> > However, it quickly became apparent that this interpretation was incompatible
> > with files produced by parquet-mr (and vice-versa: files produced by
> > parquet-cpp could not be read with parquet-mr).
> >
> > A first issue was posted to suggest switching the Parquet spec (and
> > parquet-mr) to the LZ4 framed format:
> > https://issues.apache.org/jira/browse/PARQUET-1241
> >
> > However, this didn't come to a resolution, because people didn't want
> > to break compatibility with previous versions of parquet-mr (note that
> > this concern effectively shifts the burden of compatibility
> > breakage onto other implementations).
> >
> > Relatedly, there is an issue open for Hadoop, which also didn't get a
> > resolution:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > 2)
> >
> > To avoid making things worse, parquet-cpp developers then decided to
> > (temporarily?) disable writing LZ4 files from C++:
> > https://issues.apache.org/jira/browse/ARROW-9424
> >
> > (note that this is parquet-cpp deliberately crippling its own feature
> > set in order to work around an issue created by an undocumented format
> > implemented in parquet-mr)
> >
> > At this point, though, the LZ4 *reader* in parquet-cpp could still not
> > read files produced by parquet-mr.  So, besides being frustrating for
> > C++ users (and users of bindings to the C++ library, e.g. Python,
> > Ruby...), this decision also did not improve interoperability in
> > the Java -> C++ direction.
> >
> > 3)
> >
> > parquet-cpp developers decided to implement format detection so as to
> > read LZ4 files produced by parquet-mr, as well as LZ4 files produced by
> > previous versions of parquet-cpp.
> > https://issues.apache.org/jira/browse/PARQUET-1878
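> > Such detection can only be heuristic.  A sketch of the kind of check
> > involved (my own illustration, not the actual parquet-cpp code): treat
> > the buffer as Hadoop-framed only if its first 8 bytes parse as a
> > plausible pair of big-endian lengths, given the uncompressed page size
> > already known from the Parquet metadata; otherwise fall back to
> > treating it as a raw LZ4 block:

```python
import struct

def looks_like_hadoop_framing(buf: bytes, expected_uncompressed: int) -> bool:
    # Heuristic only: the Hadoop-style header is two 4-byte big-endian
    # lengths.  The first frame's uncompressed size cannot exceed the
    # page's total uncompressed size (known from Parquet metadata), and
    # its compressed payload must fit inside the buffer.
    if len(buf) < 8:
        return False
    uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
    return (0 < uncompressed_len <= expected_uncompressed
            and 0 < compressed_len <= len(buf) - 8)
```

> > A heuristic like this can misfire on adversarial or unlucky raw-block
> > data, which is part of why the approach is fragile.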
> >
> > In addition, the write path was re-enabled, this time producing files
> > that (should!) conform to the parquet-mr implementation of LZ4
> > compression.
> >
> > This seemed to work fine.
> >
> > 4)
> >
> > While PARQUET-1878 (see above) seemed to work fine in basic tests, it
> > was later reported that some files did not read properly.
> >
> > Indeed, our (parquet-cpp) compatibility code assumed that only a single
> > "parquet-mr LZ4 frame" was ever written out for a data page.
> > However, it turns out that parquet-mr can produce several such frames
> > for larger data pages (it seems to split at roughly 128 KiB boundaries).
> >
> > While this is of course reasonable, the fact that the format is
> > undocumented prevented us from foreseeing this situation.  Once the
> > issue was diagnosed, we (parquet-cpp developers) pushed a fix for it:
> > https://issues.apache.org/jira/browse/ARROW-11301
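> > In sketch form, the fix amounts to looping over frames instead of
> > assuming exactly one per page (a Python illustration of mine, with the
> > LZ4 decompression of each block left out; the header layout is
> > inferred from the Hadoop source):

```python
import struct

def split_hadoop_frames(buf: bytes):
    # A page may contain several back-to-back Hadoop-style frames
    # ([4-byte BE uncompressed length][4-byte BE compressed length][block]);
    # parquet-mr appears to start a new frame at roughly 128 KiB boundaries.
    blocks = []
    while buf:
        _uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
        blocks.append(buf[8:8 + compressed_len])
        buf = buf[8 + compressed_len:]
    return blocks
```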
> >
> > 5)
> >
> > We were happy after this latest fix... only for a short time.  Now the
> > reverse problem happened: some LZ4 files produced by parquet-cpp cannot
> > be read by parquet-mr:
> >
> >
> https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> >
> > I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
> > the fact that parquet-mr uses an unspecified framing format with
> > unspecified constraints and limits prevents other implementations from
> > being compatible.
> >
> > So I asked the reporter to open a parquet-mr issue instead, and then
> > took the liberty of marking the issue as blocker:
> > https://issues.apache.org/jira/browse/PARQUET-1974
> >
> > 6)
> >
> > Unsurprisingly, other Parquet implementations are also suffering from
> > this.
> >
> > * fastparquet (a Parquet implementation written partly in Python) had
> > initially used the LZ4 block format, then switched to the LZ4 frame
> > format to achieve compatibility with parquet-cpp (since both are used
> > in the Python ecosystem):
> > https://github.com/dask/fastparquet/issues/314
> >
> > Of course, fastparquet probably won't be able to read or write files
> > compatible with parquet-mr...
> >
> > * The Rust Parquet implementation that's part of the Rust Arrow
> >   implementation is also having compatibility issues due to making the
> >   wrong guess:
> > https://issues.apache.org/jira/browse/ARROW-11381
> >
> > * Impala seems to have had to change its LZ4 implementation to also
> >   match the undocumented parquet-mr framing format:
> > https://issues.apache.org/jira/browse/IMPALA-8617
> >
> > (I have no idea whether this solved all LZ4 problems for Impala, and
> > whether it broke compatibility with previous Impala versions)
> >
> > Possible solutions
> > ------------------
> >
> > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > though the official Parquet documentation doesn't say so.
> >
> > 2) parquet-cpp finally achieves compatibility through additional
> > reverse-engineering efforts.  While I have no authority to prevent
> > this, I also have zero personal motivation to achieve it anymore.  I
> > suspect other parquet-cpp developers are not significantly more
> > motivated than I am (but I may be wrong).
> >
> > 3) parquet-mr takes its share of the effort by striving to be
> > compatible with files produced by (the current iteration of)
> > parquet-cpp. I have no idea whether this is easily doable, and whether
> > there is any motivation or workforce to do it. Therefore, I will
> > conservatively rate this as "unlikely in the short term".
> >
> > 4) Admitting that LZ4 compatibility is not currently an achievable
> > goal, the LZ4 compression option is deprecated and removed from the
> > Parquet spec.  Implementations display a warning message when an LZ4
> > file is read.  Writing LZ4 files is disabled.
> >
> > Discussion
> > ----------
> >
> > Solution #1 seems evidently undesirable to me (though it also seems
> > that nobody on the Java side was tremendously impatient to solve this
> > issue).
> >
> > As I said, solution #2 is rather unlikely.
> >
> > Java developers would have to evaluate how likely solution #3 is.  We
> > should avoid depending on promises that nobody is really willing to
> > hold, though.  Please only say "we're going to solve this" if you're
> > really willing to push it through.
> >
> > In any case, even if solution #2 or #3 were to be implemented, it would
> > not necessarily help future implementations, as long as the format is
> > still unspecified.
> >
> > In the end, while a bit frustrating for everyone, solution #4 looks like
> > the only reasonable one for the time being.
> >
> > Future LZ4 format
> > -----------------
> >
> > While we should deprecate the current LZ4 format, this doesn't preclude
> > adding another, this time well-specified, LZ4 format in the future.  It
> > would simply have to use a new `CompressionCodec` enum value in the
> > Thrift definition.
> >
> > Most certainly, it should be either of the two formats officially
> > defined by the LZ4 project (the block format or the frame format).
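> > For reference, the existing `CompressionCodec` values in parquet.thrift
> > can be mirrored as below; the `LZ4_RAW` entry is a hypothetical name and
> > number for such a future, well-specified codec (shown as a Python sketch,
> > since the actual change would go into the Thrift definition):

```python
from enum import IntEnum

class CompressionCodec(IntEnum):
    # Values as defined in parquet.thrift.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5       # the under-specified, Hadoop-framed variant discussed here
    ZSTD = 6
    # Hypothetical new value for a well-specified LZ4 codec (e.g. the
    # official LZ4 block format); name and number are illustrative only.
    LZ4_RAW = 7
```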
> >
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>