>
> For LZ4 we might choose the 2nd option so the user can still use it if
> they understand the risks (e.g. using only Java-based components, in
> which case it should work fine). After we have the new, properly
> specified LZ4 available in parquet-format and in the implementations,
> we should completely disable (3rd option) the write path of the
> original LZ4.

+1, I think this is the most user-friendly approach.

> I agree that we need more comprehensive integration testing. (I created
> a JIRA about my ideas: PARQUET-1985
> <https://issues.apache.org/jira/browse/PARQUET-1985>.) Meanwhile, it
> also turned out that for features like compression codecs, the
> specification is key. We might always have specific data that we did
> not include in the integration tests and that still fails at runtime.

+1, both the specification and integration tests are important. I think
they complement each other quite well.

On Wed, Feb 17, 2021 at 2:18 AM Gabor Szadovszky <ga...@apache.org> wrote:

> When we are talking about deprecating an already released feature (this
> time LZ4) we shall always support the read path for the existing files.
> I think this is a fundamental requirement for every file format. The
> deprecation is more about the write path, which has different levels.
> E.g. we don't suggest users use such features, we disable the related
> feature by default with an option for the user to enable it, or we
> disable the feature entirely so the user is not able to use it anymore.
>
> For LZ4 we might choose the 2nd option so the user can still use it if
> they understand the risks (e.g. using only Java-based components, in
> which case it should work fine). After we have the new, properly
> specified LZ4 available in parquet-format and in the implementations,
> we should completely disable (3rd option) the write path of the
> original LZ4.
>
> I agree that we need more comprehensive integration testing. (I created
> a JIRA about my ideas: PARQUET-1985
> <https://issues.apache.org/jira/browse/PARQUET-1985>.) Meanwhile, it
> also turned out that for features like compression codecs, the
> specification is key. We might always have specific data that we did
> not include in the integration tests and that still fails at runtime.
>
> On Wed, Feb 17, 2021 at 4:59 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > >
> > > I think it would be a mistake for someone who has written
> > > Hadoop-LZ4 for several years with parquet-mr to all of a sudden be
> > > no longer able to read their files. (I believe that parquet-mr with
> > > this pattern has been incorporated into various libraries for
> > > several years now--correct me if I'm wrong.)
> >
> > I think this aligns with option #4 proposed above?
> >
> > It seems, simply by usage, that either #2 (more reverse engineering
> > on the C++ side) or #4 would be the way to go. I'll admit I am not
> > motivated to reverse engineer this at the moment. I think on the C++
> > side at least we should disable writing to avoid further
> > fragmentation. It also isn't clear to me whether we have solid data
> > to suggest the other compression formats aren't broken as well.
> >
> > I think this has come up before, but more comprehensive integration
> > testing is key here (there was another recent bug [1] where some of
> > the pyarrow/C++-written files aren't readable by some Java libraries,
> > so I think there are potentially more important issues we need to
> > deal with).
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-11629
> >
> > On Tue, Feb 16, 2021 at 10:09 AM Jacques Nadeau <jacq...@apache.org>
> > wrote:
> >
> > > There is some ambiguity in the discussion and proposals here around
> > > deprecating future writing versus supporting reading of already
> > > written data, and what it means to deprecate something in the
> > > format specification.
> > >
> > > I think it would be a mistake for someone who has written
> > > Hadoop-LZ4 for several years with parquet-mr to all of a sudden be
> > > no longer able to read their files. (I believe that parquet-mr with
> > > this pattern has been incorporated into various libraries for
> > > several years now--correct me if I'm wrong.)
> > >
> > > On Tue, Feb 16, 2021 at 8:26 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> > >
> > > > Thank you for the detailed summary of the LZ4 situation, Antoine!
> > > >
> > > > The Parquet file format should be properly specified for every
> > > > implementation. It was a mistake of the parquet-mr developers
> > > > that we thought the Hadoop implementation of LZ4 conforms to the
> > > > LZ4 specification, and a fault of the Parquet community that we
> > > > extended parquet-format with the LZ4 option without checking that
> > > > there are two official formats to choose from (not to mention the
> > > > third, Hadoop one).
> > > >
> > > > I agree that option 4 is the only way we can step forward from
> > > > this situation. We shall deprecate LZ4 in parquet-format, and in
> > > > the meantime we should agree on which officially specified LZ4
> > > > format we want to introduce.
> > > >
> > > > Of course, we may try to improve compatibility with the existing
> > > > LZ4 files, but we should not encourage our users to use this
> > > > compression for now.
> > > >
> > > > Because we already have parquet-mr releases that use the
> > > > under-specified Hadoop LZ4 codec, I don't feel it is that urgent
> > > > to block the current parquet-mr release because of this. We shall
> > > > update parquet-format to make the LZ4 situation clear, then
> > > > create a release. Then we can start working on
> > > > deprecating/blocking the write path of the current implementation
> > > > and implementing properly specified LZ4 support in all the
> > > > implementations.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > > On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > This is a proposal to deprecate and/or remove LZ4 compression
> > > > > from the Parquet specification.
> > > > >
> > > > > Abstract
> > > > > --------
> > > > >
> > > > > Despite several attempts by the parquet-cpp developers, we were
> > > > > not able to reach the point where LZ4-compressed Parquet files
> > > > > are bidirectionally compatible between parquet-cpp and
> > > > > parquet-mr. Other implementations are having, or have had,
> > > > > similar issues. My conclusion is that the Parquet spec doesn't
> > > > > allow independent reimplementation of the LZ4 compression
> > > > > format required by parquet-mr. Therefore, LZ4 compression
> > > > > should be removed from the spec (possibly replaced with another
> > > > > enum value for a properly-specified, interoperable, LZ4-backed
> > > > > compression scheme).
> > > > >
> > > > > About LZ4
> > > > > ---------
> > > > >
> > > > > LZ4 is an extremely fast compression format and library.
> > > > > Decompression speeds of 4 GB/s can routinely be achieved in
> > > > > pure software, with compression speeds around 800 MB/s.
> > > > > Compression ratios are close to those achieved by Snappy and
> > > > > LZO, but at vastly higher speeds.
> > > > >
> > > > > Two formats are officially defined by the LZ4 project: the
> > > > > block format and the frame format. The frame format, as the
> > > > > name suggests, is higher-level and contains all the information
> > > > > required to decompress arbitrary data buffers. The block format
> > > > > is to be used when such information is made available
> > > > > separately (for example as part of ancillary metadata, protocol
> > > > > headers, etc.).
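For readers less familiar with the two official formats, here is a rough
Java sketch of the difference using the lz4-java library (class names per
its public API; the payload is made up). The block format needs the
decompressed size supplied out-of-band, while the frame format is
self-describing:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FrameInputStream;
    import net.jpountz.lz4.LZ4FrameOutputStream;

    public class Lz4FormatsSketch {
        public static void main(String[] args) throws IOException {
            byte[] raw = "some page data".getBytes(StandardCharsets.UTF_8);
            LZ4Factory lz4 = LZ4Factory.fastestInstance();

            // Block format: bare compressed bytes. The reader must learn
            // the decompressed length some other way (for Parquet, the
            // page header already carries it).
            byte[] block = lz4.fastCompressor().compress(raw);
            byte[] back = lz4.fastDecompressor().decompress(block, raw.length);

            // Frame format: self-describing container (magic number,
            // frame descriptor, optional checksums); decompressible
            // without any external metadata.
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (LZ4FrameOutputStream out = new LZ4FrameOutputStream(sink)) {
                out.write(raw);
            }
            byte[] buf = new byte[raw.length];
            try (LZ4FrameInputStream in = new LZ4FrameInputStream(
                    new ByteArrayInputStream(sink.toByteArray()))) {
                int n = in.read(buf);  // no size hint needed to decompress
            }
        }
    }

This asymmetry is exactly why a block-format reading of the spec is
tempting: Parquet already carries the uncompressed size in its page
metadata, as noted further down.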
> > > > >
> > > > > Core issue
> > > > > ----------
> > > > >
> > > > > LZ4 compression in Parquet (or, more accurately, in parquet-mr,
> > > > > which seems to be the point of reference to look to when
> > > > > implementing the Parquet format) uses a home-baked framing
> > > > > around LZ4 block compression that's not specified anywhere. The
> > > > > only way to get information about it is to read the Hadoop
> > > > > source code.
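If I read the Hadoop source correctly, each "frame" it emits is a pair
of 4-byte big-endian lengths followed by a raw LZ4 block. A hedged
sketch of the writer side (this is reverse engineering, not a spec, and
Hadoop may split one input into several length-prefixed chunks):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import net.jpountz.lz4.LZ4Factory;

    // Sketch of the undocumented Hadoop-style framing, as gleaned from
    // Hadoop's BlockCompressorStream. NOT an official LZ4 format.
    public class HadoopLz4FrameSketch {
        static byte[] frame(byte[] raw) throws IOException {
            byte[] block =
                LZ4Factory.fastestInstance().fastCompressor().compress(raw);
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(sink);
            out.writeInt(raw.length);    // 4-byte big-endian uncompressed size
            out.writeInt(block.length);  // 4-byte big-endian compressed size
            out.write(block);            // LZ4 block format bytes
            return sink.toByteArray();   // one frame; a page may hold several
        }
    }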
> > > > >
> > > > > Being unspecified, it also doesn't say if there are additional
> > > > > constraints on the parameters (e.g. frame size). Such
> > > > > constraints will be implementation-defined, undocumented, and
> > > > > only discoverable through tedious testing and iteration, with
> > > > > no guarantee of ever achieving 100% compatibility.
> > > > >
> > > > > Note that LZ4 compression itself officially comes in two
> > > > > formats: a low-level block format and a higher-level framed
> > > > > format. But parquet-mr uses a third, custom framing format
> > > > > that's not part of the LZ4 format.
> > > > >
> > > > > History of compatibility issues
> > > > > -------------------------------
> > > > >
> > > > > 1)
> > > > >
> > > > > parquet-cpp originally (naively?) assumed that "LZ4
> > > > > compression" in the Parquet spec meant the LZ4 block format.
> > > > > After all, information about data size is already available in
> > > > > the Parquet metadata, so no specific framing is theoretically
> > > > > required around the compressed data. However, it quickly became
> > > > > apparent that this interpretation was incompatible with files
> > > > > produced by parquet-mr (and vice-versa: files produced by
> > > > > parquet-cpp could not be read with parquet-mr).
> > > > >
> > > > > A first issue was posted to suggest switching the Parquet spec
> > > > > (and parquet-mr) to the LZ4 framed format:
> > > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > >
> > > > > However, this didn't come to a resolution, because people
> > > > > didn't want to break compatibility with previous versions of
> > > > > parquet-mr (note that this concern would necessarily shift the
> > > > > burden of compatibility breakage onto other implementations).
> > > > >
> > > > > Relatedly, there is an issue open for Hadoop, which also
> > > > > didn't get a resolution:
> > > > > https://issues.apache.org/jira/browse/HADOOP-12990
> > > > >
> > > > > 2)
> > > > >
> > > > > To avoid making things worse, parquet-cpp developers then
> > > > > decided to (temporarily?) disable writing LZ4 files from C++:
> > > > > https://issues.apache.org/jira/browse/ARROW-9424
> > > > >
> > > > > (Note that this is parquet-cpp deliberately crippling its own
> > > > > feature set in order to work around an issue created by an
> > > > > undocumented format implemented in parquet-mr.)
> > > > >
> > > > > At this point, though, the LZ4 *reader* in parquet-cpp could
> > > > > still not read files produced by parquet-mr. So, besides being
> > > > > frustrating for C++ users (and users of bindings to the C++
> > > > > library, e.g. Python, Ruby...), this decision also did not
> > > > > improve interoperability in the Java -> C++ direction.
> > > > >
> > > > > 3)
> > > > >
> > > > > parquet-cpp developers decided to implement format detection
> > > > > so as to read LZ4 files produced by parquet-mr, but also LZ4
> > > > > files produced by previous versions of parquet-cpp:
> > > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > > > >
> > > > > In addition, the write path was re-enabled, this time
> > > > > producing files that (should!) conform to the parquet-mr
> > > > > implementation of LZ4 compression.
> > > > >
> > > > > This seemed to work fine.
> > > > >
> > > > > 4)
> > > > >
> > > > > While PARQUET-1878 (see above) seemed to work fine in basic
> > > > > tests, it was later reported that some files did not read
> > > > > properly.
> > > > >
> > > > > Indeed, our (parquet-cpp) compatibility code assumed that only
> > > > > a single "parquet-mr LZ4 frame" was ever written out for a
> > > > > single data page. However, it turns out that parquet-mr can
> > > > > produce several such frames for larger data pages (it seems to
> > > > > frame at around 128 KiB boundaries).
> > > > >
> > > > > While this of course seems reasonable, the format being
> > > > > undocumented prevented us from foreseeing this situation. Once
> > > > > the issue was diagnosed, we (parquet-cpp developers) pushed a
> > > > > fix for it:
> > > > > https://issues.apache.org/jira/browse/ARROW-11301
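The fix essentially amounts to looping over frames until the page buffer
is exhausted, instead of assuming exactly one. A hedged Java sketch,
assuming the same [uncompressed size][compressed size][block] layout as
in the earlier sketch (helper and class names are mine):

    import java.nio.ByteBuffer;
    import net.jpountz.lz4.LZ4Factory;

    // Sketch: decompress a data page that may contain several
    // Hadoop-style frames back to back (the case that broke us).
    public class HadoopLz4PageReaderSketch {
        static byte[] decompressPage(byte[] page, int totalRawSize) {
            byte[] out = new byte[totalRawSize];
            ByteBuffer in = ByteBuffer.wrap(page);  // big-endian by default
            int pos = 0;
            while (in.remaining() > 0) {            // N frames, not just 1
                int rawLen = in.getInt();           // uncompressed frame size
                int compLen = in.getInt();          // compressed frame size
                byte[] block = new byte[compLen];
                in.get(block);
                LZ4Factory.fastestInstance().fastDecompressor()
                          .decompress(block, 0, out, pos, rawLen);
                pos += rawLen;
            }
            return out;
        }
    }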
> > > > >
> > > > > 5)
> > > > >
> > > > > We were happy after this latest fix... only for a short time.
> > > > > Now the reverse problem happened: some LZ4 files produced by
> > > > > parquet-cpp cannot be read by parquet-mr:
> > > > > https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> > > > >
> > > > > I decided that this couldn't be a parquet-cpp issue anymore.
> > > > > Evidently, the fact that parquet-mr uses an unspecified framing
> > > > > format with unspecified constraints and limits prevents other
> > > > > implementations from being compatible.
> > > > >
> > > > > So I asked the reporter to open a parquet-mr issue instead,
> > > > > and then took the liberty of marking the issue as blocker:
> > > > > https://issues.apache.org/jira/browse/PARQUET-1974
> > > > >
> > > > > 6)
> > > > >
> > > > > Unsurprisingly, other Parquet implementations are also
> > > > > suffering from this.
> > > > >
> > > > > * fastparquet (a Parquet implementation written partly in
> > > > >   Python) had initially used the LZ4 block format, then
> > > > >   switched to the LZ4 frame format to achieve compatibility
> > > > >   with parquet-cpp (since both are used in the Python
> > > > >   ecosystem):
> > > > >   https://github.com/dask/fastparquet/issues/314
> > > > >
> > > > >   Of course, fastparquet probably won't be able to read or
> > > > >   write files compatible with parquet-mr...
> > > > >
> > > > > * The Rust Parquet implementation that's part of the Rust
> > > > >   Arrow implementation is also having compatibility issues due
> > > > >   to making the wrong guess:
> > > > >   https://issues.apache.org/jira/browse/ARROW-11381
> > > > >
> > > > > * Impala seems to have had to change its LZ4 implementation to
> > > > >   also match the undocumented parquet-mr framing format:
> > > > >   https://issues.apache.org/jira/browse/IMPALA-8617
> > > > >
> > > > >   (I have no idea whether this solved all LZ4 problems for
> > > > >   Impala, and whether it broke compatibility with previous
> > > > >   Impala versions.)
> > > > >
> > > > > Possible solutions
> > > > > ------------------
> > > > >
> > > > > 1) Do nothing.  The Parquet ecosystem is forever fragmented, even
> > > > > though the official Parquet documentation doesn't say so.
> > > > >
> > > > > 2) parquet-cpp finally achieves compatibility through
> > > > > additional reverse-engineering efforts. While I have no
> > > > > authority to prevent this, I also have zero personal motivation
> > > > > to achieve it anymore. I suspect other parquet-cpp developers
> > > > > are not significantly more motivated than I am (but I may be
> > > > > wrong).
> > > > >
> > > > > 3) parquet-mr takes its share of the effort by striving to be
> > > > > compatible with files produced by (the current iteration of)
> > > > > parquet-cpp. I have no idea whether this is easily doable, and
> > > > > whether there is any motivation or workforce to do it.
> > > > > Therefore, I will conservatively rate this as "unlikely in the
> > > > > short term".
> > > > >
> > > > > 4) Admitting that LZ4 compatibility currently is not an
> > > > > achievable goal, the LZ4 compression option is deprecated and
> > > > > removed from the Parquet spec. Implementations display a
> > > > > warning message when an LZ4 file is being read. Writing LZ4
> > > > > files is disabled.
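To make option #4 concrete, the guard in a writer could be as simple as
the following sketch (hypothetical names, not any actual parquet-mr or
parquet-cpp API; reading would stay enabled but log a deprecation
warning):

    // Hypothetical write-path guard for a deprecated codec.
    public class CodecDeprecationSketch {
        static void checkCodecForWrite(String codec) {
            if ("LZ4".equals(codec)) {
                throw new IllegalArgumentException(
                    "LZ4 (undocumented Hadoop framing) is deprecated and "
                    + "disabled for writing; a properly specified LZ4 codec "
                    + "may be added under a new enum value");
            }
        }
    }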
> > > > >
> > > > > Discussion
> > > > > ----------
> > > > >
> > > > > Solution #1 seems evidently undesirable to me (though it also
> > > > > seems that nobody on the Java side was tremendously impatient
> > > > > to solve this issue).
> > > > >
> > > > > As I said, solution #2 is rather unlikely.
> > > > >
> > > > > Java developers would have to evaluate how likely solution #3
> > > > > is. We should avoid depending on promises that nobody is really
> > > > > willing to hold, though. Please only say "we're going to solve
> > > > > this" if you're really willing to push it through.
> > > > >
> > > > > In any case, even if solution #2 or #3 were to be implemented,
> > > > > it would not necessarily help future implementations, as long
> > > > > as the format is still unspecified.
> > > > >
> > > > > In the end, while a bit frustrating for everyone, solution #4
> > > > > looks like the only reasonable one for the time being.
> > > > >
> > > > > Future LZ4 format
> > > > > -----------------
> > > > >
> > > > > While we should deprecate the current LZ4 format, this doesn't
> > > > > preclude adding another, this time well-specified, LZ4 format
> > > > > in the future. It would simply have to use a new
> > > > > `CompressionCodec` enum value in the Thrift definition.
> > > > >
> > > > > Most certainly, it should be either of the two formats officially
> > > > > defined by the LZ4 project (the block format or the frame format).
> > > > >
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
