Hi Gabor,
I've filed https://issues.apache.org/jira/browse/PARQUET-1996 and submitted https://github.com/apache/parquet-format/pull/168 for the format updates. Hopefully this will get the ball rolling.

Best regards

Antoine.

On Mon, 8 Mar 2021 10:56:32 +0100 Gabor Szadovszky <ga...@apache.org> wrote:
> Hi Antoine,
>
> Whatever we might do in parquet-format, it would mean nothing without releasing it and adapting the implementations to it. Also, creating a new parquet-format release only for deprecating LZ4 makes little sense to me. So, I would suggest adding a simple comment to the LZ4 enum value about its deprecation and a reference to the newly supported LZ4 type. We also need the new LZ4 type and a proper specification for it. It might be a good idea to start a new doc in the parquet-format repo about the compression codec specifications; we would then reference that doc from the thrift file comments. After this is ready we can initiate a parquet-format release, and once that is done we can start adapting to it.
>
> My idea for adapting this in parquet-mr is to keep the old LZ4 available behind a configuration option whose default is to use the new one. (Of course, we also need to implement/get a proper implementation of the new LZ4 codec.)
>
> What do you think?
>
> Cheers,
> Gabor
>
> On Sat, Mar 6, 2021 at 4:40 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hello,
> >
> > We should move this forward. What should be the procedure for updating parquet-format?
> >
> > Best regards
> >
> > Antoine.
> >
> > On Tue, 16 Feb 2021 17:25:49 +0100 Gabor Szadovszky <ga...@apache.org> wrote:
> > > Thank you for the detailed summary of the LZ4 situation, Antoine!
> > >
> > > The Parquet file format should be properly specified for every implementation.
> > > It was the mistake of the parquet-mr developers that we assumed the Hadoop implementation of LZ4 followed the LZ4 specification, and the fault of the Parquet community that we extended parquet-format with the LZ4 option without checking that there were two official formats to choose from (not to mention the third, Hadoop one).
> > >
> > > I agree that option 4 is the only way we can step forward from this situation. We shall deprecate LZ4 in parquet-format, and in the meantime we should agree on which officially specified LZ4 format we want to introduce.
> > >
> > > Of course, we may try to improve compatibility with the existing LZ4 files, but we should not encourage our users to use this compression for now.
> > >
> > > Because we already have parquet-mr releases that use the under-specified Hadoop LZ4 codec, I don't feel it is urgent enough to block the current parquet-mr release over this. We shall update parquet-format to make the LZ4 situation clear, then create a release. Then we can start working on deprecating/blocking the write path of the current implementation and implement the properly specified LZ4 support in all the implementations.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > This is a proposal to deprecate and/or remove LZ4 compression from the Parquet specification.
> > > >
> > > > Abstract
> > > > --------
> > > >
> > > > Despite several attempts by the parquet-cpp developers, we were not able to reach the point where LZ4-compressed Parquet files are bidirectionally compatible between parquet-cpp and parquet-mr. Other implementations are having, or have had, similar issues.
> > > > My conclusion is that the Parquet spec doesn't allow independent reimplementation of the LZ4 compression format required by parquet-mr. Therefore, LZ4 compression should be removed from the spec (possibly replaced with another enum value for a properly-specified, interoperable, LZ4-backed compression scheme).
> > > >
> > > > About LZ4
> > > > ---------
> > > >
> > > > LZ4 is an extremely fast compression format and library. Decompression speeds of 4 GB/s can routinely be achieved in pure software, with compression speeds around 800 MB/s. Compression ratios are close to those achieved by Snappy and LZO, but at vastly higher speeds.
> > > >
> > > > Two formats are officially defined by the LZ4 project: the block format and the frame format. The frame format, as the name suggests, is higher-level and contains all the information required to decompress arbitrary data buffers. The block format is to be used when such information is made available separately (for example as part of ancillary metadata, protocol headers, etc.).
> > > >
> > > > Core issue
> > > > ----------
> > > >
> > > > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which seems to be the point of reference to look to when implementing the Parquet format) uses a home-baked framing around LZ4 block compression that's not specified anywhere. The only way to get information about it is to read the Hadoop source code.
> > > >
> > > > Being unspecified, it also doesn't say whether there are additional constraints on the parameters (e.g. frame size). Such constraints will be implementation-defined, undocumented, and only discoverable through tedious testing and iteration, with no guarantee of ever achieving 100% compatibility.
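[Editorial note: for concreteness, here is a rough sketch of that home-baked framing as it has been reverse-engineered from Hadoop's block compression streams and the PARQUET-1878 discussion: each chunk is preceded by a 4-byte big-endian uncompressed length and a 4-byte big-endian compressed length, repeated per chunk. The helper names and the ~128 KiB chunking are my reading of the situation, not an official spec, and since LZ4 is not in the Python standard library, zlib stands in for the LZ4 block codec.]

```python
import struct
import zlib


def write_hadoop_frames(data, compress, chunk_size=128 * 1024):
    """Write `data` using the undocumented Hadoop-style block framing:
    for each chunk, a 4-byte big-endian uncompressed length, then a
    4-byte big-endian compressed length, then the compressed bytes.
    Layout and chunk size are assumptions inferred from Hadoop sources,
    not a published specification."""
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        comp = compress(chunk)
        out += struct.pack(">I", len(chunk))   # uncompressed length
        out += struct.pack(">I", len(comp))    # compressed length
        out += comp
    return bytes(out)


def read_hadoop_frames(buf, decompress):
    """Inverse of write_hadoop_frames: keep consuming frames until the
    buffer is exhausted. A single Parquet data page may contain several
    frames, which is exactly the assumption ARROW-11301 had to fix."""
    out = bytearray()
    pos = 0
    while pos < len(buf):
        raw_len, comp_len = struct.unpack_from(">II", buf, pos)
        pos += 8
        chunk = decompress(buf[pos:pos + comp_len])
        assert len(chunk) == raw_len  # sanity check against the header
        out += chunk
        pos += comp_len
    return bytes(out)


# zlib stands in for the LZ4 block codec (not in the stdlib).
payload = b"abc" * 100_000  # > 128 KiB, so several frames are written
framed = write_hadoop_frames(payload, zlib.compress)
assert read_hadoop_frames(framed, zlib.decompress) == payload
```

Note that nothing in the framed bytes identifies the format; a reader can only guess, which is the root of the compatibility problems described below.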
> > > > Note that LZ4 compression itself officially comes in two formats: a low-level block format and a higher-level framed format. But parquet-mr uses a third, custom framing format that's not part of the LZ4 project at all.
> > > >
> > > > History of compatibility issues
> > > > -------------------------------
> > > >
> > > > 1)
> > > >
> > > > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the Parquet spec meant the LZ4 block format. After all, information about data size is already available in the Parquet metadata, so no specific framing is theoretically required around the compressed data. However, it quickly became apparent that this interpretation was incompatible with files produced by parquet-mr (and vice-versa: files produced by parquet-cpp could not be read with parquet-mr).
> > > >
> > > > A first issue was posted to suggest switching the Parquet spec (and parquet-mr) to the LZ4 framed format: https://issues.apache.org/jira/browse/PARQUET-1241
> > > >
> > > > However, this didn't come to a resolution, because people didn't want to break compatibility with previous versions of parquet-mr (note that this concern necessarily shifts the burden of compatibility breakage onto other implementations).
> > > >
> > > > Relatedly, there is an issue open for Hadoop, which also didn't get a resolution: https://issues.apache.org/jira/browse/HADOOP-12990
> > > >
> > > > 2)
> > > >
> > > > To avoid making things worse, parquet-cpp developers then decided to (temporarily?)
disable writing LZ4 files from C++: https://issues.apache.org/jira/browse/ARROW-9424
> > > >
> > > > (Note that this is parquet-cpp deliberately crippling its own feature set in order to work around an issue created by an undocumented format implemented in parquet-mr.)
> > > >
> > > > At this point, though, the LZ4 *reader* in parquet-cpp could still not read files produced by parquet-mr. So, besides being frustrating for C++ users (and users of bindings to the C++ library, e.g. Python, Ruby...), this decision also did not improve interoperability in the Java -> C++ direction.
> > > >
> > > > 3)
> > > >
> > > > parquet-cpp developers decided to implement format detection so as to read LZ4 files produced by parquet-mr, but also LZ4 files produced by previous versions of parquet-cpp: https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > In addition, the write path was re-enabled, this time producing files that (should!) conform to the parquet-mr implementation of LZ4 compression.
> > > >
> > > > This seemed to work fine.
> > > >
> > > > 4)
> > > >
> > > > While PARQUET-1878 (see above) seemed to work fine in basic tests, it was later reported that some files did not read properly.
> > > >
> > > > Indeed, our (parquet-cpp) compatibility code assumed that only a single "parquet-mr LZ4 frame" was ever written out for a single data page. However, it turns out that parquet-mr can produce several such frames for larger data pages (it seems to frame at around 128 KiB boundaries).
> > > >
> > > > While this is of course reasonable, the format being undocumented prevented us from foreseeing this situation. Once the issue was diagnosed, we (parquet-cpp developers) pushed a fix for it: https://issues.apache.org/jira/browse/ARROW-11301
> > > >
> > > > 5)
> > > >
> > > > We were happy after this latter fix...
only for a short time. Now the reverse problem happened: some LZ4 files produced by parquet-cpp cannot be read by parquet-mr: https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> > > >
> > > > I decided that this couldn't be a parquet-cpp issue anymore. Evidently, the fact that parquet-mr uses an unspecified framing format with unspecified constraints and limits prevents other implementations from being compatible.
> > > >
> > > > So I asked the reporter to open a parquet-mr issue instead, and then took the liberty of marking the issue as a blocker: https://issues.apache.org/jira/browse/PARQUET-1974
> > > >
> > > > 6)
> > > >
> > > > Unsurprisingly, other Parquet implementations are also suffering from this.
> > > >
> > > > * fastparquet (a Parquet implementation written partly in Python) had initially used the LZ4 block format, then switched to the LZ4 frame format to achieve compatibility with parquet-cpp (since both are used in the Python ecosystem): https://github.com/dask/fastparquet/issues/314
> > > >
> > > > Of course, fastparquet probably won't be able to read or write files compatible with parquet-mr...
> > > >
> > > > * The Rust Parquet implementation that's part of the Rust Arrow implementation is also having compatibility issues due to making the wrong guess: https://issues.apache.org/jira/browse/ARROW-11381
> > > >
> > > > * Impala seems to have had to change its LZ4 implementation to also match the undocumented parquet-mr framing format: https://issues.apache.org/jira/browse/IMPALA-8617
> > > >
> > > > (I have no idea whether this solved all LZ4 problems for Impala, or whether it broke compatibility with previous Impala versions.)
> > > >
> > > > Possible solutions
> > > > ------------------
> > > >
> > > > 1) Do nothing.
The Parquet ecosystem remains forever fragmented, even though the official Parquet documentation doesn't say so.
> > > >
> > > > 2) parquet-cpp finally achieves compatibility through additional reverse-engineering efforts. While I have no authority to prevent this, I also have zero personal motivation to pursue it anymore. I suspect other parquet-cpp developers are not significantly more motivated than I am (but I may be wrong).
> > > >
> > > > 3) parquet-mr takes its share of the effort by striving to be compatible with files produced by (the current iteration of) parquet-cpp. I have no idea whether this is easily doable, or whether there is any motivation or workforce to do it. Therefore, I will conservatively rate this as "unlikely in the short term".
> > > >
> > > > 4) Admitting that LZ4 compatibility is currently not an achievable goal, the LZ4 compression option is deprecated and removed from the Parquet spec. Implementations display a warning message when it is being read. Writing LZ4 files is disabled.
> > > >
> > > > Discussion
> > > > ----------
> > > >
> > > > Solution #1 seems evidently undesirable to me (though it also seems that nobody on the Java side was tremendously impatient to solve this issue).
> > > >
> > > > As I said, solution #2 is rather unlikely.
> > > >
> > > > Java developers would have to evaluate how likely solution #3 is. We should avoid depending on promises that nobody is really willing to keep, though. Please only say "we're going to solve this" if you're really willing to push it through.
> > > >
> > > > In any case, even if solution #2 or #3 were implemented, it would not necessarily help future implementations, as long as the format is still unspecified.
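[Editorial note: to illustrate the kind of guesswork that compatibility-through-reverse-engineering entails, the detection heuristic parquet-cpp resorted to in PARQUET-1878 can be sketched roughly as below. This is a hedged reconstruction, not the actual parquet-cpp code: the helper names are invented, zlib stands in for the LZ4 block codec, and the real heuristic differs in detail.]

```python
import struct
import zlib


def looks_like_hadoop_framing(buf, expected_raw_len):
    """Heuristic: in the Hadoop-style framing, the first 4 bytes are a
    big-endian uncompressed length and the next 4 a compressed length.
    If they are consistent with the page's known decompressed size,
    guess Hadoop framing; otherwise guess a raw compressed block.
    This mirrors the kind of guess PARQUET-1878 had to make; on some
    inputs it can still guess wrong, which is exactly why detection is
    no substitute for a specification."""
    if len(buf) < 8:
        return False
    raw_len, comp_len = struct.unpack_from(">II", buf, 0)
    return raw_len <= expected_raw_len and comp_len <= len(buf) - 8


def decompress_page(buf, expected_raw_len, decompress):
    """Decompress one data page, accepting either framing."""
    if looks_like_hadoop_framing(buf, expected_raw_len):
        out = bytearray()
        pos = 0
        while pos < len(buf):  # a page may hold several frames
            raw_len, comp_len = struct.unpack_from(">II", buf, pos)
            pos += 8
            chunk = decompress(buf[pos:pos + comp_len])
            assert len(chunk) == raw_len
            out += chunk
            pos += comp_len
        return bytes(out)
    # Fall back: treat the whole page as a single raw compressed block.
    return decompress(buf)
```

That a reader must probe bytes and hope is the strongest argument for deprecating the current enum value rather than piling on more heuristics.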
> > > > In the end, while a bit frustrating for everyone, solution #4 looks like the only reasonable one for the time being.
> > > >
> > > > Future LZ4 format
> > > > -----------------
> > > >
> > > > While we should deprecate the current LZ4 format, this doesn't preclude adding another, this time well-specified, LZ4 format in the future. It would simply have to use a new `CompressionCodec` enum value in the Thrift definition.
> > > >
> > > > Most certainly, it should be one of the two formats officially defined by the LZ4 project (the block format or the frame format).
> > > >
> > > > Regards
> > > >
> > > > Antoine.