Hi Martin,

Thanks for the thorough investigation, very nice report. I have a few questions:
- Is the data pre-loaded into RAM before the measurements are taken?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or after decompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of bs_zstd is almost twice as fast as plain zstd. Do you know what causes this massive speed improvement? Based on the description in section 3.2, bs_zstd uses the same zstd compression with an extra step of splitting/combining streams (my understanding of that step is sketched after this list). Since this is extra work, I would have expected bs_zstd to be slower than pure zstd, unless the compressed data becomes so much smaller that it radically improves data access times. However, according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23% better compression than plain zstd; even if speed is computed over the uncompressed size, reading 23% less compressed data would explain a speed-up of roughly 1.2x, not 2x.

- If existing libraries are considered for providing any of the compression algorithms, license compatibility is also an important factor and would therefore be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since floating-point values and calculations have an inherent inaccuracy (e.g., 0.1 + 0.2 != 0.3 in binary floating point), the DECIMAL type is much more important for storing financial data, which is one of the main use cases of Parquet.
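For reference, here is a minimal sketch (in C++, with names of my own invention, not taken from the report) of how I understand the splitting step described in section 3.2: the bytes of each value are scattered into one stream per byte position, so that bytes with similar statistics (sign/exponent bytes vs. mantissa bytes) become adjacent before zstd runs.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Scatter each 4-byte float into four byte streams, one per byte
    // position. The returned buffer is then compressed with zstd as usual.
    std::vector<uint8_t> ByteStreamSplit(const float* values, size_t count) {
      std::vector<uint8_t> streams(count * sizeof(float));
      for (size_t i = 0; i < count; ++i) {
        uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));
        for (size_t b = 0; b < sizeof(float); ++b) {
          // Stream b collects byte b of every value.
          streams[b * count + i] = bytes[b];
        }
      }
      return streams;
    }

Decoding would gather the bytes back the same way, which is why I would expect the transform to add work on both the compression and the decompression path rather than remove it.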
Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <martin.ra...@tum.de> wrote:
>
> Hello folks,
>
> thank you for your input.
>
> I am finished with my investigation regarding introducing special
> support for FP compression in Apache Parquet. My report also includes
> an investigation of lossy compressors, though there are still some
> things to be cleared up.
>
> Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> Sections 3, 4, 5 and 6 are the most important to go over.
>
> Let me know if you have any questions or concerns.
>
> Regards,
>
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <z...@cloudera.com.INVALID>
> Sent: Thursday, June 13, 2019 2:16:56 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for your interest in improving Parquet. Efficient encodings are
> really important in a big data file format, so this topic is
> definitely worth researching, and personally I am looking forward to
> your report. Whether to add any new encodings to Parquet, however,
> cannot be answered until we see the results of your findings.
>
> You mention two paths. One has very small computational overhead but
> does not provide significant space savings. The other provides
> significant space savings, but at the price of a significant
> computational overhead. While purely based on these properties both
> of them seem "balanced" (one is small effort, small gain; the other
> is large effort, large gain) and therefore sound like reasonable
> options, I would argue that one should also consider development
> costs, code complexity and compatibility implications when deciding
> whether a new feature is worth implementing.
>
> Adding a new encoding or compression to Parquet complicates the
> specification of the file format and requires implementing it in
> every language binding of the format, which is not only a
> considerable effort but also error-prone (see LZ4 for an example,
> which was added to both the Java and the C++ implementations of
> Parquet, yet they are incompatible with each other). And lack of
> support is not only a minor annoyance in this case: if one is forced
> to use an older reader that does not support the new encoding yet (or
> a language binding that does not support it at all), the data simply
> cannot be read.
>
> In my opinion, no matter how low the computational overhead of a new
> encoding is, if it does not provide significant gains, then the
> specification clutter, implementation costs and the potential for
> compatibility problems greatly outweigh its advantages. For this
> reason, I would say that only encodings that provide significant
> gains are worth adding. As far as I am concerned, such a new encoding
> would be a welcome addition to Parquet.
>
> Thanks,
>
> Zoltan
>
> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <martin.ra...@tum.de> wrote:
> >
> > Dear all,
> >
> > thank you for your work on the Apache Parquet format.
> >
> > We are a group of students at the Technical University of Munich
> > who would like to extend the available compression and encoding
> > options for 32-bit and 64-bit floating-point data in Apache
> > Parquet. The current encodings and compression algorithms offered
> > in Apache Parquet are heavily specialized towards integer and text
> > data. This leaves an opportunity to reduce both I/O throughput and
> > space requirements for floating-point data by selecting a
> > specialized compression algorithm.
> >
> > Currently, I am investigating the available literature and publicly
> > available FP compressors. As part of this investigation I am
> > writing a report on my findings: the available algorithms, their
> > strengths and weaknesses, compression ratios, compression and
> > decompression speeds, and licenses. Once finished, I will share the
> > report with you and propose which ones are, in my opinion, good
> > candidates for Apache Parquet.
> >
> > The goal is to add a solution for both 32-bit and 64-bit FP types.
> > I think it would be beneficial to offer at the very least two
> > distinct paths. The first should offer fast compression and
> > decompression with some, but not significant, savings in space. The
> > second should offer slower compression and decompression but a
> > decent compression ratio. Both would be lossless. A lossy path will
> > be investigated further and discussed with the community.
> >
> > If I get approval from you, the developers, I can continue with
> > adding support for the new encoding/compression options in the C++
> > implementation of Apache Parquet in Apache Arrow.
> >
> > Please let me know what you think of this idea and whether you have
> > any concerns with the plan.
> >
> > Best regards,
> > Martin Radev