Hi Antoine, I think Liya Fan raised some good points in his reply, but I'd like to answer your questions directly.
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?

I tried to separate the two concepts into Encodings (things Arrow can
operate on directly) and Compression (solely for transport). While there is
some overlap, I think the two features can be considered separately. For
each encoding there is additional implementation complexity to exploit it
properly. However, the benefit for some workloads can be large [1][2]. (A
toy illustration of operating directly on an encoded representation is
appended after the quoted message below.)

> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.

This is a reasonable point. However, there is a continuum here between file
size and read/write times. Parquet will likely always be the smallest, with
the largest times to convert to and from Arrow. An uncompressed
Feather/Arrow file will likely always take the most space but will have
much faster conversion times. The question is whether a buffer-level (or
some other sub-file-level) compression scheme provides enough value
compared with compressing the entire Feather file. This is somewhat
hand-wavy, but if we feel we might want to investigate this further I can
write some benchmarks to quantify the differences (a rough sketch of what
such a benchmark could look like is also appended below).

Cheers,
Micah

[1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf

On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou <[email protected]> wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>
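
To make the encoding-vs-compression distinction concrete, here is a toy
sketch in plain Python (not the actual Arrow APIs, and independent of the
proposed format changes) of what it means to operate directly on an
RLE-encoded column: an aggregation can run over the (value, run_length)
pairs without ever materializing the decoded array.

    def rle_encode(values):
        """Encode a sequence into [value, run_length] pairs."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def rle_sum(runs):
        """Sum the column directly on the encoded form, without decoding."""
        return sum(value * length for value, length in runs)

    column = [7] * 1000 + [3] * 500       # low-cardinality data encodes well
    runs = rle_encode(column)             # -> [[7, 1000], [3, 500]]
    assert rle_sum(runs) == sum(column)   # same answer from 2 runs vs 1500 values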

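And a rough sketch of the kind of benchmark mentioned above, assuming
pyarrow is installed. The generated table, the file names, and the use of
gzip for whole-file compression are illustrative choices only, not a
proposal; a real benchmark would use representative datasets and codecs,
and older pyarrow versions may require a pandas DataFrame (and lack the
`compression` keyword) in the Feather calls.

    import gzip
    import os
    import time

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    n = 1_000_000
    table = pa.Table.from_pydict({
        "ints": list(range(n)),
        "strs": ["low", "cardinality", "text", "column"] * (n // 4),
    })

    def timed(fn):
        """Return the wall-clock seconds taken by fn()."""
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # (a) Parquet: expected smallest files, most conversion work.
    parquet_secs = (timed(lambda: pq.write_table(table, "bench.parquet"))
                    + timed(lambda: pq.read_table("bench.parquet")))

    # (b) Uncompressed Feather/Arrow: expected largest files, cheapest conversion.
    feather_secs = (timed(lambda: feather.write_feather(
                        table, "bench.feather", compression="uncompressed"))
                    + timed(lambda: feather.read_table("bench.feather")))

    # (c) Whole-file compression of the Feather file: the transport-only
    # baseline that any buffer-level scheme would have to beat.
    with open("bench.feather", "rb") as src, \
            gzip.open("bench.feather.gz", "wb") as dst:
        dst.write(src.read())

    for path in ("bench.parquet", "bench.feather", "bench.feather.gz"):
        print(path, os.path.getsize(path), "bytes")
    print("parquet round trip:", parquet_secs, "s")
    print("feather round trip:", feather_secs, "s")

Once a buffer-level compression prototype exists, it could be added as a
fourth case to see where it falls on the size/speed continuum.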