Hi Antoine, I think Liya Fan raised some good points in his reply, but I'd like to answer your questions directly.
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?

I tried to separate the two concepts into Encodings (things Arrow can
operate on directly) and Compression (solely for transport). While there is
some overlap, I think the two features can be considered separately. For
each encoding there is additional implementation complexity to exploit it
properly. However, the benefit for some workloads can be large [1][2]. (A
toy illustration of operating directly on an encoded representation is
appended after the quoted message below.)

> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.

This is a reasonable point. However, there is a continuum here between file
size and read/write times. Parquet will likely always be the smallest, with
the largest times to convert to and from Arrow. An uncompressed
Feather/Arrow file will likely always take the most space but will have
much faster conversion times. The question is whether a buffer-level (or
some other sub-file-level) compression scheme provides enough value
compared with compressing the entire Feather file. This is somewhat
hand-wavy, but if we feel we might want to investigate this further I can
write some benchmarks to quantify the differences (a rough sketch of what
such a benchmark could look like is also appended below).

Cheers,
Micah

[1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf

On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou <[email protected]> wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>
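
To make the encoding-vs-compression distinction concrete, here is a toy
sketch in plain Python (not the actual Arrow APIs, and independent of the
proposed format changes) of what it means to operate directly on an
RLE-encoded column: an aggregation can run over the (value, run_length)
pairs without ever materializing the decoded array.

    def rle_encode(values):
        """Encode a sequence into [value, run_length] pairs."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def rle_sum(runs):
        """Sum the column directly on the encoded form, without decoding."""
        return sum(value * length for value, length in runs)

    column = [7] * 1000 + [3] * 500       # low-cardinality data encodes well
    runs = rle_encode(column)             # -> [[7, 1000], [3, 500]]
    assert rle_sum(runs) == sum(column)   # same answer from 2 runs vs 1500 values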

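And a rough sketch of the kind of benchmark mentioned above, assuming
pyarrow is installed. The generated table, the file names, and the use of
gzip for whole-file compression are illustrative choices only, not a
proposal; a real benchmark would use representative datasets and codecs,
and older pyarrow versions may require a pandas DataFrame (and lack the
`compression` keyword) in the Feather calls.

    import gzip
    import os
    import time

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    n = 1_000_000
    table = pa.Table.from_pydict({
        "ints": list(range(n)),
        "strs": ["low", "cardinality", "text", "column"] * (n // 4),
    })

    def timed(fn):
        """Return the wall-clock seconds taken by fn()."""
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # (a) Parquet: expected smallest files, most conversion work.
    parquet_secs = (timed(lambda: pq.write_table(table, "bench.parquet"))
                    + timed(lambda: pq.read_table("bench.parquet")))

    # (b) Uncompressed Feather/Arrow: expected largest files, cheapest conversion.
    feather_secs = (timed(lambda: feather.write_feather(
                        table, "bench.feather", compression="uncompressed"))
                    + timed(lambda: feather.read_table("bench.feather")))

    # (c) Whole-file compression of the Feather file: the transport-only
    # baseline that any buffer-level scheme would have to beat.
    with open("bench.feather", "rb") as src, \
            gzip.open("bench.feather.gz", "wb") as dst:
        dst.write(src.read())

    for path in ("bench.parquet", "bench.feather", "bench.feather.gz"):
        print(path, os.path.getsize(path), "bytes")
    print("parquet round trip:", parquet_secs, "s")
    print("feather round trip:", feather_secs, "s")

Once a buffer-level compression prototype exists, it could be added as a
fourth case to see where it falls on the size/speed continuum.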