On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>
> On Fri, 12 Jul 2019 20:37:15 -0700
> Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > If the latter, I wonder why Parquet cannot simply be used instead of
> > > reinventing something similar but different.
> >
> > This is a reasonable point.  However, there is a continuum here between
> > file size and read/write times.  Parquet will likely always be the smallest
> > with the largest times to convert to and from Arrow.  An uncompressed
> > Feather/Arrow file will likely always take the most space but will have
> > much faster conversion times.
>
> I'm curious whether the Parquet conversion times are inherent to the
> Parquet format or due to inefficiencies in the implementation.
>

Parquet is fundamentally more complex to decode. Consider the several
layers of logic that must be applied before values end up in the right
place (a rough sketch of the last two steps follows the list):

* Data pages are usually compressed, and a column consists of many
data pages, each with a Thrift header that must be deserialized
* Values are usually dictionary-encoded, and the dictionary indices are
encoded using a hybrid bit-packed / RLE scheme
* Null/not-null is encoded in definition levels
* Only non-null values are stored, so when decoding to Arrow, values
have to be "moved into place"

The current C++ implementation could certainly be made faster. One
consideration in Parquet's favor is that the files are much smaller, so
when you are reading them over the network the end-to-end time,
including IO and deserialization, will frequently come out ahead.
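
As a rough way to observe this trade-off yourself, here is a quick
pyarrow sketch. The file names and toy table are made up, and it
assumes a reasonably recent pyarrow where feather.write_feather accepts
a Table; actual numbers depend heavily on the data and where it lives.

    # Rough, illustrative comparison of file size and local read time for
    # the same table written as Parquet and as uncompressed Feather/Arrow.
    import os
    import time

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # A toy table; real results depend on the data and its redundancy.
    table = pa.Table.from_pydict({
        "ints": list(range(1_000_000)),
        "strs": [str(i % 1000) for i in range(1_000_000)],
    })

    pq.write_table(table, "data.parquet")
    feather.write_feather(table, "data.feather")  # recent pyarrow accepts a Table

    for path, reader in [("data.parquet", pq.read_table),
                         ("data.feather", feather.read_table)]:
        start = time.perf_counter()
        reader(path)
        elapsed = time.perf_counter() - start
        print(path, os.path.getsize(path), "bytes,", f"{elapsed:.3f}s to read")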

> Regards
>
> Antoine.
>
>
