On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>
> On Fri, 12 Jul 2019 20:37:15 -0700
> Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > reinventing something similar but different.
> >
> > This is a reasonable point. However, there is a continuum here between
> > file size and read and write times. Parquet will likely always be the
> > smallest, with the largest times to convert to and from Arrow. An
> > uncompressed Feather/Arrow file will likely always take the most space
> > but will have much faster conversion times.
>
> I'm curious whether the Parquet conversion times are inherent to the
> Parquet format or due to inefficiencies in the implementation.
>
Parquet is fundamentally more complex to decode. Consider several layers
of logic that must happen for values to end up in the right place:

* Data pages are usually compressed, and a column consists of many data
  pages, each having a Thrift header that must be deserialized
* Values are usually dictionary-encoded, and the dictionary indices are
  themselves encoded using a hybrid bit-packed / RLE scheme
* Null/not-null is encoded in definition levels
* Only non-null values are stored, so when decoding to Arrow, values have
  to be "moved into place"

The current C++ implementation could certainly be made faster. One
consideration with Parquet is that the files are much smaller, so when you
are reading them over the network, the effective end-to-end time,
including IO and deserialization, will frequently win. A rough pyarrow
sketch illustrating this tradeoff is appended at the bottom of this
message.

> Regards
>
> Antoine.
>
>
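To make the tradeoff a bit more concrete, here is a rough sketch using
pyarrow. This is a sketch only, not a benchmark: the "example.*" file
names, the toy schema, and the row count are made up, and it assumes a
pyarrow version that supports Feather V2 (i.e. the compression= option on
write_feather) plus numpy for the synthetic data. It writes the same table
as Parquet and as uncompressed Feather, compares file size and read time,
and then prints the Parquet column-chunk metadata to show the compression
and encoding layers described above:

    # Rough sketch (not a benchmark): writes the same toy table as Parquet
    # and as uncompressed Feather, then compares file size and read time.
    import os
    import time

    import numpy as np
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Toy table: a low-cardinality string column (dictionary-encodes well
    # in Parquet) and a float column with ~10% nulls.
    n = 5_000_000
    rng = np.random.default_rng(0)
    colors = np.array(["red", "green", "blue", "yellow"])
    table = pa.table({
        "color": colors[rng.integers(0, 4, n)],
        "value": pa.array(rng.standard_normal(n),
                          mask=rng.random(n) < 0.1),
    })

    pq.write_table(table, "example.parquet")
    feather.write_feather(table, "example.feather",
                          compression="uncompressed")

    for path, reader in [("example.parquet", pq.read_table),
                         ("example.feather", feather.read_table)]:
        start = time.perf_counter()
        reader(path)
        elapsed = time.perf_counter() - start
        print(f"{path}: {os.path.getsize(path) / 2**20:.1f} MiB, "
              f"read in {elapsed:.3f} s")

    # Peek at the layers described above for the string column's first
    # column chunk: compression codec, encodings (dictionary + RLE /
    # bit-packing), and compressed vs. uncompressed size.
    pf = pq.ParquetFile("example.parquet")
    col = pf.metadata.row_group(0).column(0)
    print(col.compression, col.encodings,
          col.total_compressed_size, col.total_uncompressed_size)

The expectation, per the discussion above, is that the Parquet file comes
out considerably smaller while the uncompressed Feather file reads back
faster; whether Parquet wins end-to-end then depends on how much of the
total time is spent on IO versus deserialization.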