Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2020-01-23 Thread John Muehlhausen
Perhaps related to this thread: are there any current or proposed tools to
transform columns of fixed-length data types according to a "shuffle"?
For precedent, see the implementation of the shuffle filter in HDF5.
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf

For example, a column of length 3 would store the bytes 00 00 00 00 00 00 00
00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01, 00 00 00
02, 00 00 00 03 (I'm writing big-endian even if that is not actually the
case).

Value(1) would return 00 00 00 02 by referring to a metadata flag indicating
that the column is shuffled and stitching the bytes back together at call time.
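
A minimal sketch of the idea in plain C++ (no Arrow APIs; the fixed 4-byte
width and the 0-based Value() indexing follow the example above, everything
else is my own illustration):

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Byte-shuffle a column of fixed-width values: all first bytes, then all
  // second bytes, and so on.  This improves run lengths for a downstream
  // compressor such as the filesystem's.
  std::vector<uint8_t> Shuffle(const uint8_t* data, size_t num_values,
                               size_t width) {
    std::vector<uint8_t> out(num_values * width);
    for (size_t i = 0; i < num_values; ++i)
      for (size_t b = 0; b < width; ++b)
        out[b * num_values + i] = data[i * width + b];
    return out;
  }

  // Value(i): stitch one 32-bit element back together from the shuffled
  // buffer at call time.
  uint32_t ValueAt(const std::vector<uint8_t>& shuffled, size_t num_values,
                   size_t i) {
    uint8_t bytes[4];
    for (size_t b = 0; b < 4; ++b) bytes[b] = shuffled[b * num_values + i];
    uint32_t v;
    std::memcpy(&v, bytes, 4);  // reassembled in the column's native byte order
    return v;
  }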

Thus if the column pages were backed by a memory map to something like
zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
underlying disk usage due to better run lengths.

This would enable a space/time tradeoff that could be useful.  The filesystem
itself cannot easily perform this particular transform, since it benefits
from knowing the shape of the data.

-John

On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield 
wrote:
>
> Hi Ippokratis,
> Thank you for the feedback, I have some questions based on the links you
> provided.
>
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
>
>
> This paper seems to suggest more complex encodings than I was imagining for
> the first implementation.  Specifically, I proposed using only codes that
> are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have
> the dense bit-packing in an initial version?
>
> >
> > I would actually also look into some order-preserving dictionary encodings
> > for strings that also allow vectorized processing (predicates, joins, ..)
> > on encoded data, e.g. see
> > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
>
> The IPC spec [1] already has some metadata about the ordering of
> dictionaries, but this might not be sufficient.  The paper linked here
> seems to recommend two things:
> 1.  Treating dictionaries as explicit mappings between value and integer
> code; today this is implicit because the dictionaries are lists indexed by
> code.  It seems like for forward compatibility we should add a type enum to
> the Dictionary Encoding metadata.
> 2.  Adding indexes to the dictionaries.  For this, did you imagine the
> indexes would be transferred or something built up on receiving batches?
>
> > Arrow can be used during shuffles for distributed joins/aggs and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> The main take-away I got after skimming this paper, as it relates to
> encodings, is that encodings (including dictionary) should be dynamic per
> batch.  The other interesting question it raises with respect to Arrow is
> that one of the techniques used is delta encoding.  I believe delta encoding
> requires linear-time access.  The dense representations in Arrow were
> designed to have constant-time access to elements. One open question is how
> far we want to relax this requirement for encoded columns.  My proposal
> uses a form of RLE that provides O(log(N)) access.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285
>
> On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis 
> wrote:
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
> >
> > I would actually also look into some order-preserving dictionary encodings
> > for strings that also allow vectorized processing (predicates, joins, ..)
> > on encoded data, e.g. see
> > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
> >
> > Arrow can be used during shuffles for distributed joins/aggs and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
> >
> > Thanks,
> > -Ippokratis.
> >
> >
> > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield 
> > wrote:
> >
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes it
> >> > easy and non-intimidating to build data processing utilities around
> >> > them.  If we start adding sophisticated encodings, we also raise the
> >> > cost of supporting Arrow for third-party libraries.
> >>
> >>
> >> This is another legitimate concern about complexity.
> >>
> >> To try to limit complexity, I simplified the proposal PR [1] to only have
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-08-25 Thread Micah Kornfield
Hi Ippokratis,
Thank you for the feedback, I have some questions based on the links you
provided.


> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf


This paper seems to suggest more complex encodings than I was imagining for
the first implementation.  Specifically, I proposed using only codes that
are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have
the dense bit-packing in an initial version?
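
For concreteness, a minimal sketch of what I mean by frame-of-reference
restricted to 2^N-bit widths (plain C++; the names and the int64 input type
are mine, not part of the proposal):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Frame-of-reference: store the minimum once, then each value's offset
  // from it in the narrowest power-of-two width (8/16/32/64 bits) that fits.
  struct FORInt64 {
    int64_t reference;            // the frame (minimum value)
    int bit_width;                // 8, 16, 32, or 64
    std::vector<uint8_t> deltas;  // little-endian packed offsets
  };

  FORInt64 Encode(const std::vector<int64_t>& values) {  // assumes non-empty
    FORInt64 out;
    out.reference = *std::min_element(values.begin(), values.end());
    uint64_t max_delta = 0;
    for (int64_t v : values)
      max_delta = std::max(max_delta, uint64_t(v - out.reference));
    out.bit_width = max_delta <= UINT8_MAX ? 8 : max_delta <= UINT16_MAX ? 16
                  : max_delta <= UINT32_MAX ? 32 : 64;
    const int nbytes = out.bit_width / 8;
    out.deltas.resize(values.size() * nbytes);
    for (size_t i = 0; i < values.size(); ++i) {
      uint64_t d = uint64_t(values[i] - out.reference);
      for (int b = 0; b < nbytes; ++b)
        out.deltas[i * nbytes + b] = uint8_t(d >> (8 * b));
    }
    return out;
  }

  // Random access stays O(1) on the encoded form.
  int64_t ValueAt(const FORInt64& e, size_t i) {
    const int nbytes = e.bit_width / 8;
    uint64_t d = 0;
    for (int b = 0; b < nbytes; ++b)
      d |= uint64_t(e.deltas[i * nbytes + b]) << (8 * b);
    return e.reference + int64_t(d);
  }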

>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .

The IPC spec [1] already has some metadata about the ordering of
dictionaries, but this might not be sufficient.  The paper linked here
seems to recommend two things:
1.  Treating dictionaries as explicit mappings between value and integer
code; today this is implicit because the dictionaries are lists indexed by
code.  It seems like for forward compatibility we should add a type enum to
the Dictionary Encoding metadata.
2.  Adding indexes to the dictionaries.  For this, did you imagine the
indexes would be transferred or something built up on receiving batches?
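
As I read it, the main payoff of an order-preserving dictionary is that
comparisons can be evaluated on the integer codes themselves; a tiny sketch
of that (my own illustration, not something from the paper or the spec):

  #include <algorithm>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Order-preserving dictionary: the dictionary is sorted, so code order
  // equals value order and a string predicate becomes an integer predicate.
  struct OrderedDictionary {
    std::vector<std::string> values;  // sorted ascending; code i -> values[i]

    // Smallest code whose dictionary value is >= s.
    int32_t LowerBoundCode(const std::string& s) const {
      return int32_t(std::lower_bound(values.begin(), values.end(), s) -
                     values.begin());
    }
  };

  // Evaluate `column < s` directly on the codes, without decoding strings.
  std::vector<bool> LessThan(const OrderedDictionary& dict,
                             const std::vector<int32_t>& codes,
                             const std::string& s) {
    const int32_t bound = dict.LowerBoundCode(s);
    std::vector<bool> result(codes.size());
    for (size_t i = 0; i < codes.size(); ++i) result[i] = codes[i] < bound;
    return result;
  }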

> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

The main take-away I got after skimming this paper, as it relates to
encodings, is that encodings (including dictionary) should be dynamic per
batch.  The other interesting question it raises with respect to Arrow is
that one of the techniques used is delta encoding.  I believe delta encoding
requires linear-time access.  The dense representations in Arrow were
designed to have constant-time access to elements. One open question is how
far we want to relax this requirement for encoded columns.  My proposal
uses a form of RLE that provides O(log(N)) access.
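
To make the last point concrete, a minimal sketch of the kind of RLE I have
in mind, with cumulative run ends so that random access is a binary search
(plain C++; the field names are illustrative, not the proposal's exact layout):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Run-length encoding with cumulative run ends: run_ends[i] is one past
  // the last logical index covered by run i.
  struct RleInt64 {
    std::vector<int64_t> run_values;
    std::vector<int64_t> run_ends;  // strictly increasing; back() == length
  };

  RleInt64 Encode(const std::vector<int64_t>& values) {
    RleInt64 out;
    for (size_t i = 0; i < values.size(); ++i) {
      if (out.run_values.empty() || values[i] != out.run_values.back()) {
        out.run_values.push_back(values[i]);
        out.run_ends.push_back(int64_t(i) + 1);
      } else {
        out.run_ends.back() = int64_t(i) + 1;
      }
    }
    return out;
  }

  // O(log N) access: find the first run whose end exceeds the logical index.
  int64_t ValueAt(const RleInt64& rle, int64_t i) {
    auto it = std::upper_bound(rle.run_ends.begin(), rle.run_ends.end(), i);
    return rle.run_values[it - rle.run_ends.begin()];
  }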

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285

On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis 
wrote:

> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf
>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .
>
> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> Thanks,
> -Ippokratis.
>
>
> On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield 
> wrote:
>
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.
>>
>>
>> This is another legitimate concern about complexity.
>>
>> To try to limit complexity, I simplified the proposal PR [1] to only have 1
>> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding
>> scheme (RLE) that I think will have the most benefit if exploited
>> properly.  Compression is removed.
>>
>> I'd like to get closure on the proposal one way or another.  I think now
>> the question to be answered is whether we are willing to introduce the
>> additional complexity for the performance improvements they can yield.  Is
>> there more data that people would like to see that would influence their
>> decision?
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/4815
>>
>> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou 
>> wrote:
>>
>> > On Mon, 22 Jul 2019 08:40:08 -0700
>> > Brian Hulette  wrote:
>> > > To me, the most important aspect of this proposal is the addition of
>> > > sparse
>> > > encodings, and I'm curious if there are any more objections to that
>> > > specifically. So far I believe the only one is that it will make
>> > > computation libraries more complicated. This is absolutely true, but I
>> > > think it's worth that cost.
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-08-25 Thread Ippokratis Pandis
I think that lightweight encodings (like the FrameOfReference Micah
suggests) do make a lot of sense for Arrow. There are a few implementations
of those in commercial systems. One related paper in the literature is
http://www.cs.columbia.edu/~orestis/damon15.pdf

I would actually also look into some order-preserving dictionary encodings
for strings that also allow vectorized processing (predicates, joins, ..)
on encoded data, e.g. see
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
 .

Arrow can be used during shuffles for distributed joins/aggs and being
able to operate on encoded data yields benefits (e.g.
http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

Thanks,
-Ippokratis.


On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield 
wrote:

> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data.  Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them.  If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
>
>
> This is another legitimate concern about complexity.
>
> To try to limit complexity, I simplified the proposal PR [1] to only have 1
> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding
> scheme (RLE) that I think will have the most benefit if exploited
> properly.  Compression is removed.
>
> I'd like to get closure on the proposal one way or another.  I think now
> the question to be answered is whether we are willing to introduce the
> additional complexity for the performance improvements they can yield.  Is
> there more data that people would like to see that would influence their
> decision?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4815
>
> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou 
> wrote:
>
> > On Mon, 22 Jul 2019 08:40:08 -0700
> > Brian Hulette  wrote:
> > > To me, the most important aspect of this proposal is the addition of
> > > sparse
> > > encodings, and I'm curious if there are any more objections to that
> > > specifically. So far I believe the only one is that it will make
> > > computation libraries more complicated. This is absolutely true, but I
> > > think it's worth that cost.
> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data.  Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them.  If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>


-- 
-Ippokratis.


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-25 Thread Micah Kornfield
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.


This is another legitimate concern about complexity.

To try to limit complexity, I simplified the proposal PR [1] to only have 1
buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding
scheme (RLE) that I think will have the most benefit if exploited
properly.  Compression is removed.

I'd like to get closure on the proposal one way or another.  I think now
the question to be answered is whether we are willing to introduce the
additional complexity for the performance improvements they can yield.  Is
there more data that people would like to see that would influence their
decision?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou  wrote:

> On Mon, 22 Jul 2019 08:40:08 -0700
> Brian Hulette  wrote:
> > To me, the most important aspect of this proposal is the addition of
> > sparse
> > encodings, and I'm curious if there are any more objections to that
> > specifically. So far I believe the only one is that it will make
> > computation libraries more complicated. This is absolutely true, but I
> > think it's worth that cost.
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.
>
> Regards
>
> Antoine.
>
>
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Antoine Pitrou
On Mon, 22 Jul 2019 08:40:08 -0700
Brian Hulette  wrote:
> To me, the most important aspect of this proposal is the addition of sparse
> encodings, and I'm curious if there are any more objections to that
> specifically. So far I believe the only one is that it will make
> computation libraries more complicated. This is absolutely true, but I
> think it's worth that cost.

It's not just computation libraries, it's any library peeking inside
Arrow data.  Currently, the Arrow data types are simple, which makes it
easy and non-intimidating to build data processing utilities around
them.  If we start adding sophisticated encodings, we also raise the
cost of supporting Arrow for third-party libraries.

Regards

Antoine.




Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Brian Hulette
To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious if there are any more objections to that
specifically. So far I believe the only one is that it will make
computation libraries more complicated. This is absolutely true, but I
think it's worth that cost.

It's been suggested on this list and elsewhere [1] that sparse encodings
that can be operated on without fully decompressing should be added to the
Arrow format. The longer we continue to develop computation libraries
without considering those schemes, the harder it will be to add them.

[1]
https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html


On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney  wrote:

> On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou 
> wrote:
> >
> > On Fri, 12 Jul 2019 20:37:15 -0700
> > Micah Kornfield  wrote:
> > >
> > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > > reinventing something similar but different.
> > >
> > > This is a reasonable point.  However there is a continuum here between file
> > > size and read and write times.  Parquet will likely always be the smallest
> > > with the largest times to convert to and from Arrow.  An uncompressed
> > > Feather/Arrow file will likely always take the most space but will have much
> > > faster conversion times.
> >
> > I'm curious whether the Parquet conversion times are inherent to the
> > Parquet format or due to inefficiencies in the implementation.
> >
>
> Parquet is fundamentally more complex to decode. Consider several
> layers of logic that must happen for values to end up in the right
> place
>
> * Data pages are usually compressed, and a column consists of many
> data pages each having a Thrift header that must be deserialized
> * Values are usually dictionary-encoded, dictionary indices are
> encoded using hybrid bit-packed / RLE scheme
> * Null/not-null is encoded in definition levels
> * Only non-null values are stored, so when decoding to Arrow, values
> have to be "moved into place"
>
> The current C++ implementation could certainly be made faster. One
> consideration with Parquet is that the files are much smaller, so when
> you are reading them over the network the effective end-to-end time
> including IO and deserialization will frequently win.
>
> > Regards
> >
> > Antoine.
> >
> >
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Wes McKinney
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou  wrote:
>
> On Fri, 12 Jul 2019 20:37:15 -0700
> Micah Kornfield  wrote:
> >
> > If the latter, I wonder why Parquet cannot simply be used instead of
> > > reinventing something similar but different.
> >
> > This is a reasonable point.  However there is a continuum here between file
> > size and read and write times.  Parquet will likely always be the smallest
> > with the largest times to convert to and from Arrow.  An uncompressed
> > Feather/Arrow file will likely always take the most space but will have much
> > faster conversion times.
>
> I'm curious whether the Parquet conversion times are inherent to the
> Parquet format or due to inefficiencies in the implementation.
>

Parquet is fundamentally more complex to decode. Consider several
layers of logic that must happen for values to end up in the right
place

* Data pages are usually compressed, and a column consists of many
data pages each having a Thrift header that must be deserialized
* Values are usually dictionary-encoded, dictionary indices are
encoded using hybrid bit-packed / RLE scheme
* Null/not-null is encoded in definition levels
* Only non-null values are stored, so when decoding to Arrow, values
have to be "moved into place"
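
As a rough illustration of that last step (a hypothetical standalone helper,
not the actual parquet-cpp code), spreading the densely stored non-null
values into an Arrow-style values/validity pair for a flat column:

  #include <cstdint>
  #include <vector>

  // For a flat (non-nested) column, a definition level equal to max_def_level
  // means "value present"; anything lower means null.  Parquet stores only
  // the present values, so decoding must move them into their logical slots.
  void MoveIntoPlace(const std::vector<int16_t>& def_levels,
                     int16_t max_def_level,
                     const std::vector<int32_t>& non_null_values,
                     std::vector<int32_t>* out_values,     // logical length
                     std::vector<uint8_t>* out_validity) { // 1 byte per slot
    out_values->assign(def_levels.size(), 0);
    out_validity->assign(def_levels.size(), 0);
    size_t next = 0;  // cursor into the dense non-null values
    for (size_t i = 0; i < def_levels.size(); ++i) {
      if (def_levels[i] == max_def_level) {
        (*out_values)[i] = non_null_values[next++];
        (*out_validity)[i] = 1;
      }
    }
  }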

The current C++ implementation could certainly be made faster. One
consideration with Parquet is that the files are much smaller, so when
you are reading them over the network the effective end-to-end time
including IO and deserialization will frequently win.

> Regards
>
> Antoine.
>
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-13 Thread Antoine Pitrou
On Fri, 12 Jul 2019 20:37:15 -0700
Micah Kornfield  wrote:
> 
> If the latter, I wonder why Parquet cannot simply be used instead of
> > reinventing something similar but different.  
> 
> This is a reasonable point.  However there is a continuum here between file
> size and read and write times.  Parquet will likely always be the smallest
> with the largest times to convert to and from Arrow.  An uncompressed
> Feather/Arrow file will likely always take the most space but will have much
> faster conversion times.

I'm curious whether the Parquet conversion times are inherent to the
Parquet format or due to inefficiencies in the implementation.

Regards

Antoine.




Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Micah Kornfield
Hi Antoine,
I think Liya Fan raised some good points in his reply but I'd like to
answer your questions directly.


> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?

I tried to separate the two concepts into Encodings (things Arrow can
operate directly on) and Compression (solely for transport).  While there
is some overlap I think the two features can be considered separately.

For each encoding there is additional implementation complexity to properly
exploit it.  However, the benefit for some workloads can be large [1][2].

> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.


This is a reasonable point.  However there is a continuum here between file
size and read and write times.  Parquet will likely always be the smallest
with the largest times to convert to and from Arrow.  An uncompressed
Feather/Arrow file will likely always take the most space but will have much
faster conversion times.  The question is whether a buffer-level or some
other sub-file-level compression scheme provides enough value compared
with compressing the entire Feather file.  This is somewhat hand-wavy,
but if we feel we might want to investigate this further I can write some
benchmarks to quantify the differences.
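
Roughly the comparison I have in mind (a sketch using zlib purely for
illustration; the actual codec, buffer layout, and any format metadata are
all open questions):

  #include <cstddef>
  #include <vector>
  #include <zlib.h>

  // Compress one buffer with zlib (error handling omitted in this sketch).
  std::vector<Bytef> CompressBuffer(const std::vector<Bytef>& buf) {
    uLongf dest_len = compressBound(buf.size());
    std::vector<Bytef> out(dest_len);
    compress(out.data(), &dest_len, buf.data(), buf.size());
    out.resize(dest_len);
    return out;
  }

  // Option A: compress each buffer independently; buffers stay individually
  // addressable at the cost of per-buffer overhead.
  size_t PerBufferCompressedSize(const std::vector<std::vector<Bytef>>& bufs) {
    size_t total = 0;
    for (const auto& b : bufs) total += CompressBuffer(b).size();
    return total;
  }

  // Option B: concatenate everything and compress the whole file once.
  size_t WholeFileCompressedSize(const std::vector<std::vector<Bytef>>& bufs) {
    std::vector<Bytef> all;
    for (const auto& b : bufs) all.insert(all.end(), b.begin(), b.end());
    return CompressBuffer(all).size();
  }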

Cheers,
Micah

[1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf

On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou  wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Fan Liya
@Antoine Pitrou,

Good question. I think the answer depends on the concrete encoding scheme.

For some encoding schemes, it is not a good idea to use them for in-memory
data compression.
For others, it is beneficial to operate directly on the compressed data.

For example, it is beneficial to directly work on RLE data, with better
locality and fewer cache misses.
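
To make that concrete (my own sketch, using a run-ends RLE layout like the
one mentioned earlier in this thread), a sum can walk the runs instead of the
decoded values, touching far less memory:

  #include <cstdint>
  #include <vector>

  // Sum an RLE-encoded int64 column without materializing the decoded values:
  // each run contributes value * run_length, so the loop is over runs, not rows.
  int64_t SumRle(const std::vector<int64_t>& run_values,
                 const std::vector<int64_t>& run_ends) {  // cumulative ends
    int64_t sum = 0;
    int64_t prev_end = 0;
    for (size_t i = 0; i < run_values.size(); ++i) {
      sum += run_values[i] * (run_ends[i] - prev_end);
      prev_end = run_ends[i];
    }
    return sum;
  }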

Best,
Liya Fan

On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou  wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Antoine Pitrou


Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> OK, I've created a separate thread for data integrity/digests [1], and
> retitled this thread to continue the discussion on compression and
> encodings.  As a reminder the PR for the format additions [2] suggested a
> new SparseRecordBatch that would allow for the following features:
> 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> (e.g. narrower bit-width integers)
> 2.  Compression at the buffer level
> 3.  Eliding all metadata and data for empty columns.

So the question is whether this really needs to be in the in-memory
format, i.e. is it desired to operate directly on this compressed
format, or is it solely for transport?

If the latter, I wonder why Parquet cannot simply be used instead of
reinventing something similar but different.

Regards

Antoine.