OK, I've created a separate thread for data integrity/digests [1], and
retitled this thread to continue the discussion on compression and
encodings.  As a reminder, the PR for the format additions [2] suggested a
new SparseRecordBatch that would allow for the following features:
1.  Different data encodings at the Array level (e.g. RLE) and the Buffer
level (e.g. narrower bit-width integers); a sketch of the RLE idea follows
this list
2.  Compression at the buffer level
3.  Eliding all metadata and data for empty columns.
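
To make feature 1 concrete, here is a minimal C++ sketch of the run-end
idea and why it keeps random access sublinear.  The names are hypothetical
and not taken from the PR:

// Instead of (value, length) pairs, store the logical index at which each
// run ends; element lookup is then a binary search over run_ends, i.e.
// O(log N) rather than a linear scan.
#include <algorithm>
#include <cstdint>
#include <vector>

struct RunEndEncoded {             // hypothetical, for illustration
  std::vector<int64_t> run_ends;   // sorted: logical end of run k
  std::vector<int32_t> values;     // value of run k

  // Precondition: 0 <= i < run_ends.back() (the logical length).
  int32_t At(int64_t i) const {
    // The first run whose end exceeds i contains logical element i.
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), i);
    return values[it - run_ends.begin()];
  }
};

// e.g. logical [7,7,7,7,2,2]  ->  run_ends = {4, 6}, values = {7, 2}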

To recap my understanding of the highlights of the discussion so far:

Encodings:
There are some concerns over the efficiency of some of the encodings in
different scenarios:
 * Eliding null values makes many algorithms less efficient (see the sketch
after this list).
 * Joins might become harder with these encodings.
 * The additional code complexity was also raised on the Arrow sync call.
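
On the first bullet, a short C++ sketch of the cost (the helper names are
hypothetical; the LSB-first bit order matches the Arrow spec):

// With a dense layout, element i lives at values[i].  With a packed layout
// (nulls elided), we must first count the valid slots before i (a rank
// query over the validity bitmap) to find the physical index.
#include <cstdint>
#include <optional>
#include <vector>

inline bool IsValid(const std::vector<uint8_t>& validity, int64_t i) {
  return (validity[i / 8] >> (i % 8)) & 1;  // LSB-first bit order
}

std::optional<int32_t> PackedAt(const std::vector<uint8_t>& validity,
                                const std::vector<int32_t>& packed_values,
                                int64_t i) {
  if (!IsValid(validity, i)) return std::nullopt;
  int64_t rank = 0;  // number of valid elements in [0, i)
  for (int64_t j = 0; j < i; ++j) rank += IsValid(validity, j);
  return packed_values[rank];  // O(i) naively; faster with rank indexes
}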

Compression:
 * Buffer-level compression might be too small a granularity for data
compression (see the zstd sketch after this list).
 * General purpose compression at this level might not add much value, so it
might be better to keep it at the transport level.
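
As a sketch of the granularity question, per-buffer compression with zstd
(the codec named in the PR) would look roughly like this; whether small
buffers earn back the codec overhead is exactly what needs benchmarking:

// Compress one Arrow buffer independently with zstd.
#include <zstd.h>
#include <cstdint>
#include <vector>

std::vector<uint8_t> CompressBuffer(const uint8_t* data, size_t size) {
  std::vector<uint8_t> out(ZSTD_compressBound(size));  // worst-case size
  size_t n = ZSTD_compress(out.data(), out.size(), data, size,
                           /*compressionLevel=*/3);
  out.resize(ZSTD_isError(n) ? 0 : n);  // empty result signals failure here
  return out;
}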

Alternative designs:
* Put buffer level compression in specific transports (e.g. flight)
* Try to use the extension mechanism to support different encodings

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815

On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I think it would be worthwhile to split the discussion into two separate
> threads.  One thread for compression & encodings (which are related or
> even the same topic), one thread for data integrity.
>
> Regards
>
> Antoine.
>
>
> On 08/07/2019 at 07:22, Micah Kornfield wrote:
> >
> > - Compression:
> >    *  Use parquet for random access to data elements.
> >        -  This is one option; the main downside I see to it is generally
> > higher encoding/decoding costs.  Per below, I think it is reasonable to
> > wait until we have more data before adding compression into the spec.
> >    *  Have the transport layer do buffer specific compression:
> >       - I'm not a fan of this approach.  One nice thing about the
> > current communication protocols is that once you strip away the "framing"
> > data, all the byte streams are equivalent.  I think the simplicity that
> > follows from this in the code is a nice feature.
> >
> >
> > *Computational efficiency of array encodings:*
> >
> >> How does "more efficient computation" play out for operations such as
> >> hash or join?
> >
> > You would likely still need to materialize rows in most cases.  In some
> > "join" cases the sparse encoding of the null bitmap buffer could be a win
> > because it serves as an index to the non-null values.
> >
> > I should clarify that these encodings aren't always a win; it depends on
> > the workload/data shape, but they can have a large impact when used
> > appropriately (especially at the "Expression evaluation stage").  Also,
> > any wins don't come for free: exploiting the encodings properly will add
> > some level of complication to existing computation code.
> >
> > On a packed sparse array representation:
> >
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but
> >> compacting null slots complicates more advanced parallel routines that
> >> execute independently and rely on indices aligning with an element's
> >> logical position.
> >
> >
> > The main use-case I had in mind here was for scenarios like loading data
> > directly from Parquet (i.e. nulls are already elided), doing some
> > computation, and then potentially translating to a dense representation.
> > Similarly, it appears others have seen an advantage in some contexts from
> > saving time at shuffle [1].  In many cases there is an overlap with RLE,
> > so I'd be open to removing this from the proposal.
> >
> >
> > *On buffer encodings:*
> > To paraphrase, the main concern here seems to be that it is similar to
> > metadata that was already removed [2].
> >
> > A few points on this:
> > 1.  There was a typo in the original e-mail on the sparse-integer set
> > encoding where it said "all" values are either null or not null.  This
> > should have read "most" values.  The elision of buffers is a separate
> > feature.  (A rough sketch of the sparse-set idea follows this list.)
> > 2.  I believe these are different from the previous metadata because this
> > isn't repetitive information: it provides new information about the
> > contents of buffers that isn't available anywhere else.
> > 3.  The proposal is to create a new message type for this feature, so it
> > wouldn't bring back the old code and hopefully would have minimal impact
> > on existing IPC code.
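> >
> > To make point 1 concrete, here is a rough C++ sketch of the
> > sparse-integer-set idea.  The names are illustrative only; the actual
> > flatbuffer schema is in the PR:
> >
> > // When most values are null (or most are non-null), store only the
> > // positions of the minority case instead of a full bitmap.  Lookup is
> > // a binary search, and the position list doubles as an index to the
> > // non-null values (the potential "join" win mentioned earlier).
> > #include <algorithm>
> > #include <cstdint>
> > #include <vector>
> >
> > struct SparseValidity {            // hypothetical type, not the PR's
> >   bool stored_are_valid;           // true: `positions` lists non-nulls
> >   std::vector<int64_t> positions;  // sorted logical indices
> >
> >   bool IsValid(int64_t i) const {
> >     bool present =
> >         std::binary_search(positions.begin(), positions.end(), i);
> >     return stored_are_valid ? present : !present;
> >   }
> > };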
> >
> >
> > *On Compression:*
> > So far my take is that the consensus is this can probably be applied at
> > the transport level without being in the spec directly.  There might be
> > value in more specific types of compression at the buffer level, but we
> > should benchmark them first.
> >
> > *Data Integrity/Digest:*
> >
> >> one question is whether this occurs at the table level, column level,
> >> sequential array level, etc.
> >
> > This is a good question.  The batch level seemed easiest, which is why I
> > proposed it, but I'd be open to other options.  One nice thing about the
> > batch level is that it works for all other message types out of the box
> > (i.e. we can ensure the schema has been transmitted faithfully).
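> >
> > A minimal sketch of what the digest computation could look like in C++,
> > assuming the xxHash library (the seed and hash width are placeholders,
> > not part of the proposal):
> >
> > #include <xxhash.h>   // xxHash library
> > #include <cstddef>
> > #include <cstdint>
> >
> > // Hash the raw bytes of an IPC message; a trailing "Digest" message
> > // would carry this value so the reader can recompute and compare.
> > uint64_t MessageDigest(const uint8_t* message_bytes, size_t length) {
> >   return XXH64(message_bytes, length, /*seed=*/0);
> > }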
> >
> > Cheers,
> > Micah
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-5821
> > [2] https://github.com/apache/arrow/pull/1297/files
> > [3] https://jira.apache.org/jira/browse/ARROW-300
> >
> >
> > On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <ptaylor.apa...@gmail.com>
> > wrote:
> >
> >> Hi Micah,
> >>
> >> Similar to Jacques, I'm not disagreeing, but wondering whether these
> >> belong in Arrow or can be done externally. I'm mostly interested in
> >> changes that might impact SIMD processing, considering Arrow has already
> >> made conscious design decisions to trade memory for speed. Apologies in
> >> advance if I've misunderstood any of the proposals.
> >>
> >>> a. Add a run-length encoding scheme to efficiently represent repeated
> >>> values (the actual scheme encodes run ends instead of length to
> >>> preserve sub-linear random access).
> >> Couldn't one do RLE at the buffer level via a custom
> >> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
> >>
> >>> b. Add a “packed” sparse representation (null values don’t take up
> >>> space in value buffers)
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but
> >> compacting null slots complicates more advanced parallel routines that
> >> execute independently and rely on indices aligning with an element's
> >> logical position.
> >>
> >> It sounds like here the logical position depends on knowing the number
> >> of nulls up to that point (via something like sequentially iterating
> >> both data and validity buffers). An efficient parallel routine would
> >> likely need to scan beforehand to inflate the packed representation,
> >> where today it can simply slice/mmap the data buffer directly.
> >>
> >>> a. Add frame of reference integer encoding [7] (this allows for lower
> >>> bit-width encoding of integer types by subtracting a
> >>> “reference” value from all values in the buffer).
> >> I agree this is useful, but couldn't it also live in userland/an
> >> ExtensionType?
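> >>
> >> (To make sure we're discussing the same transform: a rough C++ sketch
> >> of frame-of-reference encoding as I understand it, with illustrative
> >> names and an assumed 16-bit delta width, not the PR's actual layout.)
> >>
> >> // Subtract the buffer's minimum so the remaining deltas fit a narrower
> >> // integer type; decoding adds the reference back.
> >> #include <algorithm>
> >> #include <cstdint>
> >> #include <vector>
> >>
> >> struct ForEncoded {              // illustrative only
> >>   int64_t reference;             // the subtracted minimum
> >>   std::vector<uint16_t> deltas;  // assumes max - min < 2^16
> >> };
> >>
> >> // Assumes `values` is non-empty and its range fits in 16 bits.
> >> ForEncoded ForEncode(const std::vector<int64_t>& values) {
> >>   int64_t ref = *std::min_element(values.begin(), values.end());
> >>   ForEncoded out{ref, {}};
> >>   out.deltas.reserve(values.size());
> >>   for (int64_t v : values)
> >>     out.deltas.push_back(static_cast<uint16_t>(v - ref));
> >>   return out;
> >> }
> >>
> >> int64_t ForDecode(const ForEncoded& e, size_t i) {
> >>   return e.reference + e.deltas[i];
> >> }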
> >>
> >>> b. Add a sparse integer set encoding.  This encoding allows more
> >>> efficient encoding of validity bit-masks for cases when all values are
> >>> either null or not null.
> >> If this is in reference to the discussion at link #4 [1], it sounds
> >> similar to the BufferLayout metadata that used to exist but was removed
> >> a while back [2]. Knowing the buffer layouts allows an implementation to
> >> generically elide any buffer at will, but would probably be a lot to
> >> bring back in. I can't say whether adding a different set of metadata
> >> would raise the same concerns Jacques mentioned in the JIRA thread
> >> in [2].
> >>
> >>> Data compression.  Similar to encodings but compression is solely for
> >>> reduction of data at rest/on the wire.  The proposal is to allow
> >>> compression of individual buffers. Right now zstd is proposed, but I
> >>> don’t feel strongly on the specific technologies here.
> >> What's the goal for this? Random element access into compressed
> >> in-memory columns, or compression at I/O boundaries?
> >>
> >> * If the former, is Parquet a better alternative here? Again, I'm
> >> cautious about the impact on parallel routines. CPU speeds are
> >> plateauing while memory and tx/rx keep growing. Compressed element
> >> access seems to be on the CPU side of that equation (meanwhile parallel
> >> deflate already exists, and I remember seeing research into parallel
> >> inflate).
> >>
> >> * If the latter, could we do a comparison of Arrow dictionary-encoding +
> >> different compression formats vs. building them into the spec? I know
> >> content-aware compression yields significant size reductions, but I
> >> wonder if the maintenance burden on Arrow contributors is worth the cost
> >> vs. a simpler dictionary-encoding + streaming gzip.
> >>
> >>> Data Integrity.  While the arrow file format isn’t meant for archiving
> >>> data, I think it is important to allow for optional native data
> >>> integrity checks in the format.  To this end, I proposed a new “Digest”
> >>> message type that can be added after other messages to record a
> >>> digest/hash of the preceding data. I suggested xxhash, but I don’t have
> >>> a strong opinion here, as long as there is some minimal support that
> >>> can potentially be expanded later.
> >> :thumbs up:
> >>
> >>
> >> Best,
> >> Paul
> >>
> >>
> >> 1. https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>
> >> 2. https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
> >>
> >> On 7/5/19 11:53 AM, Micah Kornfield wrote:
> >>> Hi Arrow-dev,
> >>>
> >>> I’d like to make a straw-man proposal to cover some features that I
> >>> think would be useful to Arrow, and that I would like to make a
> >>> proof-of-concept implementation for in Java and C++.  In particular,
> >>> the proposal covers allowing for smaller data sizes via compression and
> >>> encoding [1][2][8], data integrity [3], and avoiding unnecessary data
> >>> transfer [4][5].
> >>>
> >>> I’ve put together a PR [6] that has proposed changes to the flatbuffer
> >>> metadata to support the new features.  The PR introduces:
> >>>
> >>>     - A new “SparseRecordBatch” that can support one of multiple
> >>>       possible encodings (both dense and sparse), compression, and
> >>>       column elision.
> >>>     - A “Digest” message type to support optional data integrity.
> >>>
> >>>
> >>> Going into more details on the specific features in the PR:
> >>>
> >>>     1. Sparse encodings for arrays and buffers.  The guiding principle
> >>>        behind the suggested encodings is to support encodings that can
> >>>        be exploited by compute engines for more efficient computation
> >>>        (I don’t think parquet-style bit-packing belongs in Arrow).
> >>>        While the encodings don’t maintain O(1) data element access,
> >>>        they support sublinear, O(log(N)), element access.  The
> >>>        suggested encodings are:
> >>>        a. Array encodings:
> >>>           i.  Add a run-length encoding scheme to efficiently
> >>>               represent repeated values (the actual scheme encodes run
> >>>               ends instead of lengths to preserve sub-linear random
> >>>               access).
> >>>           ii. Add a “packed” sparse representation (null values don’t
> >>>               take up space in value buffers).
> >>>        b. Buffer encodings:
> >>>           i.  Add frame-of-reference integer encoding [7] (this allows
> >>>               for lower bit-width encoding of integer types by
> >>>               subtracting a “reference” value from all values in the
> >>>               buffer).
> >>>           ii. Add a sparse integer set encoding.  This encoding allows
> >>>               more efficient encoding of validity bit-masks for cases
> >>>               when all values are either null or not null.
> >>>     2. Data compression.  Similar to encodings, but compression is
> >>>        solely for reduction of data at rest/on the wire.  The proposal
> >>>        is to allow compression of individual buffers.  Right now zstd
> >>>        is proposed, but I don’t feel strongly on the specific
> >>>        technologies here.
> >>>     3. Column Elision.  For some use-cases, like structured logging,
> >>>        the overhead of including array metadata for columns with no
> >>>        data present is non-negligible.  The proposal provides a
> >>>        mechanism for omitting metadata for such arrays.
> >>>     4. Data Integrity.  While the arrow file format isn’t meant for
> >>>        archiving data, I think it is important to allow for optional
> >>>        native data integrity checks in the format.  To this end, I
> >>>        proposed a new “Digest” message type that can be added after
> >>>        other messages to record a digest/hash of the preceding data.
> >>>        I suggested xxhash, but I don’t have a strong opinion here, as
> >>>        long as there is some minimal support that can potentially be
> >>>        expanded later.
> >>>
> >>>
> >>> In the proposal I chose to use Tables and Unions everywhere for
> >>> flexibility, but in all likelihood some could be replaced by enums.
> >>>
> >>> My initial plan would be to solely focus on an IPC mechanism that can
> >>> send a SparseRecordBatch and immediately translate it to a normal
> >>> RecordBatch in both Java and C++.
> >>>
> >>> As a practical matter, the proposal represents a lot of work to get an
> >>> MVP working in time for the 1.0.0 release (provided the features are
> >>> accepted by the community), so I'd greatly appreciate it if anyone
> >>> wants to collaborate on this.
> >>>
> >>> If people feel it would make the conversation easier, I’m happy to
> >>> start a separate thread for each feature.  I can also create a Google
> >>> Doc for direct comments if that is preferred.
> >>>
> >>> Thanks,
> >>>
> >>> Micah
> >>>
> >>>
> >>>
> >>> P.S. In the interest of full disclosure, these ideas evolved in
> >>> collaboration with Brian Hulette and other colleagues at Google who are
> >>> interested in making use of Arrow in both internal and external
> >>> projects.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/ARROW-300
> >>>
> >>> [2] https://issues.apache.org/jira/browse/ARROW-5224
> >>>
> >>> [3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> >>>
> >>> [4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>>
> >>> [5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> >>>
> >>> [6] https://github.com/apache/arrow/pull/4815
> >>>
> >>> [7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> >>>
> >>> [8] https://issues.apache.org/jira/browse/ARROW-5821
> >>>
> >>
> >>
> >
>
