I think it would be worthwhile to split the discussion into two separate threads: one for compression and encodings (which are related, or even the same, topics), and one for data integrity.
Regards,
Antoine.

On 08/07/2019 at 07:22, Micah Kornfield wrote:
> - Compression:
>   * Use parquet for random access to data elements.
>     - This is one option; the main downside I see is generally higher
>       encoding/decoding costs. Per below, I think it is reasonable to
>       wait until we have more data before adding compression to the
>       spec.
>   * Have the transport layer do buffer-specific compression:
>     - I'm not a fan of this approach. One nice thing about the
>       current communication protocols is that once you strip away
>       "framing" data, all the byte streams are equivalent. I think
>       the simplicity that follows from this in code is a nice
>       feature.
>
> *Computational efficiency of array encodings:*
>
>> How does "more efficient computation" play out for operations such
>> as hash or join?
>
> You would still likely need to materialize rows in most cases. In
> some "join" cases the sparse encoding of the null bitmap buffer
> could be a win because it serves as an index to non-null values.
>
> I should clarify that these encodings aren't always a win depending
> on workload/data shape, but they can have a large impact when used
> appropriately (especially at the "expression evaluation" stage).
> Also, any wins don't come for free: exploiting encodings properly
> will add some level of complication to existing computation code.
>
> On a packed sparse array representation:
>
>> This would be fine for simple SIMD aggregations like
>> count/avg/mean, but compacting null slots complicates more advanced
>> parallel routines that execute independently and rely on indices
>> aligning with an element's logical position.
>
> The main use-case I had in mind here was for scenarios like loading
> data directly from parquet (i.e. nulls are already elided), doing
> some computation, and then potentially translating to a dense
> representation. Similarly, it appears others have seen advantages in
> some contexts from saving time at shuffle [1].
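To illustrate the point above that in a packed representation the validity bitmap can double as an index to non-null values, here is a minimal Python sketch. This is not Arrow's actual layout or API; `pack` and `logical_get` are hypothetical names invented for the example.

```python
def pack(values, validity):
    """Elide null slots: keep only values whose validity bit is set."""
    return [v for v, ok in zip(values, validity) if ok]

def logical_get(packed, validity, i):
    """Fetch logical element i, or None if it is null.

    The physical index is the count of valid slots before position i,
    which is why the validity bitmap itself acts as an index into the
    packed value buffer.
    """
    if not validity[i]:
        return None
    return packed[sum(validity[:i])]

validity = [1, 0, 1, 1, 0]
packed = pack([10, None, 20, 30, None], validity)
assert packed == [10, 20, 30]            # nulls take no space
assert logical_get(packed, validity, 3) == 30
assert logical_get(packed, validity, 1) is None
```

A real implementation would use popcount over bitmap words (or a precomputed prefix sum) rather than the linear `sum` shown here, which is exactly the pre-scan cost Paul raises below.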
> In many cases there is an overlap with RLE, so I'd be open to
> removing this from the proposal.
>
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be that it is similar
> to metadata that was already removed [2].
>
> A few points on this:
> 1. There was a typo in the original e-mail on sparse-integer-set
>    encoding where it said "all" values are either null or not null.
>    This should have read "most" values. The elision of buffers is a
>    separate feature.
> 2. I believe these are different than the previous metadata because
>    this isn't repetitive information. It provides new information
>    about the contents of buffers not available anywhere else.
> 3. The proposal is to create a new message type for this feature, so
>    it wouldn't bring back the old code and hopefully would have
>    minimal impact on existing IPC code.
>
> *On compression:*
> So far my take is that the consensus is that this can probably be
> applied at the transport level without being in the spec directly.
> There might be value in more specific types of compression at the
> buffer level, but we should benchmark them first.
>
> *Data integrity/digest:*
>
>> one question is whether this occurs at the table level, column
>> level, sequential array level, etc.
>
> This is a good question. It seemed like the batch level was easiest,
> and that is why I proposed it, but I'd be open to other options. One
> nice thing about the batch level is that it works for all other
> message types out of the box (i.e. we can ensure the schema has been
> transmitted faithfully).
>
> Cheers,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
>
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <ptaylor.apa...@gmail.com>
> wrote:
>
>> Hi Micah,
>>
>> Similar to Jacques I'm not disagreeing, but wondering if they belong
>> in Arrow vs. can be done externally.
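The batch-level digest Micah describes can be sketched in a few lines. The proposal suggests xxhash; sha256 is substituted here purely because it ships with Python's standard library, and the framing (a digest message trailing the bytes of the preceding message) is a simplified stand-in, not the proposed flatbuffer schema.

```python
import hashlib

def digest_message(frame_bytes: bytes) -> bytes:
    """Hash the serialized bytes of the preceding message.

    Works uniformly for any message type (record batch, schema, ...),
    which is the 'out of the box' property of hashing at this level.
    """
    return hashlib.sha256(frame_bytes).digest()

def verify(frame_bytes: bytes, expected: bytes) -> bool:
    return digest_message(frame_bytes) == expected

batch = b"\x00" * 64              # stand-in for a serialized message
tag = digest_message(batch)
assert verify(batch, tag)
assert not verify(batch + b"\x01", tag)   # corruption is detected
```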
>> I'm mostly interested in changes that might impact SIMD processing,
>> considering Arrow's already made conscious design decisions to trade
>> memory for speed. Apologies in advance if I've misunderstood any of
>> the proposals.
>>
>>> a. Add a run-length encoding scheme to efficiently represent
>>> repeated values (the actual scheme encodes run ends instead of
>>> lengths to preserve sub-linear random access).
>>
>> Couldn't one do RLE at the buffer level via a custom
>> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new
>> ExtensionType?
>>
>>> b. Add a “packed” sparse representation (null values don’t take up
>>> space in value buffers)
>>
>> This would be fine for simple SIMD aggregations like
>> count/avg/mean, but compacting null slots complicates more advanced
>> parallel routines that execute independently and rely on indices
>> aligning with an element's logical position.
>>
>> It sounds like here the logical position depends on knowing the
>> number of nulls up to that point (via something like sequentially
>> iterating both data and validity buffers). An efficient parallel
>> routine would likely need to scan beforehand to inflate the packed
>> representation, where today it can simply slice/mmap the data
>> buffer directly.
>>
>>> a. Add frame-of-reference integer encoding [7] (this allows for
>>> lower bit-width encoding of integer types by subtracting a
>>> “reference” value from all values in the buffer).
>>
>> I agree this is useful, but couldn't it also live in userland/an
>> ExtensionType?
>>
>>> b. Add a sparse integer set encoding. This encoding allows more
>>> efficient encoding of validity bit-masks for cases when all values
>>> are either null or not null.
>>
>> If this is in reference to the discussion at link #4 [1], it sounds
>> similar to the BufferLayout metadata that used to exist but was
>> removed a while back [2].
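The frame-of-reference encoding discussed above can be sketched as follows. This is an illustration with hypothetical names (`for_encode`, `for_get`), not the proposed wire format: the reference (here, the minimum) is stored once, and the remaining offsets fit in a much narrower integer width while preserving O(1) element access.

```python
def for_encode(values):
    """Frame-of-reference: one reference value plus small offsets."""
    ref = min(values)
    offsets = [v - ref for v in values]
    # Bit width needed for the offsets; a real encoder would bit-pack.
    width_bits = max(offsets).bit_length() or 1
    return ref, width_bits, offsets

def for_get(ref, offsets, i):
    # Random access stays O(1): decode a single element on demand.
    return ref + offsets[i]

ref, width, offsets = for_encode([1_000_003, 1_000_001, 1_000_007])
assert width <= 3        # offsets 0..6 need 3 bits instead of 32+
assert for_get(ref, offsets, 2) == 1_000_007
```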
>> Knowing the buffer layouts allows an implementation to generically
>> elide any buffer at will, but would probably be a lot to bring back
>> in. I can't say whether adding a different set of metadata would
>> raise the same concerns Jacques mentioned in the JIRA thread in
>> [2].
>>
>>> Data compression. Similar to encodings, but compression is solely
>>> for reduction of data at rest/on the wire. The proposal is to
>>> allow compression of individual buffers. Right now zstd is
>>> proposed, but I don’t feel strongly on the specific technologies
>>> here.
>>
>> What's the goal for this? Random element access into compressed
>> in-memory columns, or compression at I/O boundaries?
>>
>> * If the former, is Parquet a better alternative here? Again, I'm
>>   cautious about the impact on parallel routines. CPU speeds are
>>   plateauing while memory and tx/rx keep growing. Compressed
>>   element access seems to be on the CPU side of that equation
>>   (meanwhile, parallel deflate already exists, and I remember
>>   seeing research into parallel inflate).
>>
>> * If the latter, could we do a comparison of Arrow dictionary
>>   encoding + different compression formats vs. building them into
>>   the spec? I know content-aware compression yields significant
>>   size reductions, but I wonder whether the maintenance burden on
>>   Arrow contributors is worth the cost vs. a simpler dictionary
>>   encoding + streaming gzip.
>>
>>> Data integrity. While the arrow file format isn’t meant for
>>> archiving data, I think it is important to allow for optional
>>> native data integrity checks in the format. To this end, I
>>> proposed a new “Digest” message type that can be added after other
>>> messages to record a digest/hash of the preceding data. I
>>> suggested xxhash, but I don’t have a strong opinion here, as long
>>> as there is some minimal support that can potentially be expanded
>>> later.
>>
>> :thumbs up:
>>
>> Best,
>> Paul
>>
>> 1.
>> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>
>> 2.
>> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
>>
>> On 7/5/19 11:53 AM, Micah Kornfield wrote:
>>> Hi Arrow-dev,
>>>
>>> I’d like to make a straw-man proposal to cover some features that
>>> I think would be useful to Arrow, and that I would like to make a
>>> proof-of-concept implementation for in Java and C++. In
>>> particular, the proposal covers allowing for smaller data sizes
>>> via compression and encoding [1][2][8], data integrity [3], and
>>> avoiding unnecessary data transfer [4][5].
>>>
>>> I’ve put together a PR [6] that has proposed changes to the
>>> flatbuffer metadata to support the new features. The PR
>>> introduces:
>>>
>>> - A new “SparseRecordBatch” that can support one of multiple
>>>   possible encodings (both dense and sparse), compression, and
>>>   column elision.
>>> - A “Digest” message type to support optional data integrity.
>>>
>>> Going into more detail on the specific features in the PR:
>>>
>>> 1. Sparse encodings for arrays and buffers. The guiding principle
>>>    behind the suggested encodings is to support encodings that can
>>>    be exploited by compute engines for more efficient computation
>>>    (I don’t think parquet-style bit-packing belongs in Arrow).
>>>    While the encodings don’t maintain O(1) data element access,
>>>    they support sublinear, O(log(N)), element access. The
>>>    suggested encodings are:
>>>    Array encodings:
>>>    a. Add a run-length encoding scheme to efficiently represent
>>>       repeated values (the actual scheme encodes run ends instead
>>>       of lengths to preserve sub-linear random access).
>>>    b. Add a “packed” sparse representation (null values don’t take
>>>       up space in value buffers).
>>>    Buffer encodings:
>>>    a. Add frame-of-reference integer encoding [7] (this allows
>>>       for lower bit-width encoding of integer types by subtracting
>>>       a “reference” value from all values in the buffer).
>>>    b. Add a sparse integer set encoding. This encoding allows more
>>>       efficient encoding of validity bit-masks for cases when all
>>>       values are either null or not null.
>>> 2. Data compression. Similar to encodings, but compression is
>>>    solely for reduction of data at rest/on the wire. The proposal
>>>    is to allow compression of individual buffers. Right now zstd
>>>    is proposed, but I don’t feel strongly on the specific
>>>    technologies here.
>>> 3. Column elision. For some use-cases, like structured logging,
>>>    including array metadata for columns with no data present
>>>    represents non-negligible overhead. The proposal provides a
>>>    mechanism for omitting metadata for such arrays.
>>> 4. Data integrity. While the arrow file format isn’t meant for
>>>    archiving data, I think it is important to allow for optional
>>>    native data integrity checks in the format. To this end, I
>>>    proposed a new “Digest” message type that can be added after
>>>    other messages to record a digest/hash of the preceding data. I
>>>    suggested xxhash, but I don’t have a strong opinion here, as
>>>    long as there is some minimal support that can potentially be
>>>    expanded later.
>>>
>>> In the proposal I chose to use Tables and Unions everywhere for
>>> flexibility, but in all likelihood some could be replaced by
>>> enums.
>>>
>>> My initial plan would be to solely focus on an IPC mechanism that
>>> can send a SparseRecordBatch and immediately translate it to a
>>> normal RecordBatch in both Java and C++.
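The run-end scheme in item 1a, and the O(log N) random access it preserves, can be sketched as follows. This is an illustrative Python model with hypothetical names, not the proposed layout: run ends are cumulative element counts, so locating the run containing logical index i is a binary search.

```python
import bisect

def rle_encode(values):
    """Encode as (run_ends, run_values); run ends are cumulative,
    which is what makes sub-linear random access possible."""
    run_ends, run_values = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_ends[-1] += 1            # extend the current run
        else:
            run_ends.append((run_ends[-1] if run_ends else 0) + 1)
            run_values.append(v)
    return run_ends, run_values

def rle_get(run_ends, run_values, i):
    # First run whose end exceeds i: O(log N) via binary search.
    return run_values[bisect.bisect_right(run_ends, i)]

ends, vals = rle_encode([7, 7, 7, 9, 9, 4])
assert (ends, vals) == ([3, 5, 6], [7, 9, 4])
assert [rle_get(ends, vals, i) for i in range(6)] == [7, 7, 7, 9, 9, 4]
```

Storing ends rather than lengths trades nothing at encode time but avoids the prefix-sum scan that run *lengths* would require on every random access.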
>>> As a practical matter the proposal represents a lot of work to
>>> get an MVP working in time for the 1.0.0 release (provided the
>>> features are accepted by the community), so I'd greatly appreciate
>>> it if anyone wants to collaborate on this.
>>>
>>> If it is easier, I’m happy to start a separate thread per feature
>>> if people feel that would make the conversation easier. I can also
>>> create a Google Doc for direct comments if that is preferred.
>>>
>>> Thanks,
>>>
>>> Micah
>>>
>>> P.S. In the interest of full disclosure, these ideas evolved in
>>> collaboration with Brian Hulette and other colleagues at Google
>>> who are interested in making use of Arrow in both internal and
>>> external projects.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-300
>>> [2] https://issues.apache.org/jira/browse/ARROW-5224
>>> [3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>>> [4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>> [5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>>> [6] https://github.com/apache/arrow/pull/4815
>>> [7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>>> [8] https://issues.apache.org/jira/browse/ARROW-5821
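The comparison Paul asks for earlier in the thread (dictionary encoding plus a generic compressor vs. compressing raw values) could be prototyped along these lines. Everything here is invented for illustration: zlib stands in for gzip/zstd because it is in the standard library, JSON stands in for a real serialization, and the column is synthetic.

```python
import json
import zlib

# Hypothetical repetitive column, e.g. structured-log severity levels.
column = ["INFO", "WARN", "INFO", "INFO", "ERROR"] * 2000

# Baseline: compress the serialized raw values directly.
plain = zlib.compress(json.dumps(column).encode())

# Dictionary-encoded: unique values once, plus small integer codes.
dictionary = sorted(set(column))
codes = [dictionary.index(v) for v in column]
encoded = zlib.compress(json.dumps([dictionary, codes]).encode())

# Round-trip both ways to confirm nothing is lost; comparing
# len(plain) vs. len(encoded) over real datasets would be the actual
# benchmark.
d, c = json.loads(zlib.decompress(encoded))
assert [d[i] for i in c] == column
assert json.loads(zlib.decompress(plain)) == column
```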