Hi Micah,

Thanks for opening this discussion.

For me, most of the features are super useful, especially RLE and integer
encoding.

IMO, to support these new features, we need some basic algorithms first
(e.g. sort and search).
For example, RLE and sort are often used in combination, as the sketch
below illustrates.
These new features should sit at a higher level than the basic algorithms.
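
A minimal sketch (plain C++, with illustrative names only; this is not an
existing Arrow API) of why search matters here: random access into a
run-end-encoded array reduces to a binary search over the sorted run ends.

  // Hypothetical run-end-encoded array.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct RunEndEncoded {
    std::vector<int64_t> run_ends;  // exclusive logical end of each run, sorted
    std::vector<int32_t> values;    // one value per run

    // O(log N) element access; assumes 0 <= index < logical length.
    int32_t At(int64_t index) const {
      auto it = std::upper_bound(run_ends.begin(), run_ends.end(), index);
      return values[it - run_ends.begin()];
    }
  };

For example, the logical array {7, 7, 7, 8, 8} becomes run_ends = {3, 5},
values = {7, 8}, and At(3) returns 8 after one binary search.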

Some of the basic algorithms are in progress (e.g. [1] and [2]), but I think
more are needed.

Best,
Liya Fan

[1] https://github.com/apache/arrow/pull/4788
[2] https://github.com/apache/arrow/pull/4699

On Mon, Jul 8, 2019 at 1:23 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Paul, Jacques and Antoine,
> Thank you for the valuable feedback.  I'm going to try to address it all in
> this e-mail to help consolidate the conversation.  I've grouped my
> responses by topic and included snippets from other e-mails where relevant.
>
> *Timeline for the new features:*
> -  So far the sentiment is that this is too much for the 1.0.0 release.
> This seems reasonable to me.  In general, I tend to be overoptimistic about
> what I can get done :)
>
>
> *Design Alternatives proposed (more are welcome)*
> - Encodings:
>   *  Use extension types (and other user-land options).  This is one
> potential way of accomplishing these, but I think it is suboptimal for a few
> reasons:
>      1.  The encodings are targeted at already existing logical types, not
> new ones.  So it is a little bit awkward to model them as a user-defined
> "int32" type.
>      2.  Extension types are declared at the schema level, but it is very
> useful to adapt encodings at the batch level.  In some cases a more compact
> encoding might be warranted, while in others the normal dense encoding would
> be appropriate.
>      3.  Using binary blobs removes much of the efficiency of the encodings
> (i.e. the 4-byte overhead per row).
>      4.  In the long run, I'd like to see encodings be exploited in
> computation engines.  This becomes harder/impossible when using user-defined
> types.
>
> - Compression:
>    *  Use parquet for random access to data elements.
>        -  This is one option; the main downside I see is generally higher
> encoding/decoding costs.  Per below, I think it is reasonable to wait until
> we have more data before adding compression to the spec.
>    *  Have the transport layer do buffer-specific compression:
>       - I'm not a fan of this approach.  One nice thing about the current
> communication protocols is that once you strip away "framing" data, all the
> byte streams are equivalent.  I think the simplicity that follows in code
> from this is a nice feature.
>
>
> *Computational efficiency of array encodings:*
>
> > How does "more efficient computation" play out for operations such as
> > hash or join?
>
> You would likely still need to materialize rows in most cases.  In some
> "join" cases the sparse encoding of the null bitmap buffer could be a win
> because it serves as an index to the non-null values.
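>
> To make that concrete, a minimal sketch (a hypothetical layout, not the
> proposed metadata): if validity is encoded as a sorted list of non-null
> positions, a hash-join build can visit exactly those entries instead of
> scanning a bitmap.
>
>     // Hypothetical sparse validity: sorted positions of non-null slots.
>     #include <cstdint>
>     #include <unordered_map>
>     #include <vector>
>
>     void BuildHashTable(const std::vector<int64_t>& non_null_positions,
>                         const int32_t* values,  // dense value buffer
>                         std::unordered_map<int32_t, int64_t>* table) {
>       // Only non-null rows are visited; the position list is the index.
>       for (int64_t pos : non_null_positions) {
>         (*table)[values[pos]] = pos;
>       }
>     }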
>
> I should clarify that these encodings aren't always a win (it depends on
> workload/data shape), but they can have a large impact when used
> appropriately (especially at the "expression evaluation" stage).  Also, any
> wins don't come for free: exploiting encodings properly will add some level
> of complication to existing computation code.
>
> On a packed sparse array representation:
>
> > This would be fine for simple SIMD aggregations like count/avg/mean, but
> > compacting null slots complicates more advanced parallel routines that
> > execute independently and rely on indices aligning with an element's
> > logical position.
>
>
> The main use-case I had in mind here was for scenarios like loading data
> directly from parquet (i.e. nulls are already elided), doing some
> computation, and then potentially translating to a dense representation.
> Similarly, it appears others have seen benefits in some contexts from
> saving time at shuffle [1].  In many cases there is an overlap with RLE, so
> I'd be open to removing this from the proposal.
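>
> For illustration, a minimal sketch (hypothetical code, not the proposed
> layout) of translating a packed representation back to a dense one:
>
>     // Nulls were elided from the value buffer; re-expand using validity.
>     #include <cstdint>
>     #include <vector>
>
>     std::vector<int32_t> ToDense(const std::vector<int32_t>& packed_values,
>                                  const std::vector<bool>& validity) {
>       std::vector<int32_t> dense(validity.size(), 0);  // 0 stands in for null
>       size_t next = 0;  // next unread slot in the packed buffer
>       for (size_t i = 0; i < validity.size(); ++i) {
>         if (validity[i]) dense[i] = packed_values[next++];
>       }
>       return dense;
>     }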
>
>
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be that this is similar to
> metadata that was already removed [2].
>
> A few points on this:
> 1.  There was a typo in the original e-mail on the sparse-integer set
> encoding, where it said "all" values are either null or not null.  This
> should have read "most" values (a sketch of the idea follows this list).
> The elision of buffers is a separate feature.
> 2.  I believe these are different from the previous metadata because this
> isn't repetitive information: it provides new information about the
> contents of buffers that is not available anywhere else.
> 3.  The proposal is to create a new message type for this feature, so it
> wouldn't bring back the old code and hopefully would have minimal impact
> on already existing IPC code.
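>
> As noted above, a sketch of the sparse-integer set idea (a hypothetical
> encoding for illustration, not the flatbuffer layout in the PR): when nulls
> are rare, storing the exceptional positions is far more compact than a full
> bitmask, and membership tests stay sublinear.
>
>     // Hypothetical: a validity bitmask encoded as sorted null positions.
>     #include <algorithm>
>     #include <cstdint>
>     #include <vector>
>
>     struct SparseValidity {
>       int64_t length;                       // logical array length
>       std::vector<int64_t> null_positions;  // sorted; empty => all valid
>     };
>
>     bool IsValid(const SparseValidity& v, int64_t i) {
>       return !std::binary_search(v.null_positions.begin(),
>                                  v.null_positions.end(), i);
>     }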
>
>
> *On Compression:*
> So far the consensus seems to be that this can probably be applied at the
> transport level without being in the spec directly.  There might be value
> in more specific types of compression at the buffer level, but we should
> benchmark them first.
>
> *Data Integrity/Digest:*
>
> > one question is whether this occurs at the table level, column level,
> > sequential array level, etc.
>
> This is a good question; the batch level seemed easiest, which is why I
> proposed it, but I'd be open to other options.  One nice thing about the
> batch level is that it works for all other message types out of the box
> (i.e. we can ensure the schema has been transmitted faithfully).
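>
> To illustrate the shape such a check could take, a sketch using the xxHash
> C library (the Digest message layout itself is still open, so the function
> below is hypothetical):
>
>     // Verify a batch-level digest over the raw bytes of the preceding
>     // message (works for schema and record-batch messages alike).
>     #include <cstdint>
>     #include <xxhash.h>
>
>     bool VerifyMessage(const uint8_t* data, size_t length,
>                        uint64_t expected_digest) {
>       return XXH64(data, length, /*seed=*/0) == expected_digest;
>     }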
>
> Cheers,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
>
>
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <ptaylor.apa...@gmail.com>
> wrote:
>
> > Hi Micah,
> >
> > Similar to Jacques I'm not disagreeing, but wondering whether they belong
> > in Arrow or can be done externally. I'm mostly interested in changes that
> > might impact SIMD processing, considering Arrow's already made conscious
> > design decisions to trade memory for speed. Apologies in advance if I've
> > misunderstood any of the proposals.
> >
> > > a. Add a run-length encoding scheme to efficiently represent repeated
> > > values (the actual scheme encodes run ends instead of length to preserve
> > > sub-linear random access).
> > Couldn't one do RLE at the buffer level via a custom
> > FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
> >
> > > b. Add a “packed” sparse representation (null values don’t take up
> > > space in value buffers)
> > This would be fine for simple SIMD aggregations like count/avg/mean, but
> > compacting null slots complicates more advanced parallel routines that
> > execute independently and rely on indices aligning with an element's
> > logical position.
> >
> > It sounds like the logical position here depends on knowing the number
> > of nulls up to that point (via something like sequentially iterating
> > both data and validity buffers). An efficient parallel routine would
> > likely need to scan beforehand to inflate the packed representation,
> > where today it can simply slice/mmap the data buffer directly.
> >
> > > a. Add frame of reference integer encoding [7] (this allows for lower
> > > bit-width encoding of integer types by subtracting a
> > > “reference” value from all values in the buffer).
> > I agree this is useful, but couldn't it also live in userland/an
> > ExtensionType?
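> >
> > For readers unfamiliar with the technique, a minimal sketch (plain C++,
> > independent of where it lives; the 16-bit delta width is an assumption
> > for illustration):
> >
> >     // Frame-of-reference: store the minimum once, then narrow deltas.
> >     #include <algorithm>
> >     #include <cstdint>
> >     #include <vector>
> >
> >     struct ForEncoded {
> >       int64_t reference;             // minimum of the (non-empty) input
> >       std::vector<uint16_t> deltas;  // assumes max - min fits in 16 bits
> >     };
> >
> >     ForEncoded ForEncode(const std::vector<int64_t>& values) {
> >       ForEncoded out{*std::min_element(values.begin(), values.end()), {}};
> >       out.deltas.reserve(values.size());
> >       for (int64_t v : values)
> >         out.deltas.push_back(static_cast<uint16_t>(v - out.reference));
> >       return out;
> >     }
> >
> >     int64_t ForDecode(const ForEncoded& e, size_t i) {
> >       return e.reference + e.deltas[i];  // O(1) random access is preserved
> >     }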
> >
> > > b. Add a sparse integer set encoding.  This encoding allows more
> > > efficient encoding of validity bit-masks for cases when all values are
> > > either null or not null.
> > If this is in reference to the discussion at link #4 [1], it sounds
> > similar to the BufferLayout metadata that used to exist but was removed
> > a while back [2]. Knowing the buffer layouts allows an implementation to
> > generically elide any buffer at will, but would probably be a lot to
> > bring back in. I can't say whether adding a different set of metadata
> > would raise the same concerns Jacques mentioned in the JIRA thread in [2].
> >
> > > Data compression.  Similar to encodings but compression is solely for
> > > reduction of data at rest/on the wire.  The proposal is to allow
> > > compression of individual buffers. Right now zstd is proposed, but I
> > > don’t feel strongly on the specific technologies here.
> > What's the goal for this? Random element access into compressed
> > in-memory columns, or compression at I/O boundaries?
> >
> > * If the former, is Parquet a better alternative here? Again, I'm
> > cautious about the impact on parallel routines. CPU speeds are
> > plateauing while memory and tx/rx keep growing. Compressed element
> > access seems to be on the CPU side of that equation (meanwhile parallel
> > deflate already exists, and I remember seeing research into parallel
> > inflate).
> >
> > * If the latter, could we do a comparison of Arrow dictionary-encoding +
> > different compression formats vs. building them into the spec? I know
> > content-aware compression yields significant size reductions, but I
> > wonder if the maintenance burden on Arrow contributors is worth the cost
> > vs. a simpler dictionary-encoding + streaming gzip.
> >
> > > Data Integrity.  While the arrow file format isn’t meant for archiving
> > > data, I think it is important to allow for optional native data integrity
> > > checks in the format.  To this end, I proposed a new “Digest” message type
> > > that can be added after other messages to record a digest/hash of the
> > > preceding data. I suggested xxhash, but I don’t have a strong opinion
> > > here, as long as there is some minimal support that can potentially be
> > > expanded later.
> > :thumbs up:
> >
> >
> > Best,
> > Paul
> >
> >
> > 1. https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >
> > 2. https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
> >
> > On 7/5/19 11:53 AM, Micah Kornfield wrote:
> > > Hi Arrow-dev,
> > >
> > > I’d like to make a straw-man proposal to cover some features that I think
> > > would be useful to Arrow, and that I would like to make a proof-of-concept
> > > implementation for in Java and C++.  In particular, the proposal covers
> > > allowing for smaller data sizes via compression and encoding [1][2][8],
> > > data integrity [3] and avoiding unnecessary data transfer [4][5].
> > >
> > > I’ve put together a PR [6] that has proposed changes to the flatbuffer
> > > metadata to support the new features.  The PR introduces:
> > >
> > >     - A new “SparseRecordBatch” that can support one of multiple possible
> > >       encodings (both dense and sparse), compression and column elision.
> > >     - A “Digest” message type to support optional data integrity.
> > >
> > >
> > > Going into more details on the specific features in the PR:
> > >
> > > 1. Sparse encodings for arrays and buffers.  The guiding principles
> > >    behind the suggested encodings are to support encodings that can be
> > >    exploited by compute engines for more efficient computation (I don’t
> > >    think parquet-style bit-packing belongs in Arrow).  While the encodings
> > >    don’t maintain O(1) data element access, they support sublinear,
> > >    O(log(N)), element access.  The suggested encodings are:
> > >
> > >    a. Array encodings:
> > >       1. Add a run-length encoding scheme to efficiently represent
> > >          repeated values (the actual scheme encodes run ends instead of
> > >          length to preserve sub-linear random access).
> > >       2. Add a “packed” sparse representation (null values don’t take up
> > >          space in value buffers).
> > >
> > >    b. Buffer encodings:
> > >       1. Add frame-of-reference integer encoding [7] (this allows for
> > >          lower bit-width encoding of integer types by subtracting a
> > >          “reference” value from all values in the buffer).
> > >       2. Add a sparse integer set encoding.  This encoding allows more
> > >          efficient encoding of validity bit-masks for cases when all
> > >          values are either null or not null.
> > >
> > > 2. Data compression.  Similar to encodings, but compression is solely for
> > >    reduction of data at rest/on the wire.  The proposal is to allow
> > >    compression of individual buffers.  Right now zstd is proposed, but I
> > >    don’t feel strongly on the specific technologies here.
> > >
> > > 3. Column Elision.  For some use-cases, like structured logging,
> > >    including array metadata for columns with no data present represents
> > >    non-negligible overhead.  The proposal provides a mechanism for
> > >    omitting metadata for such arrays.
> > >
> > > 4. Data Integrity.  While the arrow file format isn’t meant for archiving
> > >    data, I think it is important to allow for optional native data
> > >    integrity checks in the format.  To this end, I proposed a new “Digest”
> > >    message type that can be added after other messages to record a
> > >    digest/hash of the preceding data.  I suggested xxhash, but I don’t
> > >    have a strong opinion here, as long as there is some minimal support
> > >    that can potentially be expanded later.
> > >
> > >
> > > In the proposal I chose to use Tables and Unions everywhere for
> > > flexibility, but in all likelihood some could be replaced by enums.
> > >
> > > My initial plan would be to focus solely on an IPC mechanism that can
> > > send a SparseRecordBatch and immediately translate it to a normal
> > > RecordBatch in both Java and C++.
> > >
> > > As a practical matter the proposal represents a lot of work to get an MVP
> > > working in time for the 1.0.0 release (provided the features are accepted
> > > by the community), so I'd greatly appreciate it if anyone wants to
> > > collaborate on this.
> > >
> > > If people feel it would make the conversation easier, I’m happy to start
> > > a separate thread for each feature.  I can also create a Google Doc for
> > > direct comments if that is preferred.
> > >
> > > Thanks,
> > >
> > > Micah
> > >
> > >
> > >
> > > P.S. In the interest of full disclosure, these ideas evolved in
> > > collaboration with Brian Hulette and other colleagues at Google who are
> > > interested in making use of Arrow in both internal and external projects.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-300
> > >
> > > [2] https://issues.apache.org/jira/browse/ARROW-5224
> > >
> > > [3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> > >
> > > [4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> > >
> > > [5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> > >
> > > [6] https://github.com/apache/arrow/pull/4815
> > >
> > > [7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> > >
> > > [8] https://issues.apache.org/jira/browse/ARROW-5821
> > >
> >
> >
>
