OK, I've created a separate thread for data integrity/digests [1], and retitled this thread to continue the discussion on compression and encodings. As a reminder, the PR for the format additions [2] suggested a new SparseRecordBatch that would allow for the following features:

1. Different data encodings at the Array level (e.g. RLE; see the sketch below) and the Buffer level (e.g. narrower bit-width integers)
2. Compression at the buffer level
3. Eliding all metadata and data for empty columns
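For anyone who hasn't looked at the PR, a minimal sketch of the run-end flavor of RLE, with a layout I made up for illustration (the PR's actual flatbuffer layout differs). Encoding run ends rather than run lengths is what keeps random access sublinear: lookup is a binary search over the run ends.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative only -- not the PR's actual layout. Each run stores one
// value plus the logical index at which the run ends (exclusive), so the
// run containing a logical index is found by binary search: O(log N).
struct RunEndEncoded {
  std::vector<int64_t> run_ends;  // cumulative logical end of each run
  std::vector<int32_t> values;    // one value per run

  int32_t At(int64_t logical_index) const {
    // First run whose (exclusive) end is greater than the index.
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), logical_index);
    return values[it - run_ends.begin()];  // assumes index is in range
  }
};

int main() {
  // Logical array [7, 7, 7, 9, 9, 4] stored as 3 runs.
  RunEndEncoded arr{{3, 5, 6}, {7, 9, 4}};
  std::cout << arr.At(0) << " " << arr.At(4) << " " << arr.At(5) << "\n";  // 7 9 4
}

The same binary search also yields the run boundaries, which is what a compute engine would exploit to process whole runs at a time.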
To recap my understanding of the highlights of the discussion so far:

Encodings:
There are some concerns over the efficiency of some of the encodings in different scenarios:
* Eliding null values makes many algorithms less efficient.
* Joins might become harder with these encodings.
* The additional code complexity also came up on the Arrow sync call.

Compression:
* Buffer-level compression might be too small a granularity for data compression.
* General-purpose compression at this level might not add much value, so it might be better to keep it at the transport level.

Alternative designs:
* Put buffer-level compression in specific transports (e.g. Flight).
* Try to use the extension mechanism to support different encodings.

Thanks,
Micah

[1] https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815

On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou <anto...@python.org> wrote:
>
> I think it would be worthwhile to split the discussion into two separate threads. One thread for compression & encodings (which are related or even the same topic), one thread for data integrity.
>
> Regards
>
> Antoine.
>
>
> On 08/07/2019 at 07:22, Micah Kornfield wrote:
> >
> > - Compression:
> >   * Use parquet for random access to data elements.
> >     - This is one option; the main downside I see to this is generally higher encoding/decoding costs. Per below, I think it is reasonable to wait until we have more data to add compression into the spec.
> >   * Have the transport layer do buffer-specific compression:
> >     - I'm not a fan of this approach. One nice thing about the current communication protocols is that once you strip away "framing" data, all the byte streams are equivalent. I think the simplicity that follows in code from this is a nice feature.
> >
> >
> > *Computational efficiency of array encodings:*
> >
> >> How does "more efficient computation" play out for operations such as hash or join?
> >
> > You would still likely need to materialize rows in most cases. In some "join" cases the sparse encoding of the null bitmap buffer could be a win, because it serves as an index to the non-null values.
> >
> > I should clarify that whether these encodings are a win depends on workload and data shape, but they can have a large impact when used appropriately (especially at the "expression evaluation" stage). Also, any wins don't come for free; exploiting encodings properly will add some level of complication to existing computation code.
> >
> > On a packed sparse array representation:
> >
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but compacting null slots complicates more advanced parallel routines that execute independently and rely on indices aligning with an element's logical position.
> >
> > The main use-case I had in mind here was for scenarios like loading data directly from Parquet (i.e. nulls are already elided), doing some computation, and then potentially translating to a dense representation. Similarly, it appears others have seen an advantage in some contexts from saving time at shuffle [1]. In many cases there is an overlap with RLE, so I'd be open to removing this from the proposal.
> >
> >
> > *On buffer encodings:*
> > To paraphrase, the main concern here seems to be that this is similar to metadata that was already removed [2].
> >
> > A few points on this:
> > 1. There was a typo in the original e-mail on the sparse integer set encoding, where it said "all" values are either null or not null. This should have read "most" values. The elision of buffers is a separate feature.
> > 2. I believe these are different from the previous metadata because this isn't repetitive information. It provides new information about the contents of buffers that is not available anywhere else.
> > 3. The proposal is to create a new message type for this feature, so it wouldn't be bringing back the old code and hopefully would have minimal impact on already existing IPC code.
> >
> >
> > *On Compression:*
> > So far my take is that the consensus is that this can probably be applied at the transport level without being in the spec directly. There might be value in more specific types of compression at the buffer level, but we should benchmark them first.
> >
> > *Data Integrity/Digest:*
> >
> >> one question is whether this occurs at the table level, column level, sequential array level, etc.
> >
> > This is a good question; it seemed like the batch level was easiest, and that is why I proposed it, but I'd be open to other options. One nice thing about the batch level is that it works for all other message types out of the box (i.e. we can ensure the schema has been transmitted faithfully).
> >
> > Cheers,
> > Micah
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-5821
> > [2] https://github.com/apache/arrow/pull/1297/files
> > [3] https://jira.apache.org/jira/browse/ARROW-300
> >
> >
> > On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <ptaylor.apa...@gmail.com> wrote:
> >
> >> Hi Micah,
> >>
> >> Similar to Jacques, I'm not disagreeing, but wondering whether these belong in Arrow vs. being done externally. I'm mostly interested in changes that might impact SIMD processing, considering Arrow has already made conscious design decisions to trade memory for speed. Apologies in advance if I've misunderstood any of the proposals.
> >>
> >>> a. Add a run-length encoding scheme to efficiently represent repeated values (the actual scheme encodes run ends instead of length to preserve sub-linear random access).
> >>
> >> Couldn't one do RLE at the buffer level via a custom FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
> >>
> >>> b. Add a "packed" sparse representation (null values don't take up space in value buffers)
> >>
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but compacting null slots complicates more advanced parallel routines that execute independently and rely on indices aligning with an element's logical position.
> >>
> >> It sounds like here the logical position depends on knowing the number of nulls up to that point (via something like sequentially iterating both data and validity buffers). An efficient parallel routine would likely need to scan beforehand to inflate the packed representation, where today it can simply slice/mmap the data buffer directly.
> >>
> >>> a. Add frame of reference integer encoding [7] (this allows for lower bit-width encoding of integer types by subtracting a "reference" value from all values in the buffer).
> >>
> >> I agree this is useful, but couldn't it also live in userland/an ExtensionType?
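For concreteness, a minimal sketch of the frame-of-reference idea. The struct and names here are made up for illustration, not taken from the PR, and a real implementation would choose the delta width per buffer:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative only. Frame-of-reference stores a per-buffer "reference"
// (here the minimum) plus the offsets from it, which often fit in a
// narrower integer type than the original values.
struct FrameOfReference {
  int64_t reference;             // value subtracted from every element
  std::vector<uint16_t> deltas;  // narrow offsets from the reference
};

FrameOfReference Encode(const std::vector<int64_t>& values) {
  int64_t ref = *std::min_element(values.begin(), values.end());
  FrameOfReference out{ref, {}};
  out.deltas.reserve(values.size());
  for (int64_t v : values) {
    out.deltas.push_back(static_cast<uint16_t>(v - ref));  // assumes the range fits
  }
  return out;
}

int64_t Decode(const FrameOfReference& enc, size_t i) {
  return enc.reference + enc.deltas[i];  // O(1) random access is preserved
}

int main() {
  // 64-bit timestamps spanning a small range compress to 16-bit deltas.
  std::vector<int64_t> ts = {1562300000, 1562300007, 1562300012};
  FrameOfReference enc = Encode(ts);
  std::cout << Decode(enc, 2) << "\n";  // 1562300012
}

Note that unlike RLE this keeps O(1) element access; the costs are a pass to compute the reference and the need to fall back to a wider delta type when the value range doesn't fit.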
> >>> b. Add a sparse integer set encoding. This encoding allows more efficient encoding of validity bit-masks for cases when all values are either null or not null.
> >>
> >> If this is in reference to the discussion at link #4 [1], it sounds similar to the BufferLayout metadata that used to exist but was removed a while back [2]. Knowing the buffer layouts allows an implementation to generically elide any buffer at will, but would probably be a lot to bring back in. I can't say whether adding a different set of metadata would raise the same concerns Jacques mentioned in the JIRA thread in [2].
> >>
> >>> Data compression. Similar to encodings, but compression is solely for reduction of data at rest/on the wire. The proposal is to allow compression of individual buffers. Right now zstd is proposed, but I don't feel strongly on the specific technologies here.
> >>
> >> What's the goal for this? Random element access into compressed in-memory columns, or compression at I/O boundaries?
> >>
> >> * If the former, is Parquet a better alternative here? Again, I'm cautious about the impact on parallel routines. CPU speeds are plateauing while memory and tx/rx keep growing. Compressed element access seems to be on the CPU side of that equation (meanwhile parallel deflate already exists, and I remember seeing research into parallel inflate).
> >>
> >> * If the latter, could we do a comparison of Arrow dictionary-encoding + different compression formats vs. building them into the spec? I know content-aware compression yields significant size reductions, but I wonder if the maintenance burden on Arrow contributors is worth the cost vs. a simpler dictionary-encoding + streaming gzip.
> >>
> >>> Data Integrity. While the arrow file format isn't meant for archiving data, I think it is important to allow for optional native data integrity checks in the format. To this end, I proposed a new "Digest" message type that can be added after other messages to record a digest/hash of the preceding data. I suggested xxhash, but I don't have a strong opinion here, as long as there is some minimal support that can potentially be expanded later.
> >>
> >> :thumbs up:
> >>
> >>
> >> Best,
> >> Paul
> >>
> >>
> >> 1. https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>
> >> 2. https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
> >>
> >> On 7/5/19 11:53 AM, Micah Kornfield wrote:
> >>> Hi Arrow-dev,
> >>>
> >>> I'd like to make a straw-man proposal to cover some features that I think would be useful to Arrow, and that I would like to make a proof-of-concept implementation for in Java and C++. In particular, the proposal covers allowing for smaller data sizes via compression and encoding [1][2][8], data integrity [3] and avoiding unnecessary data transfer [4][5].
> >>>
> >>> I've put together a PR [6] that has proposed changes to the flatbuffer metadata to support the new features. The PR introduces:
> >>>
> >>> - A new "SparseRecordBatch" that can support one of multiple possible encodings (both dense and sparse), compression and column elision.
> >>> - A "Digest" message type to support optional data integrity.
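To make the per-buffer compression piece concrete, a minimal sketch against the zstd C API. The CompressedBuffer framing is made up for illustration (the PR's actual flatbuffer layout differs); it just records the uncompressed length so a reader can allocate before decompressing:

#include <zstd.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative only: compress a single Arrow buffer for the wire, keeping
// the uncompressed length alongside the zstd frame.
struct CompressedBuffer {
  size_t uncompressed_size;
  std::vector<uint8_t> data;  // zstd frame
};

CompressedBuffer CompressBuffer(const uint8_t* src, size_t n, int level = 3) {
  CompressedBuffer out{n, std::vector<uint8_t>(ZSTD_compressBound(n))};
  size_t written = ZSTD_compress(out.data.data(), out.data.size(), src, n, level);
  if (ZSTD_isError(written)) throw std::runtime_error(ZSTD_getErrorName(written));
  out.data.resize(written);
  return out;
}

std::vector<uint8_t> DecompressBuffer(const CompressedBuffer& in) {
  std::vector<uint8_t> out(in.uncompressed_size);
  size_t got = ZSTD_decompress(out.data(), out.size(), in.data.data(), in.data.size());
  if (ZSTD_isError(got) || got != in.uncompressed_size)
    throw std::runtime_error("corrupt buffer");
  return out;
}

int main() {
  std::vector<uint8_t> values(1000, 7);  // a highly compressible buffer
  CompressedBuffer c = CompressBuffer(values.data(), values.size());
  return DecompressBuffer(c) == values ? 0 : 1;
}

Framing each buffer independently like this is also what would let us benchmark the "too small a granularity" concern raised above.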
> >>> Going into more details on the specific features in the PR:
> >>>
> >>> 1. Sparse encodings for arrays and buffers. The guiding principle behind the suggested encodings is to support encodings that can be exploited by compute engines for more efficient computation (I don't think parquet-style bit-packing belongs in Arrow). While the encodings don't maintain O(1) data element access, they support sublinear, O(log(N)), element access. The suggested encodings are:
> >>>    1. Array encodings:
> >>>       a. Add a run-length encoding scheme to efficiently represent repeated values (the actual scheme encodes run ends instead of lengths to preserve sub-linear random access).
> >>>       b. Add a “packed” sparse representation (null values don’t take up space in value buffers).
> >>>    2. Buffer encodings:
> >>>       a. Add frame of reference integer encoding [7] (this allows for lower bit-width encoding of integer types by subtracting a “reference” value from all values in the buffer).
> >>>       b. Add a sparse integer set encoding. This encoding allows more efficient encoding of validity bit-masks for cases when all values are either null or not null.
> >>> 2. Data compression. Similar to encodings, but compression is solely for reduction of data at rest/on the wire. The proposal is to allow compression of individual buffers. Right now zstd is proposed, but I don’t feel strongly on the specific technologies here.
> >>> 3. Column Elision. For some use-cases, like structured logging, the overhead of including array metadata for columns with no data present is non-negligible. The proposal provides a mechanism for omitting metadata for such arrays.
> >>> 4. Data Integrity. While the arrow file format isn’t meant for archiving data, I think it is important to allow for optional native data integrity checks in the format. To this end, I proposed a new “Digest” message type that can be added after other messages to record a digest/hash of the preceding data. I suggested xxhash, but I don’t have a strong opinion here, as long as there is some minimal support that can potentially be expanded later.
> >>>
> >>> In the proposal I chose to use Tables and Unions everywhere for flexibility, but in all likelihood some could be replaced by enums.
> >>>
> >>> My initial plan would be to focus solely on an IPC mechanism that can send a SparseRecordBatch and immediately translate it to a normal RecordBatch in both Java and C++.
> >>>
> >>> As a practical matter the proposal represents a lot of work to get an MVP working in time for the 1.0.0 release (provided the changes are accepted by the community), so I'd greatly appreciate it if anyone wants to collaborate on this.
> >>>
> >>> If it is easier, I'm happy to start a separate thread per feature if people feel like that would make the conversation easier. I can also create a Google Doc for direct comments if that is preferred.
> >>>
> >>> Thanks,
> >>>
> >>> Micah
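And a minimal sketch of the digest idea from item 4 above, using xxhash's XXH64 as the example hash. The DigestMessage framing is again made up rather than taken from the PR; the point is only that a trailing message carries a hash of the preceding stream bytes:

#include <xxhash.h>

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative framing only: a digest record carrying the hash of all
// preceding message bytes, so a reader can verify the stream so far.
struct DigestMessage {
  uint64_t hash;  // XXH64 of every byte written since the last digest
};

DigestMessage MakeDigest(const std::vector<uint8_t>& preceding_bytes) {
  return DigestMessage{XXH64(preceding_bytes.data(), preceding_bytes.size(), /*seed=*/0)};
}

bool Verify(const std::vector<uint8_t>& received, const DigestMessage& d) {
  return XXH64(received.data(), received.size(), /*seed=*/0) == d.hash;
}

int main() {
  std::vector<uint8_t> payload = {1, 2, 3, 4};
  DigestMessage d = MakeDigest(payload);
  std::printf("ok=%d\n", Verify(payload, d) ? 1 : 0);
}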
> >>> P.S. In the interest of full disclosure, these ideas evolved in collaboration with Brian Hulette and other colleagues at Google who are interested in making use of Arrow in both internal and external projects.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/ARROW-300
> >>> [2] https://issues.apache.org/jira/browse/ARROW-5224
> >>> [3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> >>> [4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>> [5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> >>> [6] https://github.com/apache/arrow/pull/4815
> >>> [7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> >>> [8] https://issues.apache.org/jira/browse/ARROW-5821