I think it would be worthwhile to split the discussion into two separate threads: one for compression and encodings (which are related, or even the same, topics), and one for data integrity.
Regards,
Antoine.

On 08/07/2019 at 07:22, Micah Kornfield wrote:
> - Compression:
>   * Use parquet for random access to data elements.
>     - This is one option; the main downside I see is generally higher
>       encoding/decoding costs. Per below, I think it is reasonable to
>       wait until we have more data before adding compression to the
>       spec.
>   * Have the transport layer do buffer-specific compression:
>     - I'm not a fan of this approach. One nice thing about the
>       current communication protocols is that once you strip away
>       "framing" data, all the byte streams are equivalent. I think
>       the simplicity that follows from this in code is a nice
>       feature.
>
> *Computational efficiency of array encodings:*
>
>> How does "more efficient computation" play out for operations such
>> as hash or join?
>
> You would still likely need to materialize rows in most cases. In
> some "join" cases the sparse encoding of the null bitmap buffer
> could be a win because it serves as an index to non-null values.
>
> I should clarify that these encodings aren't always a win depending
> on workload/data shape, but they can have a large impact when used
> appropriately (especially at the "expression evaluation" stage).
> Also, any wins don't come for free: exploiting encodings properly
> will add some level of complication to existing computation code.
>
> On a packed sparse array representation:
>
>> This would be fine for simple SIMD aggregations like
>> count/avg/mean, but compacting null slots complicates more advanced
>> parallel routines that execute independently and rely on indices
>> aligning with an element's logical position.
>
> The main use-case I had in mind here was for scenarios like loading
> data directly from parquet (i.e. nulls are already elided), doing
> some computation, and then potentially translating to a dense
> representation. Similarly, it appears others have seen advantages in
> some contexts from saving time at shuffle [1].
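To illustrate the point above that in a packed representation the validity bitmap can double as an index to non-null values, here is a minimal Python sketch. This is not Arrow's actual layout or API; `pack` and `logical_get` are hypothetical names invented for the example.

```python
def pack(values, validity):
    """Elide null slots: keep only values whose validity bit is set."""
    return [v for v, ok in zip(values, validity) if ok]

def logical_get(packed, validity, i):
    """Fetch logical element i, or None if it is null.

    The physical index is the count of valid slots before position i,
    which is why the validity bitmap itself acts as an index into the
    packed value buffer.
    """
    if not validity[i]:
        return None
    return packed[sum(validity[:i])]

validity = [1, 0, 1, 1, 0]
packed = pack([10, None, 20, 30, None], validity)
assert packed == [10, 20, 30]            # nulls take no space
assert logical_get(packed, validity, 3) == 30
assert logical_get(packed, validity, 1) is None
```

A real implementation would use popcount over bitmap words (or a precomputed prefix sum) rather than the linear `sum` shown here, which is exactly the pre-scan cost Paul raises below.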
> In many cases there is an overlap with RLE, so I'd be open to
> removing this from the proposal.
>
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be that it is similar
> to metadata that was already removed [2].
>
> A few points on this:
> 1. There was a typo in the original e-mail on sparse-integer-set
>    encoding where it said "all" values are either null or not null.
>    This should have read "most" values. The elision of buffers is a
>    separate feature.
> 2. I believe these are different than the previous metadata because
>    this isn't repetitive information. It provides new information
>    about the contents of buffers not available anywhere else.
> 3. The proposal is to create a new message type for this feature, so
>    it wouldn't bring back the old code and hopefully would have
>    minimal impact on existing IPC code.
>
> *On compression:*
> So far my take is that the consensus is that this can probably be
> applied at the transport level without being in the spec directly.
> There might be value in more specific types of compression at the
> buffer level, but we should benchmark them first.
>
> *Data integrity/digest:*
>
>> one question is whether this occurs at the table level, column
>> level, sequential array level, etc.
>
> This is a good question. It seemed like the batch level was easiest,
> and that is why I proposed it, but I'd be open to other options. One
> nice thing about the batch level is that it works for all other
> message types out of the box (i.e. we can ensure the schema has been
> transmitted faithfully).
>
> Cheers,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
>
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <ptaylor.apa...@gmail.com>
> wrote:
>
>> Hi Micah,
>>
>> Similar to Jacques I'm not disagreeing, but wondering if they belong
>> in Arrow vs. can be done externally.
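The batch-level digest Micah describes can be sketched in a few lines. The proposal suggests xxhash; sha256 is substituted here purely because it ships with Python's standard library, and the framing (a digest message trailing the bytes of the preceding message) is a simplified stand-in, not the proposed flatbuffer schema.

```python
import hashlib

def digest_message(frame_bytes: bytes) -> bytes:
    """Hash the serialized bytes of the preceding message.

    Works uniformly for any message type (record batch, schema, ...),
    which is the 'out of the box' property of hashing at this level.
    """
    return hashlib.sha256(frame_bytes).digest()

def verify(frame_bytes: bytes, expected: bytes) -> bool:
    return digest_message(frame_bytes) == expected

batch = b"\x00" * 64              # stand-in for a serialized message
tag = digest_message(batch)
assert verify(batch, tag)
assert not verify(batch + b"\x01", tag)   # corruption is detected
```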
>> I'm mostly interested in changes that might impact SIMD processing,
>> considering Arrow's already made conscious design decisions to trade
>> memory for speed. Apologies in advance if I've misunderstood any of
>> the proposals.
>>
>>> a. Add a run-length encoding scheme to efficiently represent
>>> repeated values (the actual scheme encodes run ends instead of
>>> lengths to preserve sub-linear random access).
>>
>> Couldn't one do RLE at the buffer level via a custom
>> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new
>> ExtensionType?
>>
>>> b. Add a “packed” sparse representation (null values don’t take up
>>> space in value buffers)
>>
>> This would be fine for simple SIMD aggregations like
>> count/avg/mean, but compacting null slots complicates more advanced
>> parallel routines that execute independently and rely on indices
>> aligning with an element's logical position.
>>
>> It sounds like here the logical position depends on knowing the
>> number of nulls up to that point (via something like sequentially
>> iterating both data and validity buffers). An efficient parallel
>> routine would likely need to scan beforehand to inflate the packed
>> representation, where today it can simply slice/mmap the data
>> buffer directly.
>>
>>> a. Add frame-of-reference integer encoding [7] (this allows for
>>> lower bit-width encoding of integer types by subtracting a
>>> “reference” value from all values in the buffer).
>>
>> I agree this is useful, but couldn't it also live in userland/an
>> ExtensionType?
>>
>>> b. Add a sparse integer set encoding. This encoding allows more
>>> efficient encoding of validity bit-masks for cases when all values
>>> are either null or not null.
>>
>> If this is in reference to the discussion at link #4 [1], it sounds
>> similar to the BufferLayout metadata that used to exist but was
>> removed a while back [2].
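The frame-of-reference encoding discussed above can be sketched as follows. This is an illustration with hypothetical names (`for_encode`, `for_get`), not the proposed wire format: the reference (here, the minimum) is stored once, and the remaining offsets fit in a much narrower integer width while preserving O(1) element access.

```python
def for_encode(values):
    """Frame-of-reference: one reference value plus small offsets."""
    ref = min(values)
    offsets = [v - ref for v in values]
    # Bit width needed for the offsets; a real encoder would bit-pack.
    width_bits = max(offsets).bit_length() or 1
    return ref, width_bits, offsets

def for_get(ref, offsets, i):
    # Random access stays O(1): decode a single element on demand.
    return ref + offsets[i]

ref, width, offsets = for_encode([1_000_003, 1_000_001, 1_000_007])
assert width <= 3        # offsets 0..6 need 3 bits instead of 32+
assert for_get(ref, offsets, 2) == 1_000_007
```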
>> Knowing the buffer layouts allows an implementation to generically
>> elide any buffer at will, but would probably be a lot to bring back
>> in. I can't say whether adding a different set of metadata would
>> raise the same concerns Jacques mentioned in the JIRA thread in
>> [2].
>>
>>> Data compression. Similar to encodings, but compression is solely
>>> for reduction of data at rest/on the wire. The proposal is to
>>> allow compression of individual buffers. Right now zstd is
>>> proposed, but I don’t feel strongly on the specific technologies
>>> here.
>>
>> What's the goal for this? Random element access into compressed
>> in-memory columns, or compression at I/O boundaries?
>>
>> * If the former, is Parquet a better alternative here? Again, I'm
>>   cautious about the impact on parallel routines. CPU speeds are
>>   plateauing while memory and tx/rx keep growing. Compressed
>>   element access seems to be on the CPU side of that equation
>>   (meanwhile, parallel deflate already exists, and I remember
>>   seeing research into parallel inflate).
>>
>> * If the latter, could we do a comparison of Arrow dictionary
>>   encoding + different compression formats vs. building them into
>>   the spec? I know content-aware compression yields significant
>>   size reductions, but I wonder whether the maintenance burden on
>>   Arrow contributors is worth the cost vs. a simpler dictionary
>>   encoding + streaming gzip.
>>
>>> Data integrity. While the arrow file format isn’t meant for
>>> archiving data, I think it is important to allow for optional
>>> native data integrity checks in the format. To this end, I
>>> proposed a new “Digest” message type that can be added after other
>>> messages to record a digest/hash of the preceding data. I
>>> suggested xxhash, but I don’t have a strong opinion here, as long
>>> as there is some minimal support that can potentially be expanded
>>> later.
>>
>> :thumbs up:
>>
>> Best,
>> Paul
>>
>> 1.
>> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>
>> 2.
>> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
>>
>> On 7/5/19 11:53 AM, Micah Kornfield wrote:
>>> Hi Arrow-dev,
>>>
>>> I’d like to make a straw-man proposal to cover some features that
>>> I think would be useful to Arrow, and that I would like to make a
>>> proof-of-concept implementation for in Java and C++. In
>>> particular, the proposal covers allowing for smaller data sizes
>>> via compression and encoding [1][2][8], data integrity [3], and
>>> avoiding unnecessary data transfer [4][5].
>>>
>>> I’ve put together a PR [6] that has proposed changes to the
>>> flatbuffer metadata to support the new features. The PR
>>> introduces:
>>>
>>> - A new “SparseRecordBatch” that can support one of multiple
>>>   possible encodings (both dense and sparse), compression, and
>>>   column elision.
>>> - A “Digest” message type to support optional data integrity.
>>>
>>> Going into more detail on the specific features in the PR:
>>>
>>> 1. Sparse encodings for arrays and buffers. The guiding principle
>>>    behind the suggested encodings is to support encodings that can
>>>    be exploited by compute engines for more efficient computation
>>>    (I don’t think parquet-style bit-packing belongs in Arrow).
>>>    While the encodings don’t maintain O(1) data element access,
>>>    they support sublinear, O(log(N)), element access. The
>>>    suggested encodings are:
>>>    Array encodings:
>>>    a. Add a run-length encoding scheme to efficiently represent
>>>       repeated values (the actual scheme encodes run ends instead
>>>       of lengths to preserve sub-linear random access).
>>>    b. Add a “packed” sparse representation (null values don’t take
>>>       up space in value buffers).
>>>    Buffer encodings:
>>>    a. Add frame-of-reference integer encoding [7] (this allows
>>>       for lower bit-width encoding of integer types by subtracting
>>>       a “reference” value from all values in the buffer).
>>>    b. Add a sparse integer set encoding. This encoding allows more
>>>       efficient encoding of validity bit-masks for cases when all
>>>       values are either null or not null.
>>> 2. Data compression. Similar to encodings, but compression is
>>>    solely for reduction of data at rest/on the wire. The proposal
>>>    is to allow compression of individual buffers. Right now zstd
>>>    is proposed, but I don’t feel strongly on the specific
>>>    technologies here.
>>> 3. Column elision. For some use-cases, like structured logging,
>>>    including array metadata for columns with no data present
>>>    represents non-negligible overhead. The proposal provides a
>>>    mechanism for omitting metadata for such arrays.
>>> 4. Data integrity. While the arrow file format isn’t meant for
>>>    archiving data, I think it is important to allow for optional
>>>    native data integrity checks in the format. To this end, I
>>>    proposed a new “Digest” message type that can be added after
>>>    other messages to record a digest/hash of the preceding data. I
>>>    suggested xxhash, but I don’t have a strong opinion here, as
>>>    long as there is some minimal support that can potentially be
>>>    expanded later.
>>>
>>> In the proposal I chose to use Tables and Unions everywhere for
>>> flexibility, but in all likelihood some could be replaced by
>>> enums.
>>>
>>> My initial plan would be to solely focus on an IPC mechanism that
>>> can send a SparseRecordBatch and immediately translate it to a
>>> normal RecordBatch in both Java and C++.
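The run-end scheme in item 1a, and the O(log N) random access it preserves, can be sketched as follows. This is an illustrative Python model with hypothetical names, not the proposed layout: run ends are cumulative element counts, so locating the run containing logical index i is a binary search.

```python
import bisect

def rle_encode(values):
    """Encode as (run_ends, run_values); run ends are cumulative,
    which is what makes sub-linear random access possible."""
    run_ends, run_values = [], []
    for v in values:
        if run_values and run_values[-1] == v:
            run_ends[-1] += 1            # extend the current run
        else:
            run_ends.append((run_ends[-1] if run_ends else 0) + 1)
            run_values.append(v)
    return run_ends, run_values

def rle_get(run_ends, run_values, i):
    # First run whose end exceeds i: O(log N) via binary search.
    return run_values[bisect.bisect_right(run_ends, i)]

ends, vals = rle_encode([7, 7, 7, 9, 9, 4])
assert (ends, vals) == ([3, 5, 6], [7, 9, 4])
assert [rle_get(ends, vals, i) for i in range(6)] == [7, 7, 7, 9, 9, 4]
```

Storing ends rather than lengths trades nothing at encode time but avoids the prefix-sum scan that run *lengths* would require on every random access.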
>>> As a practical matter the proposal represents a lot of work to
>>> get an MVP working in time for the 1.0.0 release (provided the
>>> features are accepted by the community), so I'd greatly appreciate
>>> it if anyone wants to collaborate on this.
>>>
>>> If it is easier, I’m happy to start a separate thread per feature
>>> if people feel that would make the conversation easier. I can also
>>> create a Google Doc for direct comments if that is preferred.
>>>
>>> Thanks,
>>>
>>> Micah
>>>
>>> P.S. In the interest of full disclosure, these ideas evolved in
>>> collaboration with Brian Hulette and other colleagues at Google
>>> who are interested in making use of Arrow in both internal and
>>> external projects.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-300
>>> [2] https://issues.apache.org/jira/browse/ARROW-5224
>>> [3] https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>>> [4] https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>> [5] https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>>> [6] https://github.com/apache/arrow/pull/4815
>>> [7] https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>>> [8] https://issues.apache.org/jira/browse/ARROW-5821
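The comparison Paul asks for earlier in the thread (dictionary encoding plus a generic compressor vs. compressing raw values) could be prototyped along these lines. Everything here is invented for illustration: zlib stands in for gzip/zstd because it is in the standard library, JSON stands in for a real serialization, and the column is synthetic.

```python
import json
import zlib

# Hypothetical repetitive column, e.g. structured-log severity levels.
column = ["INFO", "WARN", "INFO", "INFO", "ERROR"] * 2000

# Baseline: compress the serialized raw values directly.
plain = zlib.compress(json.dumps(column).encode())

# Dictionary-encoded: unique values once, plus small integer codes.
dictionary = sorted(set(column))
codes = [dictionary.index(v) for v in column]
encoded = zlib.compress(json.dumps([dictionary, codes]).encode())

# Round-trip both ways to confirm nothing is lost; comparing
# len(plain) vs. len(encoded) over real datasets would be the actual
# benchmark.
d, c = json.loads(zlib.decompress(encoded))
assert [d[i] for i in c] == column
assert json.loads(zlib.decompress(plain)) == column
```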