Great John,

I'd be interested to hear about progress. Also, IMO we should only focus on encodings that have the potential to be exploited for computational benefits (not just compressibility). I think this is what distinguishes Arrow from other formats like Parquet, and it echoes some others' sentiments on this thread.

Cheers,
Micah
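As a rough illustration of the kind of computational benefit meant here: with a dictionary-encoded string column, an equality predicate can be resolved to an integer code once and then evaluated as integer comparisons over the indices, without materializing any string values. The sketch below uses the Arrow C++ dictionary API with made-up data (the "AAPL"/"GOOG"/"MSFT" values and the predicate are placeholders) and assumes a reasonably recent Arrow C++ release; it is only an illustration of the idea, not something proposed in this thread.

// Sketch: evaluate `column == "GOOG"` on a dictionary-encoded string column
// by comparing integer codes; the per-row loop never touches string bytes.
#include <arrow/api.h>
#include <arrow/result.h>
#include <iostream>
#include <vector>

arrow::Status RunExample() {
  // Dictionary: ["AAPL", "GOOG", "MSFT"]; codes: [0, 1, 1, 2, 0]
  arrow::StringBuilder dict_builder;
  std::vector<std::string> dict_values = {"AAPL", "GOOG", "MSFT"};
  ARROW_RETURN_NOT_OK(dict_builder.AppendValues(dict_values));
  std::shared_ptr<arrow::Array> dict;
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&dict));

  arrow::Int32Builder index_builder;
  std::vector<int32_t> index_values = {0, 1, 1, 2, 0};
  ARROW_RETURN_NOT_OK(index_builder.AppendValues(index_values));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  ARROW_ASSIGN_OR_RAISE(
      auto encoded,
      arrow::DictionaryArray::FromArrays(
          arrow::dictionary(arrow::int32(), arrow::utf8()), indices, dict));
  auto dict_array = std::static_pointer_cast<arrow::DictionaryArray>(encoded);

  // Resolve the predicate value to a dictionary code once...
  const auto& values =
      static_cast<const arrow::StringArray&>(*dict_array->dictionary());
  int32_t target_code = -1;
  for (int64_t i = 0; i < values.length(); ++i) {
    if (values.GetView(i) == "GOOG") target_code = static_cast<int32_t>(i);
  }

  // ...then the per-row work is a plain integer comparison over the codes.
  const auto& codes =
      static_cast<const arrow::Int32Array&>(*dict_array->indices());
  int64_t matches = 0;
  for (int64_t i = 0; i < codes.length(); ++i) {
    if (codes.Value(i) == target_code) ++matches;
  }
  std::cout << "matches: " << matches << std::endl;  // prints "matches: 2"
  return arrow::Status::OK();
}

int main() { return RunExample().ok() ? 0 : 1; }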
On Fri, Jan 24, 2020 at 8:28 AM John Muehlhausen <[email protected]> wrote:

> Thanks Micah, I will see if I can find some time to explore this further.
>
> On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi John,
>> Not Wes, but my thoughts on this are as follows:
>>
>> 1. Alternate bit/byte arrangements can also be useful for processing [1], in addition to compression.
>> 2. I think they are quite a bit more complicated than the existing schemes proposed in [2], so I think it would be more expedient to get the integration hooks necessary to work with simpler encodings before going with something more complex. I believe the proposal is generic enough to support this type of encoding.
>> 3. For prototyping, this seems like a potential use of the ExtensionType [3] mechanism already in the specification.
>> 4. I don't think these should be new types or part of the basic Array data structure. I think having a different container format in the form of "SparseRecordBatch" (or perhaps it should be renamed to "EncodedRecordBatch") and keeping the existing types with alternate encodings is a better option.
>>
>> That being said, if you have bandwidth to get this working for C++ and Java, we can potentially set up a separate development branch to see how it evolves. Personally, I've not brought my proposal up for discussion again because I haven't had bandwidth to work on it, but I still think introducing some level of alternate encodings is a good idea.
>>
>> Cheers,
>> Micah
>>
>> [1] https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
>> [2] https://github.com/apache/arrow/pull/4815
>> [3] https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>>
>> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <[email protected]> wrote:
>>
>> > Wes, what do you think about Arrow supporting a new suite of fixed-length data types that unshuffle on column->Value(i) calls? This would allow memory/swap compressors and memory maps backed by compressing filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>> >
>> > By doing it with new data types, there is no separate flag to check?
>> >
>> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <[email protected]> wrote:
>> >
>> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <[email protected]> wrote:
>> > > >
>> > > > Again, I know very little about Parquet, so your patience is appreciated.
>> > > >
>> > > > At the moment I can Arrow/mmap a file without having anywhere near as much available memory as the file size. I can visit random places in the file (such as a binary search if it is ordered), and only the locations visited by column->Value(i) are paged in. Paging them out happens without my awareness, if necessary.
>> > > >
>> > > > Does Parquet cover this use case with the same elegance and at least equal efficiency, or are there more copies/conversions? Perhaps it requires the entire file to be transformed into Arrow memory at the beginning? Or on a batch/block basis? Or do I need to use a non-Arrow API for data element access? Etc.
>> > >
>> > > Data has to be materialized / deserialized from the Parquet file on a batch-wise, per-column basis. The APIs we provide allow batches of values to be read for a given subset of columns.
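A minimal sketch of that Parquet read path using the Arrow C++ Parquet bindings; the file name ("example.parquet") and the column indices are placeholders, error handling is reduced to status propagation, and it assumes a reasonably recent Arrow/parquet-cpp build.

// Sketch: read only a subset of columns, one row group (batch) at a time,
// from a Parquet file into Arrow memory. Values are decoded/materialized
// here rather than memory-mapped in place.
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <iostream>
#include <memory>
#include <vector>

arrow::Status ReadSubset() {
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  // Only columns 0 and 3 are decoded; other columns are never touched.
  std::vector<int> columns = {0, 3};
  for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(rg, columns, &table));
    std::cout << "row group " << rg << ": " << table->num_rows()
              << " rows materialized" << std::endl;
  }
  return arrow::Status::OK();
}

int main() { return ReadSubset().ok() ? 0 : 1; }

Each ReadRowGroup call decodes the requested columns of one row group into Arrow memory, which is the batch-wise, per-column materialization described above.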
>> > > >
>> > > > IFF it covers the above use case, which does not mention compression or encoding, then I could consider whether it is interesting on those points.
>> > >
>> > > My point really has to do with Parquet's design, which is about reducing file size. In the following blog post
>> > >
>> > > https://ursalabs.org/blog/2019-10-columnar-perf/
>> > >
>> > > I examined a dataset which is about 4GB as a raw Arrow stream/file but only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal if you are working with filesystems that yield < 500MB/s (which includes pretty much all cloud filesystems AFAIK). In clickstream analytics this kind of compression ratio is not unusual.
>> > > >
>> > > > -John
>> > > >
>> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <[email protected]> wrote:
>> > > >
>> > > > > What's the point of having zero copy if the OS is doing the decompression in the kernel (which trumps the zero-copy argument)? You might as well just use Parquet without filesystem compression. I would prefer compression algorithms that the columnar engine itself can benefit from [1] over marginally improving a filesystem/OS-specific feature.
>> > > > >
>> > > > > François
>> > > > >
>> > > > > [1] Section 4.3, http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>> > > > >
>> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <[email protected]> wrote:
>> > > > >
>> > > > > > This could also have utility in memory via things like zram/zswap, right? Mac also has a memory compressor?
>> > > > > >
>> > > > > > I don't think Parquet is an option for me unless the integration with Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess I know next to nothing about Parquet.
>> > > > > >
>> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <[email protected]> wrote:
>> > > > > >
>> > > > > > > On 23/01/2020 at 18:16, John Muehlhausen wrote:
>> > > > > > > >
>> > > > > > > > Perhaps related to this thread, are there any current or proposed tools to transform columns of fixed-length data types according to a "shuffle"? For precedent, see the implementation of the shuffle filter in HDF5:
>> > > > > > > >
>> > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>> > > > > > > >
>> > > > > > > > For example, a column of length 3 would store the bytes 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01, 00 00 00 02, 00 00 00 03 (I'm writing big-endian even if that is not actually the case).
>> > > > > > > >
>> > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag indicating that the column is shuffled, stitching the bytes back together at call time.
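A standalone C++ sketch of that byte-shuffle layout and the call-time "stitching", using the three-value example above. This is plain C++ rather than any existing Arrow API, and it operates on native byte order (the big-endian notation above is only for illustration).

// Sketch: HDF5-style byte shuffle for a fixed-width column, plus a
// Value(i)-style accessor that stitches an element back together at
// call time, as described in the message above.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Rearrange n fixed-width values so byte 0 of every value comes first,
// then byte 1 of every value, and so on (longer runs for compressors).
template <typename T>
std::vector<uint8_t> Shuffle(const std::vector<T>& values) {
  const size_t n = values.size(), w = sizeof(T);
  const auto* src = reinterpret_cast<const uint8_t*>(values.data());
  std::vector<uint8_t> shuffled(n * w);
  for (size_t i = 0; i < n; ++i)
    for (size_t b = 0; b < w; ++b)
      shuffled[b * n + i] = src[i * w + b];
  return shuffled;
}

// Value(i) against the shuffled buffer: gather byte b of element i from
// byte plane b. This gather is the "stitching at call time" cost.
template <typename T>
T ShuffledValue(const std::vector<uint8_t>& shuffled, size_t n, size_t i) {
  uint8_t raw[sizeof(T)];
  for (size_t b = 0; b < sizeof(T); ++b) raw[b] = shuffled[b * n + i];
  T value;
  std::memcpy(&value, raw, sizeof(T));
  return value;
}

int main() {
  std::vector<int32_t> column = {1, 2, 3};  // the example from the thread
  std::vector<uint8_t> shuffled = Shuffle(column);
  // Value(1) reconstructs the second element from the byte planes.
  std::cout << ShuffledValue<int32_t>(shuffled, column.size(), 1)
            << std::endl;  // prints 2
  return 0;
}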
>> > > > > > > >
>> > > > > > > > Thus, if the column pages were backed by a memory map to something like zfs/gzip-9 (my actual use case), one would expect approximately 30% savings in underlying disk usage due to better run lengths.
>> > > > > > > >
>> > > > > > > > It would enable a space/time tradeoff that could be useful? The filesystem itself cannot easily do this particular compression transform, since the transform benefits from knowing the shape of the data.
>> > > > > > >
>> > > > > > > For the record, there's a pull request adding this encoding to the Parquet C++ specification.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > >
>> > > > > > > Antoine.
