Great John, I'd be interested to hear about progress.
Also, IMO we should focus only on encodings that have the
potential to be exploited for computational benefit (not just
compressibility). I think this is what distinguishes Arrow from other
formats like Parquet.
Thanks Micah, I will see if I can find some time to explore this further.
On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield wrote:
Hi John,
Not Wes, but my thoughts on this are as follows:
1. Alternate bit/byte arrangements can also be useful for processing [1] in
addition to compression.
2. I think they are quite a bit more complicated than the existing schemes
proposed in [2], so I think it would be more expedient to get the existing
proposal in first.
Wes, what do you think about Arrow supporting a new suite of fixed-length
data types that unshuffle on column->Value(i) calls? This would allow
memory/swap compressors and memory maps backed by compressing
filesystems (ZFS) or block devices (VDO) to operate more efficiently.
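For illustration, here is a rough numpy sketch of what "unshuffle on
Value(i)" could mean; this is not an Arrow API, and shuffle/unshuffle/
value_at are hypothetical helper names. The buffer is stored byte-shuffled
(all first bytes together, then all second bytes, ...), which compresses
well, while a single value can still be reconstructed by gathering one byte
from each plane:

import numpy as np

def shuffle(vals):
    # Byte-shuffle: group byte 0 of every value, then byte 1, etc.
    width = vals.dtype.itemsize
    return vals.view(np.uint8).reshape(-1, width).T.copy().ravel()

def unshuffle(raw, dtype):
    # Invert shuffle(): re-interleave the byte planes.
    width = np.dtype(dtype).itemsize
    return raw.reshape(width, -1).T.copy().ravel().view(dtype)

def value_at(raw, dtype, i):
    # Hypothetical column->Value(i) against a shuffled buffer:
    # gather byte j of value i from plane j, leaving the rest untouched.
    width = np.dtype(dtype).itemsize
    n = raw.size // width
    return raw.reshape(width, n)[:, i].copy().view(dtype)[0]

vals = np.arange(1000, dtype=np.int64)
raw = shuffle(vals)
assert value_at(raw, np.int64, 42) == vals[42]
assert np.array_equal(unshuffle(raw, np.int64), vals)

A Value(i) call then costs `width` scattered byte reads instead of one
contiguous read, which is the trade-off such a type would make.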
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote:
Again, I know very little about Parquet, so your patience is appreciated.
At the moment I can Arrow/mmap a file without having anywhere nearly as
much available memory as the file size. I can visit random places in the
file (such as a binary search if it is ordered) and only the locations
visited are paged into memory.
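A minimal pyarrow sketch of this access pattern (the file name and column
are invented; the IPC and memory-map calls are standard pyarrow):

import pyarrow as pa

# Write a small Arrow IPC file so the example is self-contained.
table = pa.table({"key": pa.array(range(1_000_000), type=pa.int64())})
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map it: pages are faulted in lazily, so opening the file
# needs far less RAM than the file size.
with pa.memory_map("data.arrow", "r") as source:
    reader = pa.ipc.open_file(source)
    keys = reader.get_batch(0).column(0)  # zero-copy view over the mapping
    # A binary search over the sorted column touches only O(log n) pages.
    lo, hi = 0, len(keys) - 1
    target = 123_456
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid].as_py() < target:
            lo = mid + 1
        else:
            hi = mid
    assert keys[lo].as_py() == target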
Parquet is most relevant in scenarios where filesystem IO is constrained
(spinning-rust HDDs, network filesystems, cloud storage / S3 / GCS). For
those use cases memory-mapped Arrow is not viable.
Against local NVMe (> 2000 MB/s read throughput) your mileage may vary.
On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques wrote:
What's the point of having zero-copy if the OS is doing the
decompression in the kernel (which defeats the zero-copy argument)? You
might as well just use Parquet without filesystem compression. I would
rather have a compression algorithm the columnar engine can benefit
from [1].
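A toy example of the kind of engine-visible benefit meant here, assuming
dictionary encoding as the lightweight scheme (the data is invented): the
predicate is evaluated once per dictionary entry rather than once per row.

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# A low-cardinality string column, dictionary-encoded.
arr = pa.array(["us", "eu", "us", "apac", "us", "eu"] * 100_000).dictionary_encode()

# Evaluate `value == "us"` against the 3-entry dictionary, then map the
# result over the integer indices -- no per-row string comparison.
dict_match = np.asarray(pc.equal(arr.dictionary, "us"))
mask = dict_match[np.asarray(arr.indices)]
print(mask.sum())  # 300000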
This could also have utility in memory via things like zram/zswap, right?
macOS also has a memory compressor?
I don't think Parquet is an option for me unless the integration with Arrow
is tighter than I imagine (i.e. zero-copy). That said, I confess I know
next to nothing about Parquet.
Forgot to give the URL:
https://github.com/apache/arrow/pull/6005
Regards
Antoine.
On 23/01/2020 18:23, Antoine Pitrou wrote:
On 23/01/2020 18:16, John Muehlhausen wrote:
Perhaps related to this thread, are there any current or proposed tools to
transform columns for fixed-length data types according to a "shuffle"?
For precedent see the implementation of the shuffle filter in hdf5.
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-alg
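A rough numpy rendition of that shuffle idea, with zlib standing in for
whatever codec follows the filter; the large size difference on slowly
varying values is the point:

import zlib
import numpy as np

# Slowly varying 64-bit values: the high bytes are nearly constant.
vals = np.arange(100_000, dtype=np.int64) + 10_000_000
plain = vals.tobytes()
# HDF5-style shuffle: regroup the buffer into per-byte planes.
shuffled = vals.view(np.uint8).reshape(-1, 8).T.tobytes()
print(len(zlib.compress(plain)), len(zlib.compress(shuffled)))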
Hi Ippokratis,
Thank you for the feedback, I have some questions based on the links you
provided.
I think that lightweight encodings (like the FrameOfReference Micah
suggests) do make a lot of sense for Arrow. There are a few implementations
of those in commercial systems. One related paper in the literature is
http://www.cs.columbia.edu/~orestis/damon15.pdf
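A minimal frame-of-reference sketch (the function names are mine, not from
the paper; it assumes each block's value range fits in 16 bits): values are
stored as a reference plus narrow offsets, and comparison predicates can be
rewritten to run on the offsets directly.

import numpy as np

def for_encode(vals):
    # Store a per-block reference (the minimum) plus narrow offsets.
    ref = vals.min()
    return ref, (vals - ref).astype(np.uint16)  # assumes range < 2**16

def for_decode(ref, offsets):
    return ref + offsets.astype(np.int64)

vals = np.random.randint(10**9, 10**9 + 50_000, size=100_000, dtype=np.int64)
ref, offsets = for_encode(vals)
assert np.array_equal(for_decode(ref, offsets), vals)
# Predicates can run on the narrow offsets (assuming ref <= C):
# vals > C  <=>  offsets > C - ref.
C = 10**9 + 25_000
assert np.array_equal(vals > C, offsets > np.uint16(C - ref))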
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data. Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them. If we start adding sophisticated encodings, we also raise the
> cost of supporting the format.
On Mon, 22 Jul 2019 08:40:08 -0700, Brian Hulette wrote:
To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious if there are any more objections to that
specifically. So far I believe the only one is that it will make
computation libraries more complicated. This is absolutely true, but I
think it's worth the trade-off.
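For a sense of what sparse encodings buy, a toy numpy illustration (zero
stands in for null, and the layout is invented, not the proposal's exact
format):

import numpy as np

# Dense: one slot per row, almost all of them "null" (zero here).
n = 1_000_000
dense = np.zeros(n)
dense[[10, 5_000, 999_999]] = [1.5, 2.5, 3.5]

# Sparse: keep only the non-null positions and values.
positions = np.flatnonzero(dense)   # 3 indices
values = dense[positions]           # 3 values
# ~8 MB of dense storage becomes ~48 bytes of payload, and a kernel
# like sum() can run over `values` alone.
assert dense.sum() == values.sum()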
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou wrote:
On Fri, 12 Jul 2019 20:37:15 -0700, Micah Kornfield wrote:
> > If the latter, I wonder why Parquet cannot simply be used instead of
> > reinventing something similar but different.
>
> This is a reasonable point. However, there is a continuum here between
> file size and read and write times.
Hi Antoine,
I think Liya Fan raised some good points in his reply but I'd like to
answer your questions directly.
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
@Antoine Pitrou,
Good question. I think the answer depends on the concrete encoding scheme.
For some encoding schemes, it is not a good idea to use them for in-memory
data compression.
For others, it is beneficial to operate directly on the compressed data.
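For instance, with run-length encoding (used here purely as an
illustration, in a toy numpy sketch), an aggregate can be computed one run
at a time instead of one row at a time:

import numpy as np

# Run-length encoding: (value, run_length) pairs.
values = np.array([7, 3, 7, 9], dtype=np.int64)
lengths = np.array([1000, 500, 2000, 1], dtype=np.int64)

# Aggregate on the compressed form: one multiply-add per run
# instead of one add per row.
total = int(np.dot(values, lengths))
assert total == np.repeat(values, lengths).sum()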
On 12/07/2019 10:08, Micah Kornfield wrote:
OK, I've created a separate thread for data integrity/digests [1], and
retitled this thread to continue the discussion on compression and
encodings. As a reminder, the PR for the format additions [2] suggested a
new SparseRecordBatch that would allow for the following features:
1. Different data encodings