It seems like there is reasonable consensus in the PR. If there are no
further comments I'll start a vote about this within the next several
days.
On Mon, Apr 6, 2020 at 10:55 PM Wes McKinney wrote:
>
> I updated the Format proposal again, please have a look
>
> https://github.com/apache/arrow/pul
I updated the Format proposal again, please have a look
https://github.com/apache/arrow/pull/6707
On Wed, Apr 1, 2020 at 10:15 AM Wes McKinney wrote:
>
> For uncompressed, memory mapping is disabled, so all of the bytes are
> being read into RAM. I wanted to show that even when your IO pipe is
>
For uncompressed, memory mapping is disabled, so all of the bytes are
being read into RAM. I wanted to show that even when your IO pipe is
very fast (with an NVMe SSD like mine, > 1 GB/s reading from disk),
you can still load faster with compressed files.
Here were the prior Read
The read times are still with memory mapping for the uncompressed case?
If so, impressive!
Regards
Antoine.
On 01/04/2020 at 16:44, Wes McKinney wrote:
> Several pieces of work got done in the last few days:
>
> * Changing from LZ4 raw to LZ4 frame format (what is recommended for
> intero
Several pieces of work got done in the last few days:
* Changing from LZ4 raw to LZ4 frame format (what is recommended for
interoperability)
* Parallelizing both compression and decompression at the field level
Here are the results (using 8 threads on an 8-core laptop). I disabled
the "memory map
Here are the results:
File size: https://ibb.co/71sBsg3
Read time: https://ibb.co/4ZncdF8
Write time: https://ibb.co/xhNkRS2
Code:
https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
(based on https://github.com/apache/arrow/pull/6694)
High level summa
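For anyone who wants to try a simplified version of this comparison themselves, here is a minimal pyarrow sketch (the sample data and file names below are placeholders, not the notebook's actual code):

import time
import numpy as np
import pandas as pd
import pyarrow.feather as feather

# Placeholder data; the numbers above come from the Taxi and Fannie Mae datasets
df = pd.DataFrame({"x": np.random.randn(1_000_000),
                   "y": np.random.randint(0, 100, size=1_000_000)})

for codec in ["uncompressed", "lz4", "zstd"]:
    path = f"data_{codec}.feather"
    feather.write_feather(df, path, compression=codec)  # Feather V2 with buffer compression
    start = time.perf_counter()
    feather.read_feather(path)
    print(codec, "read time:", round(time.perf_counter() - start, 3), "s")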
I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
know the read/write times and compression ratios. Shouldn't take too
long.
On Wed, Mar 25, 2020 at 10:37 PM Fan Liya wrote:
>
> Thanks a lot for sharing the good results.
>
> As investigated by Wes, we have existing zstd library
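As a rough illustration of how such a batch-size grid could be run with pyarrow (placeholder data; the chunksize argument controls the record batch size written to the file):

import os
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"x": np.random.randn(1_000_000)})  # stand-in for the real datasets

# Record batch sizes from 1024 up to 128K rows
for exp in range(10, 18):
    batch_size = 2 ** exp
    path = f"data_zstd_{batch_size}.feather"
    feather.write_feather(table, path, compression="zstd", chunksize=batch_size)
    print(batch_size, os.path.getsize(path), "bytes")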
Thanks a lot for sharing the good results.
As investigated by Wes, we have an existing zstd library for Java (zstd-jni)
[1] and an lz4 library for Java (lz4-java) [2].
+1 for the 1024 batch size, as it represents an important scenario where
the batch fits into the L1 cache (IMO).
Best,
Liya Fan
[1] h
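Back-of-the-envelope arithmetic behind that point (assuming 8-byte values and a typical 32 KB per-core L1 data cache):

rows = 1024
bytes_per_value = 8                    # e.g. an int64 or float64 column
buffer_size = rows * bytes_per_value   # 8192 bytes = 8 KiB per buffer
l1d_cache = 32 * 1024                  # common L1 data cache size per core
print(buffer_size <= l1d_cache)        # True: one such buffer fits comfortably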
If it isn't hard, could you run with batch sizes of 1024 or 2048 records? I
think there was a question raised previously about whether there is a
benefit for smaller buffer sizes.
Thanks,
Micah
On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney wrote:
> On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield
> wrote:
On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield wrote:
>
> >
> > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > dataset. So that's a huge space savings
>
> One more question on this. What was the a
On Wed, Mar 25, 2020 at 2:32 AM Wes McKinney wrote:
> From what I've found searching on the internet
>
> - Java:
> * ZSTD -- JNI-based library available
> * LZ4 -- both JNI and native Java available
>
> - Go: ZSTD is a C binding, while there is an LZ4 native Go implementation
>
AFAIK, one has acc
>
> Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> dataset. So that's a huge space savings
One more question on this. What was the average row-batch size used? I
see in the proposal some buffers might
From what I've found searching on the internet
- Java:
* ZSTD -- JNI-based library available
* LZ4 -- both JNI and native Java available
- Go: ZSTD is a C binding, while there is an LZ4 native Go implementation
- Rust: bindings to both C libraries available
- C# wrapper libraries seem to be av
Thanks Wes,
It would be nice if contributors to other languages could express their
opinions on the two compression formats selected (in particular whether they
present challenges in finding a suitable library for decompression).
-Micah
On Tue, Mar 24, 2020 at 3:08 PM Wes McKinney wrote:
> I just
I just opened this pull request with the proposed format additions
based on this discussion:
https://github.com/apache/arrow/pull/6707
If there is more feedback about the details, it would be good to know
it now. In a couple of days I would like to call a vote to see if
there is interest in forma
On 24/03/2020 at 00:39, Wes McKinney wrote:
>
> As far as what Micah said about having a limited number of
> compressors: I would be in favor of having just LZ4 and ZSTD.
+1, exactly my thought as well.
Regards
Antoine.
hi folks,
Sorry it's taken me a little while to produce supporting benchmarks.
* I implemented experimental trivial body buffer compression in
https://github.com/apache/arrow/pull/6638
* I hooked up the Arrow IPC file format with compression as the new
Feather V2 format in
https://github.com/apac
Hi Wes,
Thanks a lot for the additional information.
Looking forward to seeing the good results from your experiments.
Best,
Liya Fan
On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney wrote:
> I see, thank you.
>
> For such a scenario, implementations would need to define a
> "UserDefinedCodec" interf
I see, thank you.
For such a scenario, implementations would need to define a
"UserDefinedCodec" interface to enable codecs to be registered from
third party code, similar to what is done for extension types [1]
I'll update this thread when I get my experimental C++ patch up to see
what I'm think
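A purely hypothetical sketch of what such a registry might look like on the Python side (none of these names exist in the codebase; this only illustrates the idea of registering codecs from third-party code):

from typing import Callable, Dict, Tuple

# Hypothetical registry mapping a codec name to (compress, decompress) callables
_user_codecs: Dict[str, Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]] = {}

def register_codec(name, compress, decompress):
    """Register a user-defined codec so a reader can look it up by name."""
    _user_codecs[name] = (compress, decompress)

def get_codec(name):
    """Look up a previously registered codec; raises KeyError if unknown."""
    return _user_codecs[name]

# Example: a trivial identity codec
register_codec("identity", lambda b: b, lambda b: b)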
Hi Wes,
Thanks a lot for your further clarification.
Some of my preliminary thoughts:
1. We assign a unique GUID to each pair of compression/decompression
strategies. The GUID is stored as part of the Message.custom_metadata. When
receiving the GUID, the receiver knows which decompression strateg
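To make the GUID idea concrete, here is a small sketch; since this is only an illustration, schema-level metadata stands in for Message.custom_metadata, and the key name is made up:

import uuid
import pyarrow as pa

schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

# A GUID identifying the compression/decompression strategy pair.
# In the proposal this would travel in Message.custom_metadata instead.
codec_guid = str(uuid.uuid4())
schema_with_codec = schema.with_metadata({"compression_strategy_guid": codec_guid})

print(schema_with_codec.metadata)  # {b'compression_strategy_guid': b'...'}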
Okay, I guess my question is how the receiver is going to be able to
determine how to "rehydrate" the record batch buffers:
What I've proposed amounts to the following:
* UNCOMPRESSED: the current behavior
* ZSTD/LZ4/...: each buffer is compressed and written with an int64
length prefix
(I'm clo
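A rough sketch of that per-buffer scheme, assuming the zstandard Python package and assuming the int64 prefix stores the uncompressed length (which length goes in the prefix is exactly the kind of detail the proposal nails down):

import struct
import zstandard as zstd

def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer and prepend an int64 (little-endian) length prefix."""
    compressed = zstd.ZstdCompressor(level=1).compress(buf)
    return struct.pack("<q", len(buf)) + compressed

def decompress_buffer(data: bytes) -> bytes:
    """Invert compress_buffer: read the prefix, then decompress the payload."""
    (uncompressed_len,) = struct.unpack("<q", data[:8])
    return zstd.ZstdDecompressor().decompress(data[8:], max_output_size=uncompressed_len)

original = b"some Arrow buffer contents" * 100
assert decompress_buffer(compress_buffer(original)) == original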
Hi Wes,
I am thinking of adding an option named "USER_DEFINED" (or something
similar) to enum CompressionType in your proposal.
IMO, this option should be used primarily in Flight.
Best,
Liya Fan
On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney wrote:
> On Tue, Mar 3, 2020, 8:11 PM Fan Liya wrote
On Tue, Mar 3, 2020, 8:11 PM Fan Liya wrote:
> Sure. I agree with you that we should not overdo this.
> I am wondering if we should provide an option to allow users to plug in
> their customized compression strategies.
>
Can you provide a patch showing changes to Message.fbs (or Schema.fbs) that
Sure. I agree with you that we should not overdo this.
I am wondering if we should provide an option to allow users to plug in
their customized compression strategies.
Best,
Liya Fan
On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney wrote:
> On Tue, Mar 3, 2020, 7:36 AM Fan Liya wrote:
>
> > I am so
On Tue, Mar 3, 2020, 7:36 AM Fan Liya wrote:
> I am so glad to see this discussion, and I am willing to provide help from
> the Java side.
>
> In the proposal, I see the support for basic compression strategies
> > (e.g. gzip, snappy).
> IMO, applying a single basic strategy is not likely to achieve
Well, we shouldn't overdo this either. We are not trying to replicate
the Parquet format.
Regards
Antoine.
On 03/03/2020 at 14:36, Fan Liya wrote:
> I am so glad to see this discussion, and I am willing to provide help from
> the Java side.
>
> In the proposal, I see the support for basic
I am so glad to see this discussion, and I am willing to provide help from
the Java side.
In the proposal, I see the support for basic compression strategies
(e.g. gzip, snappy).
IMO, applying a single basic strategy is not likely to achieve a performance
improvement in most scenarios.
The optimal c
If we want to use an HTTP header, it would be more of an Accept-Encoding
header, no?
In any case, we would have to put non-standard values there (e.g. lz4),
so I'm not sure how desirable it is to repurpose HTTP headers for that,
rather than add some dedicated field to the Flight messages.
Regards
gRPC supports headers, so for Flight we could send essentially an Accept
header and perhaps a Content-Type header.
David
On Mon, Mar 2, 2020, 23:15 Micah Kornfield wrote:
> Hi Wes,
> A few thoughts on this. In general, I think it is a good idea. But before
> proceeding, I think the following
Hi Wes,
A few thoughts on this. In general, I think it is a good idea. But before
proceeding, I think the following points are worth discussing:
1. Does this actually improve throughput/latency for Flight? (I think you
mentioned you would follow up with benchmarks.)
2. I think we should limit t
On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou wrote:
>
>
> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > In the context of a "next version of the Feather format" ARROW-5510
> > (which is consumed only by Python and R at the moment), I have been
> > looking at compressing buffers using fast com
I also support compression at the buffer level, and making it an extra
message.
Talking about compression and Flight, has anyone tested using gRPC's
compression to compress at the transport level (if that's a correct way to
describe it)? I believe only gzip and brotli are currently supported, so
t
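For reference, transport-level compression can be switched on for a plain gRPC channel like this in Python (ordinary grpcio usage, not the Arrow Flight client API; the address is a placeholder):

import grpc

# Ask gRPC to compress messages at the transport level with gzip.
channel = grpc.insecure_channel(
    "localhost:50051",
    compression=grpc.Compression.Gzip,
)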
On 01/03/2020 at 22:01, Wes McKinney wrote:
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies. This c
On Sun, Mar 1, 2020 at 3:01 PM Wes McKinney wrote:
>
> In the context of a "next version of the Feather format" ARROW-5510
> (which is consumed only by Python and R at the moment), I have been
> looking at compressing buffers using fast compressors like ZSTD when
> writing the RecordBatch bodies.
In the context of a "next version of the Feather format" ARROW-5510
(which is consumed only by Python and R at the moment), I have been
looking at compressing buffers using fast compressors like ZSTD when
writing the RecordBatch bodies. This could be handled privately as an
implementation detail of
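A minimal sketch of how buffer compression can be requested through pyarrow's IPC writer in reasonably recent versions (the table contents and file name are placeholders):

import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})  # placeholder data

# Request ZSTD (or "lz4") compression of the RecordBatch body buffers
options = ipc.IpcWriteOptions(compression="zstd")
with pa.OSFile("batches.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)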