Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

David Li Tue, 03 Mar 2020 03:52:56 -0800

gRPC supports headers so for Flight, we could send essentially an Accept
header and perhaps a Content-Type header.
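
As a rough illustration of that idea, here is a minimal sketch of how a server
might pick a codec from an Accept-style metadata entry. It does not use any
gRPC or Flight API. The header name, the codec list (the thread only settles on
zstd plus possibly one more, so lz4 is an assumption) and the function names
are all hypothetical.

```python
# Hypothetical metadata key; the thread does not name the header.
ACCEPT_KEY = "x-arrow-accept-compression"

# Assumed server preference order; "lz4" is a placeholder for "1 more".
SERVER_CODECS = ("zstd", "lz4")


def parse_accept(value):
    """Split a comma-separated codec list, ignoring blanks and letter case."""
    return [c.strip().lower() for c in value.split(",") if c.strip()]


def choose_codec(request_metadata, server_codecs=SERVER_CODECS):
    """Return the first server-preferred codec the client accepts, else None.

    request_metadata is a list of (key, value) string pairs, which is the
    general shape in which call metadata is exposed.
    """
    accepted = []
    for key, value in request_metadata:
        if key.lower() == ACCEPT_KEY:
            accepted.extend(parse_accept(value))
    for codec in server_codecs:
        if codec in accepted:
            return codec
    return None  # no overlap: fall back to uncompressed buffers
```

A client that sends no such entry, or only codecs the server does not know,
gets uncompressed data, which would keep older clients working.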


David

On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Wes,
> A few thoughts on this.  In general, I think it is a good idea.  But before
> proceeding, I think the following points are worth discussing:
> 1.  Does this actually improve throughput/latency for Flight? (I think you
> mentioned you would follow-up with benchmarks).
> 2.  I think we should limit the number of supported compression schemes to
> only 1 or 2.  I think the criteria for selection should be speed and native
> implementations available across the widest possible range of languages.  As
> far as I can tell, zstd only has bindings in Java via JNI, but my
> understanding is that it is probably the right type of compression for our
> use-cases.  So I think zstd + potentially 1 more.
> 3.  Commitment from someone on the Java side to implement this.
> 4.  This doesn't need to be coupled with this change per se, but for
> something like Flight it would be good to have a standard mechanism for
> negotiating server/client capabilities (e.g. the client doesn't support
> compression or only supports a subset of the schemes).
>
>
> Thanks,
> Micah
>
> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 01/03/2020 at 22:01, Wes McKinney wrote:
> > > > In the context of a "next version of the Feather format" ARROW-5510
> > > > (which is consumed only by Python and R at the moment), I have been
> > > > looking at compressing buffers using fast compressors like ZSTD when
> > > > writing the RecordBatch bodies. This could be handled privately as an
> > > > implementation detail of the Feather file, but since ZSTD compression
> > > > could improve throughput in Flight, for example, I thought I would
> > > > bring it up for discussion.
> > > >
> > > > I can see two simple compression strategies:
> > > >
> > > > * Compress the entire message body in one-shot, writing the result out
> > > > with an 8-byte int64 prefix indicating the uncompressed size
> > > > * Compress each non-zero-length constituent Buffer prior to writing to
> > > > the body (and using the same uncompressed-length-prefix when writing
> > > > the compressed buffer)
> > > >
> > > > The latter strategy is preferable for scenarios where we may project
> > > > out only a few fields from a larger record batch (such as reading from
> > > > a memory-mapped file).
> > >
> > > Agreed.  It may also allow using different compression strategies for
> > > different kinds of buffers (for example a bytestream splitting strategy
> > > for floats and doubles, or a delta encoding strategy for integers).
> >
> > If we wanted to allow for different compression to apply to different
> > buffers, I think we will need a new Message type because this would
> > inflate metadata sizes in a way that is not likely to be acceptable
> > for the current uncompressed use case.
> >
> > Here is my strawman proposal
> >
> > https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> >
> > > > Implementation could be accomplished by one of the following methods:
> > > >
> > > > * Setting a field in Message.custom_metadata
> > > > * Adding a new field to Message
> > >
> > > I think it has to be a new field in Message.  Making it an ignorable
> > > metadata field means non-supporting receivers will decode and interpret
> > > the data wrongly.
> > >
> > > Regards
> > >
> > > Antoine.
> >
>
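
The per-buffer strategy quoted above (compress each non-zero-length buffer and
prefix it with an 8-byte int64 uncompressed size) can be sketched roughly as
follows. This is not the strawman implementation. It assumes a little-endian
prefix, uses zlib from the Python standard library as a stand-in for ZSTD, and
skips the padding and alignment that a real IPC body would need. All function
names are hypothetical.

```python
import struct
import zlib

# 8-byte signed int64 length prefix; little-endian is an assumption here.
PREFIX = struct.Struct("<q")


def compress_buffer(buf):
    """Return prefix + compressed bytes, or b"" for a zero-length buffer."""
    if len(buf) == 0:
        return b""  # zero-length buffers are written unchanged
    # zlib is only a stand-in for ZSTD so the sketch has no dependencies.
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data):
    """Invert compress_buffer and check the recorded uncompressed size."""
    if len(data) == 0:
        return b""
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed size mismatch")
    return out


def write_body(buffers):
    """Compress each buffer; return the body and one (offset, length) per buffer."""
    parts, layout, offset = [], [], 0
    for buf in buffers:
        chunk = compress_buffer(buf)
        layout.append((offset, len(chunk)))
        parts.append(chunk)
        offset += len(chunk)
    return b"".join(parts), layout


def read_buffer(body, layout, index):
    """Decompress a single buffer without touching the others."""
    offset, length = layout[index]
    return decompress_buffer(body[offset:offset + length])
```

Because every buffer is compressed independently, a reader projecting a few
fields only needs to call something like read_buffer for the buffers of those
fields. With the whole-body strategy the entire body would have to be
decompressed first.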
