Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Fan Liya Tue, 03 Mar 2020 05:36:57 -0800

I am so glad to see this discussion, and I am willing to provide help from
the Java side.


In the proposal, I see support for basic compression strategies
(e.g. gzip, snappy).
IMO, applying a single basic strategy is unlikely to improve performance
in most scenarios.
The optimal compression strategy is often obtained by composing basic
strategies and tuning their parameters.

I hope we can support such highly customized compression strategies.
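
A minimal sketch of what "composing basic strategies" could look like,
assuming a buffer of sorted int64 values: delta encoding (mentioned further
down the thread for integers) followed by a general-purpose compressor. All
function names here are hypothetical and none of this is Arrow API; zlib is
used only because it ships with Python, and the 8-byte length prefix mirrors
the uncompressed-size prefix described in the quoted proposal.

```python
import struct
import zlib


def delta_encode(values):
    """Replace each integer with its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack_int64(values):
    """Serialize integers as little-endian int64, like a fixed-width buffer."""
    return struct.pack(f"<{len(values)}q", *values)


def unpack_int64(raw):
    """Inverse of pack_int64."""
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def compress_buffer(raw, level=6):
    """zlib-compress raw bytes, prefixed with the uncompressed size (int64)."""
    return struct.pack("<q", len(raw)) + zlib.compress(raw, level)


def decompress_buffer(blob):
    """Read the 8-byte size prefix, decompress, and check the length."""
    (size,) = struct.unpack_from("<q", blob, 0)
    raw = zlib.decompress(blob[8:])
    if len(raw) != size:
        raise ValueError("uncompressed size does not match prefix")
    return raw


# Illustrative data (an assumption): 10,000 evenly spaced int64 values.
values = list(range(1_000_000, 1_000_000 + 30_000, 3))

# Single basic strategy: compress the raw buffer directly.
plain = compress_buffer(pack_int64(values))

# Composed strategy: delta-encode first, then compress.
composed = compress_buffer(pack_int64(delta_encode(values)))

# The composed pipeline must round-trip to the original values.
roundtrip = delta_decode(unpack_int64(decompress_buffer(composed)))

print(len(plain), len(composed), roundtrip == values)
```

On data like this the delta step turns the buffer into a long run of
identical values, so the composed output should come out much smaller than
the single-strategy one. The compression level is the kind of tunable
parameter I have in mind.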

Best,
Liya Fan



On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org> wrote:

>
> If we want to use an HTTP header, it would be more of an Accept-Encoding
> header, no?
>
> In any case, we would have to put non-standard values there (e.g. lz4),
> so I'm not sure how desirable it is to repurpose HTTP headers for that,
> rather than add some dedicated field to the Flight messages.
>
> Regards
>
> Antoine.
>
>
> On 03/03/2020 at 12:52, David Li wrote:
> > gRPC supports headers so for Flight, we could send essentially an Accept
> > header and perhaps a Content-Type header.
> >
> > David
> >
> > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> >> Hi Wes,
> >> A few thoughts on this.  In general, I think it is a good idea.  But
> >> before proceeding, I think the following points are worth discussing:
> >> 1.  Does this actually improve throughput/latency for Flight? (I think
> >> you mentioned you would follow up with benchmarks).
> >> 2.  I think we should limit the number of supported compression schemes
> >> to only 1 or 2.  I think the criteria for selection should be speed and
> >> native implementations available across the widest possible range of
> >> languages.  As far as I can tell, zstd only has bindings in Java via
> >> JNI, but my understanding is that it is probably the type of compression
> >> best suited to our use cases.  So I think zstd + potentially 1 more.
> >> 3.  Commitment from someone on the Java side to implement this.
> >> 4.  This doesn't need to be coupled with this change per se, but for
> >> something like Flight it would be good to have a standard mechanism for
> >> negotiating server/client capabilities (e.g. client doesn't support
> >> compression or only supports a subset).
> >>
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>>
> >>>>
> >>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> >>>>> In the context of a "next version of the Feather format" ARROW-5510
> >>>>> (which is consumed only by Python and R at the moment), I have been
> >>>>> looking at compressing buffers using fast compressors like ZSTD when
> >>>>> writing the RecordBatch bodies. This could be handled privately as an
> >>>>> implementation detail of the Feather file, but since ZSTD compression
> >>>>> could improve throughput in Flight, for example, I thought I would
> >>>>> bring it up for discussion.
> >>>>>
> >>>>> I can see two simple compression strategies:
> >>>>>
> >>>>> * Compress the entire message body in one-shot, writing the result
> >>>>> out with an 8-byte int64 prefix indicating the uncompressed size
> >>>>> * Compress each non-zero-length constituent Buffer prior to writing
> >>>>> to the body (and using the same uncompressed-length-prefix when
> >>>>> writing the compressed buffer)
> >>>>>
> >>>>> The latter strategy is preferable for scenarios where we may project
> >>>>> out only a few fields from a larger record batch (such as reading
> >>>>> from a memory-mapped file).
> >>>>
> >>>> Agreed.  It may also allow using different compression strategies for
> >>>> different kinds of buffers (for example a bytestream splitting
> >>>> strategy for floats and doubles, or a delta encoding strategy for
> >>>> integers).
> >>>
> >>> If we wanted to allow for different compression to apply to different
> >>> buffers, I think we will need a new Message type because this would
> >>> inflate metadata sizes in a way that is not likely to be acceptable
> >>> for the current uncompressed use case.
> >>>
> >>> Here is my strawman proposal
> >>>
> >>>
> >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> >>>
> >>>>> Implementation could be accomplished by one of the following methods:
> >>>>>
> >>>>> * Setting a field in Message.custom_metadata
> >>>>> * Adding a new field to Message
> >>>>
> >>>> I think it has to be a new field in Message.  Making it an ignorable
> >>>> metadata field means non-supporting receivers will decode and
> >>>> interpret the data wrongly.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>
> >>
> >
>
