Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Antoine Pitrou Tue, 03 Mar 2020 05:46:57 -0800


Well, we shouldn't overdo this either.  We are not trying to replicate
the Parquet format.


Regards

Antoine.


On 03/03/2020 at 14:36, Fan Liya wrote:
> I am so glad to see this discussion, and I am willing to provide help from
> the Java side.
> 
> In the proposal, I see the support for basic compression strategies
> (e.g. gzip, snappy).
> IMO, applying a single basic strategy is not likely to achieve performance
> improvement for most scenarios.
> The optimal compression strategy is often obtained by composing basic
> strategies and tuning parameters.
> 
> I hope we can support such highly customized compression strategies.
> 
> Best,
> Liya Fan
> 
> 
> 
> On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <[email protected]> wrote:
> 
>>
>> If we want to use an HTTP header, it would be more of an Accept-Encoding
>> header, no?
>>
>> In any case, we would have to put non-standard values there (e.g. lz4),
>> so I'm not sure how desirable it is to repurpose HTTP headers for that,
>> rather than add some dedicated field to the Flight messages.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 03/03/2020 at 12:52, David Li wrote:
>>> gRPC supports headers so for Flight, we could send essentially an Accept
>>> header and perhaps a Content-Type header.
>>>
>>> David
>>>
>>> On Mon, Mar 2, 2020, 23:15 Micah Kornfield <[email protected]> wrote:
>>>
>>>> Hi Wes,
>>>> A few thoughts on this.  In general, I think it is a good idea.  But
>>>> before proceeding, I think the following points are worth discussing:
>>>> 1.  Does this actually improve throughput/latency for Flight? (I think
>>>> you mentioned you would follow up with benchmarks.)
>>>> 2.  I think we should limit the number of supported compression schemes
>>>> to only 1 or 2.  I think the criteria for selection should be speed and
>>>> native implementations available across the widest possible range of
>>>> languages.  As far as I can tell, zstd only has bindings in Java via
>>>> JNI, but my understanding is that it is probably the right type of
>>>> compression for our use cases.  So I think zstd + potentially 1 more.
>>>> 3.  Commitment from someone on the Java side to implement this.
>>>> 4.  This doesn't need to be coupled with this change per se, but for
>>>> something like Flight it would be good to have a standard mechanism for
>>>> negotiating server/client capabilities (e.g. client doesn't support
>>>> compression or only supports a subset).
>>>>
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <[email protected]> wrote:
>>>>
>>>>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
>>>>>>> In the context of a "next version of the Feather format" ARROW-5510
>>>>>>> (which is consumed only by Python and R at the moment), I have been
>>>>>>> looking at compressing buffers using fast compressors like ZSTD when
>>>>>>> writing the RecordBatch bodies. This could be handled privately as an
>>>>>>> implementation detail of the Feather file, but since ZSTD compression
>>>>>>> could improve throughput in Flight, for example, I thought I would
>>>>>>> bring it up for discussion.
>>>>>>>
>>>>>>> I can see two simple compression strategies:
>>>>>>>
>>>>>>> * Compress the entire message body in one-shot, writing the result
>>>>>>> out with an 8-byte int64 prefix indicating the uncompressed size
>>>>>>> * Compress each non-zero-length constituent Buffer prior to writing
>>>>>>> to the body (and using the same uncompressed-length-prefix when
>>>>>>> writing the compressed buffer)
>>>>>>>
>>>>>>> The latter strategy is preferable for scenarios where we may project
>>>>>>> out only a few fields from a larger record batch (such as reading
>>>>>>> from a memory-mapped file).
>>>>>>
>>>>>> Agreed.  It may also allow using different compression strategies for
>>>>>> different kinds of buffers (for example a bytestream splitting
>>>>>> strategy for floats and doubles, or a delta encoding strategy for
>>>>>> integers).
>>>>>
>>>>> If we wanted to allow for different compression to apply to different
>>>>> buffers, I think we will need a new Message type because this would
>>>>> inflate metadata sizes in a way that is not likely to be acceptable
>>>>> for the current uncompressed use case.
>>>>>
>>>>> Here is my strawman proposal:
>>>>>
>>>>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>>>>>
>>>>>>> Implementation could be accomplished by one of the following methods:
>>>>>>>
>>>>>>> * Setting a field in Message.custom_metadata
>>>>>>> * Adding a new field to Message
>>>>>>
>>>>>> I think it has to be a new field in Message.  Making it an ignorable
>>>>>> metadata field means non-supporting receivers will decode and
>>>>>> interpret the data wrongly.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>
>>>>
>>>
>>
>
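
To make the per-buffer strategy from the quoted proposal concrete, here is a
minimal sketch in Python. It is not the strawman implementation. Several
details are assumptions made for illustration: zlib stands in for ZSTD only
because it ships with Python, the int64 prefix is written little-endian, and
zero-length buffers are written with no prefix at all. The function names are
hypothetical.

```python
import struct
import zlib

# Assumption: 8-byte little-endian signed int64 holding the uncompressed size.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Frame one buffer as: int64 uncompressed size + compressed bytes."""
    if len(buf) == 0:
        # Only non-zero-length buffers are compressed; writing empty buffers
        # with no prefix is an assumption of this sketch.
        return b""
    # zlib is a stand-in codec here, not the codec proposed in the thread.
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(framed: bytes) -> bytes:
    """Reverse compress_buffer, checking the size recorded in the prefix."""
    if len(framed) == 0:
        return b""
    (size,) = PREFIX.unpack_from(framed, 0)
    out = zlib.decompress(framed[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed size does not match the prefix")
    return out


def compress_body(buffers):
    """Frame every buffer independently so a reader can decode a subset."""
    return [compress_buffer(b) for b in buffers]


# Project out a single buffer without touching the others.
validity = b"\xff" * 16
values = struct.pack("<8q", *range(8))
framed = compress_body([validity, b"", values])
assert decompress_buffer(framed[2]) == values
```

Because each buffer is framed on its own, a reader that needs only one field
decompresses just that buffer, which is the projection benefit described in
the quoted message. The prefix also lets a reader allocate the output buffer
before decompressing.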

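The capability negotiation raised in the thread (a client that supports no
compression, or only a subset) could look roughly like the sketch below. It
assumes the client sends a comma-separated, Accept-Encoding-style preference
list and that the server picks the first entry it supports. The function
names and the value format are hypothetical, and nothing here is part of
Flight or gRPC.

```python
def parse_accept(value: str):
    """Split a comma-separated codec list into lowercase names, in order."""
    return [tok.strip().lower() for tok in value.split(",") if tok.strip()]


def negotiate_codec(client_accepts, server_supports):
    """Return the first client-preferred codec the server supports.

    Returns None when there is no overlap, meaning the data should be sent
    uncompressed.
    """
    supported = {name.lower() for name in server_supports}
    for name in client_accepts:
        if name.lower() in supported:
            return name.lower()
    return None


# A client that prefers zstd but also accepts lz4, talking to an lz4-only server.
assert negotiate_codec(parse_accept("zstd, lz4"), ["lz4"]) == "lz4"
# A client that advertises nothing falls back to uncompressed data.
assert negotiate_codec(parse_accept(""), ["zstd", "lz4"]) is None
```

The same selection logic applies whether the list travels in a gRPC header or
in a dedicated field of the Flight messages. Only the transport of the list
differs.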