Well, we shouldn't overdo this either. We are not trying to replicate the Parquet format.

Regards

Antoine.


On 03/03/2020 at 14:36, Fan Liya wrote:
> I am so glad to see this discussion, and I am willing to provide help from
> the Java side.
>
> In the proposal, I see the support for basic compression strategies
> (e.g. gzip, snappy).
> IMO, applying a single basic strategy is not likely to achieve performance
> improvement for most scenarios.
> The optimal compression strategy is often obtained by composing basic
> strategies and tuning parameters.
>
> I hope we can support such highly customized compression strategies.
>
> Best,
> Liya Fan
>
> On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org> wrote:
>
>> If we want to use an HTTP header, it would be more of an Accept-Encoding
>> header, no?
>>
>> In any case, we would have to put non-standard values there (e.g. lz4),
>> so I'm not sure how desirable it is to repurpose HTTP headers for that,
>> rather than add some dedicated field to the Flight messages.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 03/03/2020 at 12:52, David Li wrote:
>>> gRPC supports headers so for Flight, we could send essentially an Accept
>>> header and perhaps a Content-Type header.
>>>
>>> David
>>>
>>> On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Wes,
>>>> A few thoughts on this. In general, I think it is a good idea. But before
>>>> proceeding, I think the following points are worth discussing:
>>>> 1. Does this actually improve throughput/latency for Flight? (I think you
>>>> mentioned you would follow up with benchmarks.)
>>>> 2. I think we should limit the number of supported compression schemes to
>>>> only 1 or 2. I think the criteria for selection are speed and native
>>>> implementations available across the widest possible set of languages.
>>>> As far as I can tell, zstd only has bindings in Java via JNI, but my
>>>> understanding is it is probably the type of compression for our
>>>> use cases. So I think zstd + potentially 1 more.
>>>> 3. Commitment from someone on the Java side to implement this.
>>>> 4. This doesn't need to be coupled with this change per se, but for
>>>> something like Flight it would be good to have a standard mechanism for
>>>> negotiating server/client capabilities (e.g. client doesn't support
>>>> compression or only supports a subset).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
>>>>>>
>>>>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
>>>>>>> In the context of a "next version of the Feather format" ARROW-5510
>>>>>>> (which is consumed only by Python and R at the moment), I have been
>>>>>>> looking at compressing buffers using fast compressors like ZSTD when
>>>>>>> writing the RecordBatch bodies. This could be handled privately as an
>>>>>>> implementation detail of the Feather file, but since ZSTD compression
>>>>>>> could improve throughput in Flight, for example, I thought I would
>>>>>>> bring it up for discussion.
>>>>>>>
>>>>>>> I can see two simple compression strategies:
>>>>>>>
>>>>>>> * Compress the entire message body in one shot, writing the result out
>>>>>>> with an 8-byte int64 prefix indicating the uncompressed size
>>>>>>> * Compress each non-zero-length constituent Buffer prior to writing to
>>>>>>> the body (and using the same uncompressed-length prefix when writing
>>>>>>> the compressed buffer)
>>>>>>>
>>>>>>> The latter strategy is preferable for scenarios where we may project
>>>>>>> out only a few fields from a larger record batch (such as reading from
>>>>>>> a memory-mapped file).
>>>>>>
>>>>>> Agreed. It may also allow using different compression strategies for
>>>>>> different kinds of buffers (for example a bytestream splitting strategy
>>>>>> for floats and doubles, or a delta encoding strategy for integers).
>>>>>
>>>>> If we wanted to allow for different compression to apply to different
>>>>> buffers, I think we will need a new Message type because this would
>>>>> inflate metadata sizes in a way that is not likely to be acceptable
>>>>> for the current uncompressed use case.
>>>>>
>>>>> Here is my strawman proposal
>>>>>
>>>>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>>>>>
>>>>>>> Implementation could be accomplished by one of the following methods:
>>>>>>>
>>>>>>> * Setting a field in Message.custom_metadata
>>>>>>> * Adding a new field to Message
>>>>>>
>>>>>> I think it has to be a new field in Message. Making it an ignorable
>>>>>> metadata field means non-supporting receivers will decode and interpret
>>>>>> the data wrongly.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
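
For illustration only, here is a minimal sketch of the second strategy quoted in the thread: each non-zero-length buffer is compressed on its own and written with an 8-byte int64 prefix holding its uncompressed size. It is not the strawman's implementation. Python's standard-library zlib stands in for ZSTD, and the little-endian byte order, the helper names (`compress_buffer`, `write_body`, `read_buffer`), the `(offset, length)` layout list, and the choice to write zero-length buffers as empty are all assumptions made for the example.

```python
import struct
import zlib

# 8-byte signed int64 prefix; little-endian is an assumption of this sketch.
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer and prepend its uncompressed length."""
    if len(buf) == 0:
        return b""  # assumed convention: zero-length buffers are not compressed
    return PREFIX.pack(len(buf)) + zlib.compress(buf)  # zlib stands in for ZSTD


def decompress_buffer(data: bytes) -> bytes:
    """Reverse compress_buffer, checking the length recorded in the prefix."""
    if len(data) == 0:
        return b""
    (size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed length does not match prefix")
    return out


def write_body(buffers):
    """Compress each buffer independently.

    Returns the concatenated body and a list of (offset, length) pairs,
    one per buffer, locating each compressed chunk inside the body.
    """
    body = bytearray()
    layout = []
    for buf in buffers:
        chunk = compress_buffer(buf)
        layout.append((len(body), len(chunk)))
        body += chunk
    return bytes(body), layout


def read_buffer(body: bytes, layout, index: int) -> bytes:
    """Decompress a single buffer without touching the others."""
    offset, length = layout[index]
    return decompress_buffer(body[offset:offset + length])
```

Because every buffer is compressed separately, a reader that projects only a few fields can call `read_buffer` for just those buffers and skip the rest, which is the advantage the thread attributes to this strategy over compressing the whole body in one shot.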