I am so glad to see this discussion, and I am willing to provide help from the Java side.
In the proposal, I see support for basic compression strategies (e.g. gzip,
snappy). IMO, applying a single basic strategy is not likely to improve
performance in most scenarios. The optimal compression strategy is often
obtained by composing basic strategies and tuning their parameters. I hope we
can support such highly customized compression strategies.

Best,
Liya Fan

On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org> wrote:
>
> If we want to use an HTTP header, it would be more of an Accept-Encoding
> header, no?
>
> In any case, we would have to put non-standard values there (e.g. lz4),
> so I'm not sure how desirable it is to repurpose HTTP headers for that,
> rather than add some dedicated field to the Flight messages.
>
> Regards
>
> Antoine.
>
>
> On 03/03/2020 at 12:52, David Li wrote:
> > gRPC supports headers so for Flight, we could send essentially an Accept
> > header and perhaps a Content-Type header.
> >
> > David
> >
> > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> >> Hi Wes,
> >> A few thoughts on this. In general, I think it is a good idea. But
> >> before proceeding, I think the following points are worth discussing:
> >> 1. Does this actually improve throughput/latency for Flight? (I think
> >> you mentioned you would follow up with benchmarks).
> >> 2. I think we should limit the number of supported compression schemes
> >> to only 1 or 2. I think the criteria for selection should be speed and
> >> native implementations available across the widest possible range of
> >> languages. As far as I can tell, zstd only has bindings in Java via
> >> JNI, but my understanding is it is probably the type of compression
> >> for our use cases. So I think zstd + potentially 1 more.
> >> 3. Commitment from someone on the Java side to implement this.
> >> 4. This doesn't need to be coupled with this change per se, but for
> >> something like Flight it would be good to have a standard mechanism for
> >> negotiating server/client capabilities (e.g. client doesn't support
> >> compression or only supports a subset).
> >>
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>>
> >>>>
> >>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> >>>>> In the context of a "next version of the Feather format" ARROW-5510
> >>>>> (which is consumed only by Python and R at the moment), I have been
> >>>>> looking at compressing buffers using fast compressors like ZSTD when
> >>>>> writing the RecordBatch bodies. This could be handled privately as an
> >>>>> implementation detail of the Feather file, but since ZSTD compression
> >>>>> could improve throughput in Flight, for example, I thought I would
> >>>>> bring it up for discussion.
> >>>>>
> >>>>> I can see two simple compression strategies:
> >>>>>
> >>>>> * Compress the entire message body in one shot, writing the result
> >>>>> out with an 8-byte int64 prefix indicating the uncompressed size
> >>>>> * Compress each non-zero-length constituent Buffer prior to writing
> >>>>> to the body (and using the same uncompressed-length-prefix when
> >>>>> writing the compressed buffer)
> >>>>>
> >>>>> The latter strategy is preferable for scenarios where we may project
> >>>>> out only a few fields from a larger record batch (such as reading
> >>>>> from a memory-mapped file).
> >>>>
> >>>> Agreed. It may also allow using different compression strategies for
> >>>> different kinds of buffers (for example a bytestream splitting
> >>>> strategy for floats and doubles, or a delta encoding strategy for
> >>>> integers).
> >>>
> >>> If we wanted to allow for different compression to apply to different
> >>> buffers, I think we will need a new Message type because this would
> >>> inflate metadata sizes in a way that is not likely to be acceptable
> >>> for the current uncompressed use case.
> >>>
> >>> Here is my strawman proposal
> >>>
> >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> >>>
> >>>>> Implementation could be accomplished by one of the following methods:
> >>>>>
> >>>>> * Setting a field in Message.custom_metadata
> >>>>> * Adding a new field to Message
> >>>>
> >>>> I think it has to be a new field in Message. Making it an ignorable
> >>>> metadata field means non-supporting receivers will decode and
> >>>> interpret the data wrongly.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
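
For concreteness, the per-buffer strategy quoted above (compress each non-zero-length buffer and prefix it with its uncompressed length as an 8-byte int64) might look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the strawman implementation: the helper names are hypothetical, zlib stands in for ZSTD so that the example runs with the standard library alone, and the little-endian byte order of the prefix is assumed rather than taken from the proposal.

```python
import struct
import zlib

# Assumption: 8-byte little-endian signed int64 prefix holding the
# uncompressed size. The thread only says "8-byte int64".
PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer (hypothetical helper).

    Zero-length buffers are written unchanged, as in the quoted strategy.
    zlib is a stand-in for ZSTD here.
    """
    if len(buf) == 0:
        return b""
    return PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Reverse compress_buffer and check the length prefix."""
    if len(data) == 0:
        return b""
    (uncompressed_size,) = PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[PREFIX.size:])
    if len(out) != uncompressed_size:
        raise ValueError("length prefix does not match decompressed size")
    return out


# Three buffers of a made-up record batch: an empty validity buffer,
# a run of zeros, and four int32 values.
buffers = [b"", bytes(1000), struct.pack("<4i", 1, 2, 3, 4)]
compressed = [compress_buffer(b) for b in buffers]
restored = [decompress_buffer(c) for c in compressed]
```

Because each buffer is compressed independently, a reader that projects out a single field only has to decompress that field's buffers, which is the advantage noted in the thread for the second strategy.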