On Tue, Mar 3, 2020, 8:11 PM Fan Liya <liya.fa...@gmail.com> wrote:

> Sure. I agree with you that we should not overdo this.
> I am wondering if we should provide an option to allow users to plug in
> their customized compression strategies.
>

Can you provide a patch showing changes to Message.fbs (or Schema.fbs)
that make this idea more concrete?

> Best,
> Liya Fan
>
> On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Mar 3, 2020, 7:36 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > I am so glad to see this discussion, and I am willing to provide help
> > > from the Java side.
> > >
> > > In the proposal, I see the support for basic compression strategies
> > > (e.g. gzip, snappy).
> > > IMO, applying a single basic strategy is not likely to achieve
> > > performance improvement for most scenarios.
> > > The optimal compression strategy is often obtained by composing basic
> > > strategies and tuning parameters.
> > >
> > > I hope we can support such highly customized compression strategies.
> > >
> >
> > I think anything much beyond trivial one-shot buffer-level compression
> > is probably out of the question for addition to the current
> > "RecordBatch" Flatbuffers type, because the additional metadata would
> > add undesirable bloat (which I would be against). If people have other
> > ideas it would be great to see exactly what you are thinking as far as
> > changes to the protocol files.
> >
> > I'll try to assemble some examples to show the before/after results of
> > applying the simple strategy.
> >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > If we want to use an HTTP header, it would be more of an
> > > > Accept-Encoding header, no?
> > > >
> > > > In any case, we would have to put non-standard values there (e.g.
> > > > lz4), so I'm not sure how desirable it is to repurpose HTTP headers
> > > > for that, rather than add some dedicated field to the Flight
> > > > messages.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On 3 March 2020 at 12:52, David Li wrote:
> > > > > gRPC supports headers so for Flight, we could send essentially an
> > > > > Accept header and perhaps a Content-Type header.
> > > > >
> > > > > David
> > > > >
> > > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfi...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi Wes,
> > > > >> A few thoughts on this. In general, I think it is a good idea. But
> > > > >> before proceeding, I think the following points are worth
> > > > >> discussing:
> > > > >> 1. Does this actually improve throughput/latency for Flight? (I
> > > > >>    think you mentioned you would follow up with benchmarks.)
> > > > >> 2. I think we should limit the number of supported compression
> > > > >>    schemes to only 1 or 2. I think the criteria for selection
> > > > >>    should be speed and native implementations available across the
> > > > >>    widest possible range of languages. As far as I can tell, zstd
> > > > >>    only has bindings in Java via JNI, but my understanding is that
> > > > >>    it is probably the right type of compression for our use cases.
> > > > >>    So I think zstd + potentially 1 more.
> > > > >> 3. Commitment from someone on the Java side to implement this.
> > > > >> 4. This doesn't need to be coupled with this change per se, but for
> > > > >>    something like Flight it would be good to have a standard
> > > > >>    mechanism for negotiating server/client capabilities (e.g. client
> > > > >>    doesn't support compression or only supports a subset).
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmck...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <anto...@python.org>
> > > > >>> wrote:
> > > > >>>>
> > > > >>>> On 1 March 2020 at 22:01, Wes McKinney wrote:
> > > > >>>>> In the context of a "next version of the Feather format"
> > > > >>>>> ARROW-5510 (which is consumed only by Python and R at the
> > > > >>>>> moment), I have been looking at compressing buffers using fast
> > > > >>>>> compressors like ZSTD when writing the RecordBatch bodies. This
> > > > >>>>> could be handled privately as an implementation detail of the
> > > > >>>>> Feather file, but since ZSTD compression could improve
> > > > >>>>> throughput in Flight, for example, I thought I would bring it
> > > > >>>>> up for discussion.
> > > > >>>>>
> > > > >>>>> I can see two simple compression strategies:
> > > > >>>>>
> > > > >>>>> * Compress the entire message body in one shot, writing the
> > > > >>>>>   result out with an 8-byte int64 prefix indicating the
> > > > >>>>>   uncompressed size
> > > > >>>>> * Compress each non-zero-length constituent Buffer prior to
> > > > >>>>>   writing to the body (and using the same
> > > > >>>>>   uncompressed-length prefix when writing the compressed
> > > > >>>>>   buffer)
> > > > >>>>>
> > > > >>>>> The latter strategy is preferable for scenarios where we may
> > > > >>>>> project out only a few fields from a larger record batch (such
> > > > >>>>> as reading from a memory-mapped file).
> > > > >>>>
> > > > >>>> Agreed. It may also allow using different compression strategies
> > > > >>>> for different kinds of buffers (for example a bytestream
> > > > >>>> splitting strategy for floats and doubles, or a delta encoding
> > > > >>>> strategy for integers).
> > > > >>>
> > > > >>> If we wanted to allow for different compression to apply to
> > > > >>> different buffers, I think we will need a new Message type
> > > > >>> because this would inflate metadata sizes in a way that is not
> > > > >>> likely to be acceptable for the current uncompressed use case.
> > > > >>>
> > > > >>> Here is my strawman proposal
> > > > >>>
> > > > >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> > > > >>>
> > > > >>>>> Implementation could be accomplished by one of the following
> > > > >>>>> methods:
> > > > >>>>>
> > > > >>>>> * Setting a field in Message.custom_metadata
> > > > >>>>> * Adding a new field to Message
> > > > >>>>
> > > > >>>> I think it has to be a new field in Message. Making it an
> > > > >>>> ignorable metadata field means non-supporting receivers will
> > > > >>>> decode and interpret the data wrongly.
> > > > >>>>
> > > > >>>> Regards
> > > > >>>>
> > > > >>>> Antoine.
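
For readers following the thread, the second strategy from the original message (compress each non-zero-length buffer and prefix it with an 8-byte int64 uncompressed length) could look roughly like the sketch below. This is only an illustration under several assumptions, not the Arrow implementation: zlib from the Python standard library stands in for ZSTD, the prefix is assumed to be little-endian, buffer alignment and padding are ignored, and the helper names (compress_buffer, decompress_buffer, write_body, read_buffer) are hypothetical.

```python
import struct
import zlib

# Assumption: the 8-byte int64 length prefix is little-endian and signed.
# The thread only says "8-byte int64 prefix".
_PREFIX = struct.Struct("<q")


def compress_buffer(buf: bytes) -> bytes:
    """Compress one buffer and prepend its uncompressed length."""
    if len(buf) == 0:
        # Zero-length buffers are left alone, matching the
        # "non-zero-length" wording in the thread.
        return b""
    # zlib is a stand-in here; the thread discusses ZSTD.
    return _PREFIX.pack(len(buf)) + zlib.compress(buf)


def decompress_buffer(data: bytes) -> bytes:
    """Inverse of compress_buffer, checking the recorded length."""
    if len(data) == 0:
        return b""
    (size,) = _PREFIX.unpack_from(data, 0)
    out = zlib.decompress(data[_PREFIX.size:])
    if len(out) != size:
        raise ValueError("uncompressed length does not match prefix")
    return out


def write_body(buffers):
    """Concatenate compressed buffers.

    Returns the body plus an (offset, length) pair per buffer, which
    plays the role of the buffer layout metadata (hypothetical shape).
    """
    parts, layout, offset = [], [], 0
    for buf in buffers:
        chunk = compress_buffer(buf)
        layout.append((offset, len(chunk)))
        parts.append(chunk)
        offset += len(chunk)
    return b"".join(parts), layout


def read_buffer(body: bytes, layout, index: int) -> bytes:
    """Decompress a single buffer without touching the others."""
    offset, length = layout[index]
    return decompress_buffer(body[offset:offset + length])
```

Because each buffer is compressed independently, read_buffer(body, layout, i) only decompresses the bytes for buffer i, which is the projection benefit mentioned in the thread for reading a few fields from a larger record batch.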