Hi Jacques,

I think our e-mails might have crossed, so I'm consolidating my responses from the previous e-mail as well.
> I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.

I agree it shouldn't block 1.0. I think time-based releases are working well for the community. But if the features are implemented in Java and C++ with integration tests in time for 1.0, should we explicitly rule it out? If not for 1.0, would the subsequent release make sense?

> What is the driving force for transport compression? Are you seeing that as
> a major bottleneck in particular circumstances? (I'm not disagreeing, just
> want to clearly define the particular problem you're worried about.)

I've been working on a 20% project where we appear to be IO bound for transporting record batches. Also, I believe Ji Liu (tianchen92) has been seeing some of the same bottlenecks with the query engine they are working on. Trading off some CPU here would allow us to lower the overall latency in the system.

> You suggested that this be done on the buffer level but it seems like that
> may be too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.

Two reasons for this proposal:
- I'm not sure there is much value added at the batch level vs. simply compressing the whole transport channel. It could be that for small batch sizes compression mostly goes unused. But if it is seen as valuable, we could certainly incorporate a batch-level aspect as well.
- At the buffer level you can use more specialized compression techniques that don't require larger sized data to be effective. For example, there is a JIRA open to consider using PFOR [1] which, if I understand correctly, starts being effective once you have ~128 integers.

> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight).
> This is one way to avoid extending the core record batch concept with
> something that isn't related to processing (at least in your initial
> proposal).

Per the above, this seems like a reasonable approach to me if we want to hold off on buffer-level compression. Another use case for buffer/record-batch level compression would be the Feather file format, where it allows decompressing only a subset of columns/rows. If this use case isn't compelling, I'd be happy to hold off on adding compression to sparse batches until we have benchmarks showing the trade-off between channel-level and buffer-level compression. If we implement buffer-level encodings, we should also see a decent-sized win on space even without compression.

Thanks,
Micah

[1] https://github.com/lemire/FastPFor

On Fri, Jul 5, 2019 at 1:48 PM Jacques Nadeau <jacq...@apache.org> wrote:

> One question and a random thought:
>
> What is the driving force for transport compression? Are you seeing that
> as a major bottleneck in particular circumstances? (I'm not disagreeing,
> just want to clearly define the particular problem you're worried about.)
>
> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal).
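P.S. A quick sketch of the small-batch tradeoff mentioned above. This is not Arrow code; the buffer sizes, the repetitive value pattern, and the choice of zlib are my own illustrative assumptions, just to show why a general-purpose codec adds little for small buffers while larger buffers benefit:

```python
import struct
import zlib

def int32_buffer(n):
    """Serialize n int32 values with a repetitive pattern (compressible)."""
    return b"".join(struct.pack("<i", i % 16) for i in range(n))

# Compare compression ratios as the buffer grows. Fixed codec overhead
# (headers, checksums) dominates for small buffers, so the ratio improves
# substantially with size.
for n in (16, 128, 4096):
    raw = int32_buffer(n)
    compressed = zlib.compress(raw)
    print(f"{n:5d} values: {len(raw):6d} -> {len(compressed):5d} bytes "
          f"(ratio {len(raw) / len(compressed):.1f}x)")
```

The same shape argues for the buffer-level approach: a specialized integer encoding like PFOR can beat a general-purpose codec on exactly these small, typed buffers where zlib's fixed overhead eats the gains.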