Hi Jacques,

I think our e-mails might have crossed, so I'm consolidating my responses from the previous e-mail as well.
> I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.

I agree it shouldn't block 1.0. I think time-based releases are working well for the community. But if the features are implemented in Java and C++ with integration tests in time for 1.0, should we explicitly rule it out? If not for 1.0, would the subsequent release make sense?

> What is the driving force for transport compression? Are you seeing that as
> a major bottleneck in particular circumstances? (I'm not disagreeing, just
> want to clearly define the particular problem you're worried about.)

I've been working on a 20% project where we appear to be IO bound for transporting record batches. Also, I believe Ji Liu (tianchen92) has been seeing some of the same bottlenecks with the query engine they are working on. Trading off some CPU here would allow us to lower the overall latency in the system.

> You suggested that this be done on the buffer level but it seems like that
> may be too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.

Two reasons for this proposal:
- I'm not sure there is much value added at the batch level vs. simply compressing the whole transport channel. It could be that for small batch sizes compression mostly goes unused. But if it is seen as valuable, we could certainly incorporate a batch-level aspect as well.
- At the buffer level you can use more specialized compression techniques that don't require larger sized data to be effective. For example, there is a JIRA open to consider using PFOR [1] which, if I understand correctly, starts being effective once you have ~128 integers.

> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight).
> This is one way to avoid extending the core record batch concept with
> something that isn't related to processing (at least in your initial
> proposal).

Per the above, this seems like a reasonable approach to me if we want to hold off on buffer-level compression. Another use case for buffer/record-batch level compression would be the Feather file format, where it allows decompressing only a subset of columns/rows. If this use case isn't compelling, I'd be happy to hold off on adding compression to sparse batches until we have benchmarks showing the trade-off between channel-level and buffer-level compression. If we implement buffer-level encodings, we should also see a decent-sized win on space even without compression.

Thanks,
Micah

[1] https://github.com/lemire/FastPFor

On Fri, Jul 5, 2019 at 1:48 PM Jacques Nadeau <jacq...@apache.org> wrote:

> One question and a random thought:
>
> What is the driving force for transport compression? Are you seeing that
> as a major bottleneck in particular circumstances? (I'm not disagreeing,
> just want to clearly define the particular problem you're worried about.)
>
> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal).
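P.S. A quick sketch of the small-batch tradeoff mentioned above. This is not Arrow code; the buffer sizes, the repetitive value pattern, and the choice of zlib are my own illustrative assumptions, just to show why a general-purpose codec adds little for small buffers while larger buffers benefit:

```python
import struct
import zlib

def int32_buffer(n):
    """Serialize n int32 values with a repetitive pattern (compressible)."""
    return b"".join(struct.pack("<i", i % 16) for i in range(n))

# Compare compression ratios as the buffer grows. Fixed codec overhead
# (headers, checksums) dominates for small buffers, so the ratio improves
# substantially with size.
for n in (16, 128, 4096):
    raw = int32_buffer(n)
    compressed = zlib.compress(raw)
    print(f"{n:5d} values: {len(raw):6d} -> {len(compressed):5d} bytes "
          f"(ratio {len(raw) / len(compressed):.1f}x)")
```

The same shape argues for the buffer-level approach: a specialized integer encoding like PFOR can beat a general-purpose codec on exactly these small, typed buffers where zlib's fixed overhead eats the gains.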