>
> Having utility algorithms to perform data transformations seems fine
> if there is a use for them and maintaining the code in the Arrow
> libraries makes sense.

In principle I agree.   The genesis of this discussion was a request I made
to Liya Fan on the PR for this algorithm [1].

Specifically, I asked for more documentation around two points:
1.  An overview of what type of algorithms will be added going forward.  I
think this is more of a question of what is generally useful to the
community/Java ecosystem.
2.  How can we ensure the best possible performance (a point previously
raised by Jacques for adapters).  Right now most of the algorithms that
have been contributed cannot get JITed very well due to megamorphic virtual
calls.

I think these two points are essential for getting to a point where the
algorithms package isn't labelled as contrib/experimental.  As I pointed
out on the PR, if we aren't moving in that direction, then I, at least,
need to be cognizant of that when allocating my time for reviews.
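
To make point 2 concrete, here is a minimal sketch in plain Java of the
kind of call site I mean. It uses no Arrow classes; IntSource and its
three implementations are hypothetical stand-ins for vector-level
accessors. Once a shared loop has seen three or more receiver types,
HotSpot typically treats the call as megamorphic and stops inlining it,
whereas a loop specialized to the underlying storage has no virtual
dispatch in its body.

```java
public class CallSiteSketch {
    /** Hypothetical stand-in for a vector-level accessor; not an Arrow interface. */
    public interface IntSource {
        int size();
        int get(int index);
    }

    /** Reads values from a backing array. */
    public static final class ArraySource implements IntSource {
        private final int[] values;
        public ArraySource(int[] values) { this.values = values; }
        public int size() { return values.length; }
        public int get(int index) { return values[index]; }
    }

    /** Returns the same value at every index. */
    public static final class ConstantSource implements IntSource {
        private final int value;
        private final int size;
        public ConstantSource(int value, int size) { this.value = value; this.size = size; }
        public int size() { return size; }
        public int get(int index) { return value; }
    }

    /** Returns the index itself: 0, 1, 2, ... */
    public static final class RangeSource implements IntSource {
        private final int size;
        public RangeSource(int size) { this.size = size; }
        public int size() { return size; }
        public int get(int index) { return index; }
    }

    // Shared loop: source.get(i) is a single call site that sees every
    // implementation passed in, so it can become megamorphic.
    public static long sumViaInterface(IntSource source) {
        long total = 0;
        for (int i = 0; i < source.size(); i++) {
            total += source.get(i);
        }
        return total;
    }

    // Specialized loop over raw storage: no virtual call in the body.
    public static long sumDirect(int[] values) {
        long total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        // Three receiver types flow through the same call site.
        System.out.println(sumViaInterface(new ArraySource(data)));
        System.out.println(sumViaInterface(new ConstantSource(5, 3)));
        System.out.println(sumViaInterface(new RangeSource(4)));
        System.out.println(sumDirect(data));
    }
}
```

Both loops compute the same sums; the difference is only in what the JIT
can inline, which would need to be confirmed with a benchmark (e.g. JMH)
rather than taken from this sketch.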

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5235

On Wed, Sep 4, 2019 at 12:05 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi,
>
> Having utility algorithms to perform data transformations seems fine
> if there is a use for them and maintaining the code in the Arrow
> libraries makes sense.
>
> I don't understand point #2 "We can transform them to delta vectors
> before IPC". It sounds like you are proposing a data compression
> technique. Should this be a part of the
> sparseness/encoding/compression discussion?
>
> - Wes
>
> On Sun, Sep 1, 2019 at 10:14 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > Dear all,
> >
> > We want to support a feature for conversions between delta vector and
> > partial sum vector. Please give your valuable feedback.
> >
> > Best,
> >
> > Liya Fan
> >
> > What is a delta vector/partial sum vector?
> >
> > Given an integer vector a with length n, its partial sum vector is another
> > integer vector b with length n + 1, with values defined as:
> >
> > b(0) = initial sum
> > b(i) = b(0) + a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
> >
> > Given an integer vector a with length n + 1, its delta vector is another
> > integer vector b with length n, with values defined as:
> >
> > b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
> >
> > In this issue, we provide utilities to convert between delta vectors and
> > partial sum vectors. It is interesting to note that the two operations
> > correspond to discrete integration and differentiation.
> >
> > These conversions have wide applications. For example,
> >
> >    1. The run-length vector proposed by Micah is based on the partial sum
> >       vector, while the deduplication functionality is based on the delta
> >       vector. This issue provides conversions between them.
> >    2. The current VarCharVector/VarBinaryVector implementations are based on
> >       partial sum vectors. We can transform them to delta vectors before
> >       IPC, to reduce network traffic.
> >    3. Converting to delta can be considered a form of data compression.
> >       The operation can be applied more than once to further reduce the
> >       data volume.
> >
> > Points to discuss:
> > Should the API be provided at the level of vector, ArrowBuf, or both?
> > 1. If it is based on vector, there can be performance overhead due to
> > virtual method calls.
> > 2. If it is based on ArrowBuf, some underlying details (type width) are
> > exposed to the end user, which violates the principle of
> > encapsulation.
>
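
For reference, the two conversions defined in the quoted proposal can be
sketched in plain Java on int arrays. This is only an illustration: it
assumes b(0) holds the initial sum and each later entry adds one more
delta, and the class and method names are hypothetical, not the API
proposed in the PR.

```java
import java.util.Arrays;

public class DeltaPartialSumSketch {
    // Discrete integration: n deltas plus an initial sum give n + 1 partial sums.
    public static int[] toPartialSum(int[] delta, int initialSum) {
        int[] sums = new int[delta.length + 1];
        sums[0] = initialSum;
        for (int i = 0; i < delta.length; i++) {
            sums[i + 1] = sums[i] + delta[i];
        }
        return sums;
    }

    // Discrete differentiation: n + 1 partial sums give n deltas.
    public static int[] toDelta(int[] sums) {
        if (sums.length == 0) {
            throw new IllegalArgumentException("a partial sum vector has at least one entry");
        }
        int[] delta = new int[sums.length - 1];
        for (int i = 0; i < delta.length; i++) {
            delta[i] = sums[i + 1] - sums[i];
        }
        return delta;
    }

    public static void main(String[] args) {
        // Value lengths and the offsets derived from them, in the spirit of
        // the variable-width vector example in the proposal.
        int[] lengths = {3, 0, 5, 2};
        int[] offsets = toPartialSum(lengths, 0);
        System.out.println(Arrays.toString(offsets));          // [0, 3, 3, 8, 10]
        System.out.println(Arrays.toString(toDelta(offsets))); // [3, 0, 5, 2]
    }
}
```

Converting to deltas and back recovers the original partial sums only if
the initial sum b(0) is kept alongside the deltas.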
