Re: [DISCUSS][Java] Support non-nullable vectors

Wes McKinney Thu, 10 Sep 2020 08:32:11 -0700

I agree with Jacques here. Perhaps what is needed is an unsafe
non-nullable array accessor layer, then there is no need for flags
etc. We've already been writing a lot of such code in C++ (splitting
between no-nulls and some-nulls paths, see also the BitBlockCounter
stuff we've been doing, is such a thing available in Java yet?) and it
hasn't seemed especially burdensome to me.


On Thu, Sep 10, 2020 at 9:37 AM Jacques Nadeau <jacq...@apache.org> wrote:
>
> This change is undesirable as it optimizes one path and makes several
> others behave in unintended ways.  What happens if a vector with nulls
> shows up? What happens if a user sets a position to a null value in user
> code when this flag set?
>
> If the answer to the above questions is the use is an advanced user, then
> why can't they just call:
> PlatformDependent.getInt(vector.memoryAddress() + position * 4).
>
> Why would we introduce something directly in the vector class for this
> specialized use case? If the user is advanced, that short memory access
> invocation seems fine to use. The whole idea with Arrow is that if you have
> a specialized algorithm, you can hand write memory reads and writes because
> you have predictable memory layout.
>
> On Wed, Sep 9, 2020 at 1:11 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi all,
> >
> > Thanks a lot for your previous feedback.
> >
> > Now we have made some investigation and prepared an initial PR supporting
> > the non-nullable IntVector [1], as this represents a common scenario.
> > Some initial observations and conclusions can be made.
> >
> > The basic idea of the PR is to provide a global static final flag
> > (NON_NULLABLE_VECTORS_ENABLED) to enable/disable the new feature.
> > 1. When the flag is enabled, manipulations to the validity buffer would be
> > by-passed if the vector is non-nullable.
> > 2. When the flag is disabled, the behavior is identical to the original
> > code.
> >
> > *First*, we want to show that the change is small. To support the
> > non-nullable IntVector, we need to change classes IntVector,
> > BaseFixedWidthVector (the superclass of IntVector), and BaseValueVector
> > (the superclass of BaseFixedWidthVector). The amount of changes (lines of
> > code) to each class is given below
> >
> > Class
> >
> > Additions (#lines)
> >
> > Deletions (#lines)
> >
> > Total # lines
> >
> > IntVector
> >
> > 12
> >
> > 5
> >
> > 370
> >
> > BaseFixedWidthVector
> >
> > 30
> >
> > 11
> >
> > 925
> >
> > BaseValueVector
> >
> > 7
> >
> > 0
> >
> > 240
> >
> >  It can be seen that the change is small, relative to the class size. In
> > addition, to support additional vector types, we only need to change the
> > sub-classes, and no more need to change the super classes.
> >
> > *Second*, we want to show that the performance improvement is notable. To
> > see this, we give the performance data of the IntBenchmark (with some
> > benchmarks added in the PR). To make the performance as good as possible,
> > we enable the ARROW_ENABLE_UNSAFE_MEMORY_ACCESS flag and disable
> > the ARROW_ENABLE_NULL_CHECK_FOR_GET flag. Below, we give the data with the
> > non-nullable vector flag turned off and on, respectively.
> >
> >
> >
> >
> >
> >
> >
> >
> > *(Vector non-nullable flag off)Benchmark                         Mode  Cnt
> >      Score     Error  UnitsIntBenchmarks.getInt              avgt    5
> >  384.948 ±   3.336  ns/opIntBenchmarks.isNull              avgt    5
> > 1301.005 ±  54.239  ns/opIntBenchmarks.setIntDirectly      avgt    5
> >  15387.486 ± 555.749  ns/opIntBenchmarks.setWithValueHolder  avgt    5
> >  15251.351 ± 134.286  ns/opIntBenchmarks.setWithWriter       avgt    5
> >  28595.586 ± 932.528  ns/op*
> >
> > *(Vector non-nullable flag on)*
> >
> >
> >
> >
> > *IntBenchmarks.getInt              avgt    5    384.630 ±    1.547
> >  ns/opIntBenchmarks.isNull              avgt    5      3.004 ±    0.110
> >  ns/opIntBenchmarks.setIntDirectly      avgt    5  13511.605 ±  135.455
> >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  13035.883 ±  196.081
> >  ns/opIntBenchmarks.setWithWriter       avgt    5  24734.825 ± 1603.708
> >  ns/op*
> >
> > For the *getInt *operation, there is little performance difference. This
> > is because we disable the  ARROW_ENABLE_NULL_CHECK_FOR_GET flag, so
> > manipulations to the validity buffer are by-passed, even if the
> > non-nullable vector flag is off. When the  ARROW_ENABLE_NULL_CHECK_FOR_GET
> > is enabled, there is a 72.4% performance improvement gained by turning on
> > the non-nullable vector flag.
> >
> > For the isNull operation, we see 3 orders of magnitude performance
> > improvements by enabling the non-nullable vector flag.
> >
> > For other operations, we see 12.2%, 13.5% and 14.5% performance
> > improvements by turning on the non-nullable vector flag.
> >
> > So it can be seen that notable performance improvements can be gained for
> > non-nullable vectors.
> >
> > *Third*, we want to claim that for nullable vectors and scenarios when we
> > turn off the non-nullable vector flag, the new changes do not introduce
> > performance regression. Such concern is plausible, as our changes add some
> > if-else branches to the code, which may degrade performance.
> >
> > We give the benchmark results of the original code, as below.
> >
> > *(original ocde)*
> >
> >
> >
> >
> >
> > *Benchmark                         Mode  Cnt      Score     Error
> >  UnitsIntBenchmarks.getInt              avgt    5    383.511 ±   0.156
> >  ns/opIntBenchmarks.isNull              avgt    5   1274.271 ±  19.092
> >  ns/opIntBenchmarks.setIntDirectly      avgt    5  15162.219 ± 194.956
> >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  15247.640 ± 153.103
> >  ns/opIntBenchmarks.setWithWriter       avgt    5  28587.780 ± 160.458
> >  ns/op*
> >
> > By comparing this set of results with the above results with non-nullable
> > vector flag disabled, little performance difference can be observed,
> > indicating no performance regression.
> >
> > By examining the generated assembly, it can be seen that JIT is smart
> > enough to remove the if-else branch completely (the below screen-shot gives
> > an example where the if branch is optimized away in the assembly)
> >
> > So if some users do not like this feature, because their vectors are
> > always nullable, they can simply disable the flag, and no performance
> > difference can be observed.
> >
> > Would you please give some feedback?
> > If it looks good to you, maybe we can go ahead to support other types of
> > vectors.
> > Thank you in advance.
> >
> > Liya Fan
> >
> > [image: image.png]
> >
> > [1] https://github.com/apache/arrow/pull/8147
> >
> > On Fri, Mar 13, 2020 at 9:47 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> >> Hi Jacques,
> >>
> >> Thanks a lot for your valuable comments.
> >>
> >> I agree with you that collapsing nullable and non-nullable
> >> implementations is a good idea, and it does not contradict with the idea of
> >> introducing a fast code path, if it does not introduce much cost or code
> >> complexity.
> >>
> >> The idea of word level checking is interesting.
> >>
> >> As you suggested, we will do more investigations and reconsider the
> >> problem from a broader perspective.
> >>
> >> Best,
> >> Liya Fan
> >>
> >>
> >>
> >> On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >>> Generally Ive found that this isnt an important optimization in the use
> >>> cases we see. Memory overhead, especially with our Java shared allocation
> >>> scheme is nominal. Optimizing null checks at the word level usually is
> >>> much
> >>> more impactful since non null and null runs are much more common on a
> >>> shorter window common than they seeing those declared.
> >>>
> >>> In other words, I'd suggest you look at your problem with a broader
> >>> perspective and see whether you're actually focused on optimizing the
> >>> most
> >>> important dimension.
> >>>
> >>> As an aside, the original Arrow Java code actually treated these concepts
> >>> more distinctly and we consciously made a decision to collapse them to
> >>> simplify real world use.
> >>>
> >>> I do think it makes to add a dirty read interface if you want. This would
> >>> allow consumers of the interface to behave efficiently if they wanted to.
> >>>
> >>> One last note, efficient evaluation and processing should generally
> >>> always
> >>> work at the validity word level. Adding an extra if check at the word
> >>> versus extra complexity of having an early out per batch seems like a
> >>> pretty small life in the grand scheme of processing.
> >>>
> >>> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hulet...@gmail.com> wrote:
> >>>
> >>> > > And there is a "nullable" metadata-only flag at the
> >>> > > Field level. Could the same kinds of optimizations be implemented in
> >>> > > Java without introducing a "nullable" concept?
> >>> >
> >>> > Note Liya Fan did suggest pulling the nullable flag from the Field
> >>> when the
> >>> > vector is created in item (1) of the proposed changes.
> >>> >
> >>> > Brian
> >>> >
> >>> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >>> >
> >>> > > Hi Micah,
> >>> > >
> >>> > > Thanks a lot for your valuable comments. Please see my comments
> >>> inline.
> >>> > >
> >>> > > > I'm a little concerned that this will change assumptions for at
> >>> least
> >>> > > some
> >>> > > > of the clients using the library (some might always rely on the
> >>> > validity
> >>> > > > buffer being present).
> >>> > >
> >>> > > I can understand your concern and I am also concerned.
> >>> > > IMO, the client should not depend on this assumption, as the
> >>> > specification
> >>> > > says "Arrays having a 0 null count may choose to not allocate the
> >>> > validity
> >>> > > bitmap." [1]
> >>> > > That being said, I think it would be safe to provide a global flag to
> >>> > > switch on/off the feature (as you suggested).
> >>> > >
> >>> > > > I think this is a good feature to have for the reasons you
> >>> mentioned.
> >>> > It
> >>> > > > seems like there would need to be some sort of configuration bit
> >>> to set
> >>> > > for
> >>> > > > this behavior.
> >>> > >
> >>> > > Good suggestion. We should be able to switch on and off the feature
> >>> with
> >>> > a
> >>> > > single global flag.
> >>> > >
> >>> > > > But, I'd be worried about code complexity this would
> >>> > > > introduce.
> >>> > >
> >>> > > I agree with you that code complexity is an important factor to
> >>> consider.
> >>> > > IMO, our proposal should not involve too much code change, or
> >>> increase
> >>> > code
> >>> > > complexity too much.
> >>> > > To prove this, maybe we need to show some small experimental code
> >>> change.
> >>> > >
> >>> > > Best,
> >>> > > Liya Fan
> >>> > >
> >>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
> >>> > >
> >>> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <
> >>> emkornfi...@gmail.com>
> >>> > > wrote:
> >>> > >
> >>> > > > Hi Liya Fan,
> >>> > > > I'm a little concerned that this will change assumptions for at
> >>> least
> >>> > > some
> >>> > > > of the clients using the library (some might always rely on the
> >>> > validity
> >>> > > > buffer being present).
> >>> > > >
> >>> > > > I think this is a good feature to have for the reasons you
> >>> mentioned.
> >>> > It
> >>> > > > seems like there would need to be some sort of configuration bit
> >>> to set
> >>> > > for
> >>> > > > this behavior. But, I'd be worried about code complexity this would
> >>> > > > introduce.
> >>> > > >
> >>> > > > Thanks,
> >>> > > > Micah
> >>> > > >
> >>> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <liya.fa...@gmail.com>
> >>> wrote:
> >>> > > >
> >>> > > > > Hi Wes,
> >>> > > > >
> >>> > > > > Thanks a lot for your quick reply.
> >>> > > > > I think what you mentioned is almost exactly what we want to do
> >>> in
> >>> > > > Java.The
> >>> > > > > concept is not important.
> >>> > > > >
> >>> > > > > Maybe there are only some minor differences:
> >>> > > > > 1. In C++, the null_count is mutable, while for Java, once a
> >>> vector
> >>> > is
> >>> > > > > constructed as non-nullable, its null count can only be 0.
> >>> > > > > 2. In C++, a non-nullable array's validity buffer is null, while
> >>> in
> >>> > > Java,
> >>> > > > > the buffer is an empty buffer, and cannot be changed.
> >>> > > > >
> >>> > > > > Best,
> >>> > > > > Liya Fan
> >>> > > > >
> >>> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <
> >>> wesmck...@gmail.com>
> >>> > > > wrote:
> >>> > > > >
> >>> > > > > > hi Liya,
> >>> > > > > >
> >>> > > > > > In C++ we elect certain faster code paths when the null count
> >>> is 0
> >>> > or
> >>> > > > > > computed to be zero. When the null count is 0, we do not
> >>> allocate a
> >>> > > > > > validity bitmap. And there is a "nullable" metadata-only flag
> >>> at
> >>> > the
> >>> > > > > > Field level. Could the same kinds of optimizations be
> >>> implemented
> >>> > in
> >>> > > > > > Java without introducing a "nullable" concept?
> >>> > > > > >
> >>> > > > > > - Wes
> >>> > > > > >
> >>> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <liya.fa...@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > > > > >
> >>> > > > > > > Dear all,
> >>> > > > > > >
> >>> > > > > > > A non-nullable vector is one that is guaranteed to contain no
> >>> > > nulls.
> >>> > > > We
> >>> > > > > > > want to support non-nullable vectors in Java.
> >>> > > > > > >
> >>> > > > > > > *Motivations:*
> >>> > > > > > > 1. It is widely used in practice. For example, in a database
> >>> > > engine,
> >>> > > > a
> >>> > > > > > > column can be declared as not null, so it cannot contain null
> >>> > > values.
> >>> > > > > > > 2.Non-nullable vectors has significant performance advantages
> >>> > > > compared
> >>> > > > > > with
> >>> > > > > > > their nullable conterparts, such as:
> >>> > > > > > >   1) the memory space of the validity buffer can be saved.
> >>> > > > > > >   2) manipulation of the validity buffer can be bypassed
> >>> > > > > > >   3) some if-else branches can be replaced by sequential
> >>> > > instructions
> >>> > > > > (by
> >>> > > > > > > the JIT compiler), leading to high throughput for the CPU
> >>> > pipeline.
> >>> > > > > > >
> >>> > > > > > > *Potential Cost:*
> >>> > > > > > > For nullable vectors, there can be extra checks against the
> >>> > > > > nullablility
> >>> > > > > > > flag. So we must change the code in a way that minimizes the
> >>> > cost.
> >>> > > > > > >
> >>> > > > > > > *Proposed Changes:*
> >>> > > > > > > 1. There is no need to create new vector classes. We add a
> >>> final
> >>> > > > > boolean
> >>> > > > > > to
> >>> > > > > > > the vector base classes as the nullability flag. The value
> >>> of the
> >>> > > > flag
> >>> > > > > > can
> >>> > > > > > > be obtained from the field when creating the vector.
> >>> > > > > > > 2. Add a method "boolean isNullable()" to the root interface
> >>> > > > > ValueVector.
> >>> > > > > > > 3. If a vector is non-nullable, its validity buffer should
> >>> be an
> >>> > > > empty
> >>> > > > > > > buffer (not null, so much of the existing logic can be left
> >>> > > > unchanged).
> >>> > > > > > > 4. For operations involving validity buffers (e.g. isNull,
> >>> get,
> >>> > > set),
> >>> > > > > we
> >>> > > > > > > use the nullability flag to bypass manipulations to the
> >>> validity
> >>> > > > > buffer.
> >>> > > > > > >
> >>> > > > > > > Therefore, it should be possible to support the feature with
> >>> > small
> >>> > > > code
> >>> > > > > > > changes.
> >>> > > > > > >
> >>> > > > > > > BTW, please note that similar behaviors have already been
> >>> > supported
> >>> > > > in
> >>> > > > > > C++.
> >>> > > > > > >
> >>> > > > > > > Would you please give your valueable feedback?
> >>> > > > > > >
> >>> > > > > > > Best,
> >>> > > > > > > Liya Fan
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>

Re: [DISCUSS][Java] Support non-nullable vectors

Reply via email to