Re: [Discuss][Java] 64-bit lengths for ValueVectors

Micah Kornfield Sat, 10 Aug 2019 14:17:45 -0700

Hi Jacques,
What avenue were you thinking for supporting both paths?   I didn't want to
pursue a different class hierarchy, because I felt like that would
effectively fork the code base, but that is potentially an option that
would allow us to have a complete reference implementation in Java that can
fully interact with C++, without major changes to this code.


For supporting both APIs on the same classes/interfaces, I think they
roughly fall into three categories, changes to input parameters, changes to
output parameters and algorithm changes.

For inputs, changing from int to long is essentially a no-op from the
compiler perspective.  From the limited micro-benchmarking this also
doesn't seem to have a performance impact.  So we could keep two versions
of the methods that only differ on inputs, but it is not clear what the
value of that would be.

For outputs, we can't support methods "long getLength()" and "int
getLength()" in the same class, so we would be forced into something like
"long getLength(boolean dummy)" which I think is a less desirable.

For algorithm changes, there did not appear to be too many places where we
actually loop over all elements (it is quite possible I missed something
here), the ones that I did find I was able to mitigate performance
penalties as noted above.  Some of the current implementation will get a
lot slower for "large arrays", but we can likely fix those later or in this
PR with a nested while loop instead of 2 for loops.

Thanks,
Micah


On Saturday, August 10, 2019, Jacques Nadeau <[email protected]> wrote:

> This is a pretty massive change to the apis. I wonder how nasty it would
> be to just support both paths. Have you evaluated how complex that would be?
>
> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <[email protected]>
> wrote:
>
>> After more investigation, it looks like Float8Benchmarks at least on my
>> machine are within the range of noise.
>>
>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>> BitVectorHelper benchmarks back inline (and even with some improvement for
>> getNullCountBenchmark).
>>
>> Benchmark                                        Mode  Cnt   Score   Error
>>  Units
>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ± 0.031
>>  ns/op
>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ± 0.141
>>  ns/op
>>
>> I applied the same pattern to other loops that I could find, and for any
>> "for (long" loop on the critical path, I broke it up into two loops.  the
>> first loop does iteration by integer, the second finishes off for any long
>> values.  As a side note it seems like optimization for loops using long
>> counters at least have a semi-recent open bug for the JVM [2]
>>
>> Thanks,
>> Micah
>>
>> [1]
>>
>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>> > difference.  I didn't initially run the benchmarks with these turned on
>> > (you can see the result from above with Float8Benchmarks).  Here are new
>> > numbers including with the flags enabled.  It looks like using longs
>> might
>> > be a little bit slower, I'll see what I can do to mitigate this.
>> >
>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>> > code on today's sync call.
>> >
>> > New
>> >
>> > Benchmark                                        Mode  Cnt   Score
>>  Error
>> > Units
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ±
>> 1.292
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ±
>> 0.700
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057  us/op
>> >
>> >
>> >
>> > old
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ±
>> 0.030
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ±
>> 0.188
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027  us/op
>> >
>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <[email protected]> wrote:
>> >
>> >> Hi Gonzalo,
>> >>
>> >> Thanks for sharing the performance results.
>> >> I am wondering if you have turned off the flag
>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> >> If not, the lower throughput should be expected.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <[email protected]
>> >
>> >> wrote:
>> >>
>> >>> Hi Gonzalo,
>> >>> Thank you for the feedback.  I wasn't aware of the JIT implications.
>>  At
>> >>> least on the benchmark run they don't seem to have an impact.
>> >>>
>> >>> If there are other benchmarks that people have that can validate if
>> this
>> >>> change will be problematic I would appreciate trying to run them with
>> the
>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
>> >>> there
>> >>> is a change in those.
>> >>>
>> >>> -Micah
>> >>>
>> >>>
>> >>>
>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>> >>> [email protected]> wrote:
>> >>>
>> >>> > I would recommend to take care with this kind of changes.
>> >>> >
>> >>> > I didn't try Arrow in more than one year, but by then the
>> performance
>> >>> was
>> >>> > quite bad in comparison with plain byte buffer access
>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>> >>> > there are several optimizations that the JVM (specifically, C2) does
>> >>> not
>> >>> > apply when dealing with int instead of longs. One of the
>> >>> > most commons is the loop unrolling and vectorization.
>> >>> >
>> >>> > * It doesn't seem the best way to reference an old email on the
>> list,
>> >>> but
>> >>> > it is the only result shown by Google
>> >>> >
>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<[email protected]>)
>> >>> > escribió:
>> >>> >
>> >>> >> Hi Micah,
>> >>> >>
>> >>> >> Thanks for your effort. The performance result looks good.
>> >>> >>
>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes
>> for
>> >>> each
>> >>> >> of length, write index, and read index).
>> >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>> >>> >> BaseVariableWidthVector, etc.
>> >>> >>
>> >>> >> IMO, such overheads are small enough to justify the change.
>> >>> >> Let's check if there are other overheads.
>> >>> >>
>> >>> >> Best,
>> >>> >> Liya Fan
>> >>> >>
>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
>> [email protected]
>> >>> >
>> >>> >> wrote:
>> >>> >>
>> >>> >> > Hi Liya Fan,
>> >>> >> > Based on the Float8Benchmark there does not seem to be any
>> >>> meaningful
>> >>> >> > performance difference on my machine.  At least for me, the
>> >>> benchmarks
>> >>> >> are
>> >>> >> > not stable enough to say one is faster than the other (I've
>> pasted
>> >>> >> results
>> >>> >> > below).  That being said my machine isn't necessarily the most
>> >>> reliable
>> >>> >> for
>> >>> >> > benchmarking.
>> >>> >> >
>> >>> >> > On an intuitive level, this makes sense to me,  for the most
>> part it
>> >>> >> seems
>> >>> >> > like the change just moves casting from "int" to "long" further
>> up
>> >>> the
>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are other
>> >>> >> benchmarks
>> >>> >> > that you think are worth running let me know.
>> >>> >> >
>> >>> >> > One downside performance wise I think for his change is it
>> >>> increases the
>> >>> >> > size of ArrowBuf objects, which I suppose could influence cache
>> >>> misses
>> >>> >> at
>> >>> >> > some level or increase the size of call-stacks, but this doesn't
>> >>> seem to
>> >>> >> > show up in the benchmark..
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > Micah
>> >>> >> >
>> >>> >> > Sample benchmark numbers:
>> >>> >> >
>> >>> >> > [New Code]
>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>> >>> Units
>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ± 0.469
>> >>> us/op
>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ± 0.115
>> >>> us/op
>> >>> >> >
>> >>> >> >
>> >>> >> > [Old code]
>> >>> >> > Benchmark                            Mode  Cnt   Score   Error
>> >>> Units
>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ± 1.409
>> >>> us/op
>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ± 0.084
>> >>> us/op
>> >>> >> >
>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <[email protected]>
>> >>> wrote:
>> >>> >> >
>> >>> >> >> Hi Micah,
>> >>> >> >>
>> >>> >> >> Thanks a lot for doing this.
>> >>> >> >>
>> >>> >> >> I am a little concerned about if there is any negative
>> performance
>> >>> >> impact
>> >>> >> >> on the current 32-bit-length based applications.
>> >>> >> >> Can we do some performance comparison on our existing
>> benchmarks?
>> >>> >> >>
>> >>> >> >> Best,
>> >>> >> >> Liya Fan
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
>> >>> [email protected]>
>> >>> >> >> wrote:
>> >>> >> >>
>> >>> >> >>> There have been some previous discussions on the mailing about
>> >>> >> supporting
>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the IPC
>> >>> >> specification
>> >>> >> >>> and C++ support).  I created a PR [1] that changes all APIs
>> that I
>> >>> >> could
>> >>> >> >>> find that take an index to take an "long" instead of an "int"
>> (and
>> >>> >> >>> similarly change "size/rowcount" APIs).
>> >>> >> >>>
>> >>> >> >>> It is a big change, so I think it is worth discussing if it is
>> >>> >> something
>> >>> >> >>> we
>> >>> >> >>> still want to move forward with.  It would be nice to come to a
>> >>> >> >>> conclusion
>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge
>> >>> >> conflicts.
>> >>> >> >>>
>> >>> >> >>> The reason I did this work now is the C++ implementation has
>> added
>> >>> >> >>> support
>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on
>> >>> prior
>> >>> >> >>> discussions we need to have similar support in Java before our
>> >>> next
>> >>> >> >>> release. Support 64-bit indexes means we can have full
>> >>> compatibility
>> >>> >> and
>> >>> >> >>> make the most use of the types in Java.
>> >>> >> >>>
>> >>> >> >>> Look forward to hearing feedback.
>> >>> >> >>>
>> >>> >> >>> Thanks,
>> >>> >> >>> Micah
>> >>> >> >>>
>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
>> >>> >> >>>
>> >>> >> >>
>> >>> >>
>> >>> >
>> >>>
>> >>
>>
>

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Reply via email to