After more investigation, it looks like Float8Benchmarks at least on my machine are within the range of noise.
For BitVectorHelper I pushed a new commit [1], seems to bring the BitVectorHelper benchmarks back inline (and even with some improvement for getNullCountBenchmark). Benchmark Mode Cnt Score Error Units BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 3.821 ± 0.031 ns/op BitVectorHelperBenchmarks.getNullCountBenchmark avgt 5 14.884 ± 0.141 ns/op I applied the same pattern to other loops that I could find, and for any "for (long" loop on the critical path, I broke it up into two loops. the first loop does iteration by integer, the second finishes off for any long values. As a side note it seems like optimization for loops using long counters at least have a semi-recent open bug for the JVM [2] Thanks, Micah [1] https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797 [2] https://bugs.openjdk.java.net/browse/JDK-8223051 On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > Indeed, the BoundChecking and CheckNullForGet variables can make a big > difference. I didn't initially run the benchmarks with these turned on > (you can see the result from above with Float8Benchmarks). Here are new > numbers including with the flags enabled. It looks like using longs might > be a little bit slower, I'll see what I can do to mitigate this. > > Ravindra also volunteered to try to benchmark the changes with Dremio's > code on today's sync call. > > New > > Benchmark Mode Cnt Score Error > Units > > BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 4.176 ± 1.292 > ns/op > > BitVectorHelperBenchmarks.getNullCountBenchmark avgt 5 26.102 ± 0.700 > ns/op > > Float8Benchmarks.copyFromBenchmark avgt 5 7.398 ± 0.084 us/op > > Float8Benchmarks.readWriteBenchmark avgt 5 2.711 ± 0.057 us/op > > > > old > > BitVectorHelperBenchmarks.allBitsNullBenchmark avgt 5 3.828 ± 0.030 > ns/op > > BitVectorHelperBenchmarks.getNullCountBenchmark avgt 5 20.611 ± 0.188 > ns/op > > Float8Benchmarks.copyFromBenchmark avgt 5 6.597 ± 0.462 us/op > > Float8Benchmarks.readWriteBenchmark avgt 5 2.615 ± 0.027 us/op > > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <liya.fa...@gmail.com> wrote: > >> Hi Gonzalo, >> >> Thanks for sharing the performance results. >> I am wondering if you have turned off the flag >> BoundsChecking#BOUNDS_CHECKING_ENABLED. >> If not, the lower throughput should be expected. >> >> Best, >> Liya Fan >> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> Hi Gonzalo, >>> Thank you for the feedback. I wasn't aware of the JIT implications. At >>> least on the benchmark run they don't seem to have an impact. >>> >>> If there are other benchmarks that people have that can validate if this >>> change will be problematic I would appreciate trying to run them with the >>> PR. I will try to run the ones for zeroing/popcnt tonight to see if >>> there >>> is a change in those. >>> >>> -Micah >>> >>> >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar < >>> golthir...@gmail.com> wrote: >>> >>> > I would recommend to take care with this kind of changes. >>> > >>> > I didn't try Arrow in more than one year, but by then the performance >>> was >>> > quite bad in comparison with plain byte buffer access >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and >>> > there are several optimizations that the JVM (specifically, C2) does >>> not >>> > apply when dealing with int instead of longs. One of the >>> > most commons is the loop unrolling and vectorization. >>> > >>> > * It doesn't seem the best way to reference an old email on the list, >>> but >>> > it is the only result shown by Google >>> > >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<liya.fa...@gmail.com>) >>> > escribió: >>> > >>> >> Hi Micah, >>> >> >>> >> Thanks for your effort. The performance result looks good. >>> >> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for >>> each >>> >> of length, write index, and read index). >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector, >>> >> BaseVariableWidthVector, etc. >>> >> >>> >> IMO, such overheads are small enough to justify the change. >>> >> Let's check if there are other overheads. >>> >> >>> >> Best, >>> >> Liya Fan >>> >> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <emkornfi...@gmail.com >>> > >>> >> wrote: >>> >> >>> >> > Hi Liya Fan, >>> >> > Based on the Float8Benchmark there does not seem to be any >>> meaningful >>> >> > performance difference on my machine. At least for me, the >>> benchmarks >>> >> are >>> >> > not stable enough to say one is faster than the other (I've pasted >>> >> results >>> >> > below). That being said my machine isn't necessarily the most >>> reliable >>> >> for >>> >> > benchmarking. >>> >> > >>> >> > On an intuitive level, this makes sense to me, for the most part it >>> >> seems >>> >> > like the change just moves casting from "int" to "long" further up >>> the >>> >> > stack for "PlatformDepdendent" operations. If there are other >>> >> benchmarks >>> >> > that you think are worth running let me know. >>> >> > >>> >> > One downside performance wise I think for his change is it >>> increases the >>> >> > size of ArrowBuf objects, which I suppose could influence cache >>> misses >>> >> at >>> >> > some level or increase the size of call-stacks, but this doesn't >>> seem to >>> >> > show up in the benchmark.. >>> >> > >>> >> > Thanks, >>> >> > Micah >>> >> > >>> >> > Sample benchmark numbers: >>> >> > >>> >> > [New Code] >>> >> > Benchmark Mode Cnt Score Error >>> Units >>> >> > Float8Benchmarks.copyFromBenchmark avgt 5 15.441 ± 0.469 >>> us/op >>> >> > Float8Benchmarks.readWriteBenchmark avgt 5 14.057 ± 0.115 >>> us/op >>> >> > >>> >> > >>> >> > [Old code] >>> >> > Benchmark Mode Cnt Score Error >>> Units >>> >> > Float8Benchmarks.copyFromBenchmark avgt 5 16.248 ± 1.409 >>> us/op >>> >> > Float8Benchmarks.readWriteBenchmark avgt 5 14.150 ± 0.084 >>> us/op >>> >> > >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <liya.fa...@gmail.com> >>> wrote: >>> >> > >>> >> >> Hi Micah, >>> >> >> >>> >> >> Thanks a lot for doing this. >>> >> >> >>> >> >> I am a little concerned about if there is any negative performance >>> >> impact >>> >> >> on the current 32-bit-length based applications. >>> >> >> Can we do some performance comparison on our existing benchmarks? >>> >> >> >>> >> >> Best, >>> >> >> Liya Fan >>> >> >> >>> >> >> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield < >>> emkornfi...@gmail.com> >>> >> >> wrote: >>> >> >> >>> >> >>> There have been some previous discussions on the mailing about >>> >> supporting >>> >> >>> 64-bit lengths for Java ValueVectors (this is what the IPC >>> >> specification >>> >> >>> and C++ support). I created a PR [1] that changes all APIs that I >>> >> could >>> >> >>> find that take an index to take an "long" instead of an "int" (and >>> >> >>> similarly change "size/rowcount" APIs). >>> >> >>> >>> >> >>> It is a big change, so I think it is worth discussing if it is >>> >> something >>> >> >>> we >>> >> >>> still want to move forward with. It would be nice to come to a >>> >> >>> conclusion >>> >> >>> quickly, ideally in the next few days, to avoid a lot of merge >>> >> conflicts. >>> >> >>> >>> >> >>> The reason I did this work now is the C++ implementation has added >>> >> >>> support >>> >> >>> for LargeList, LargeBinary and LargeString arrays and based on >>> prior >>> >> >>> discussions we need to have similar support in Java before our >>> next >>> >> >>> release. Support 64-bit indexes means we can have full >>> compatibility >>> >> and >>> >> >>> make the most use of the types in Java. >>> >> >>> >>> >> >>> Look forward to hearing feedback. >>> >> >>> >>> >> >>> Thanks, >>> >> >>> Micah >>> >> >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020 >>> >> >>> >>> >> >> >>> >> >>> > >>> >>