Micro-benchmarks for ntz and pop (BitUtils) operations.
-------------------------------------------------------
Key: LUCENE-2221
URL: https://issues.apache.org/jira/browse/LUCENE-2221
Project: Lucene - Java
Issue Type: Task
Components: Other
Reporter: Dawid Weiss
Priority: Trivial
As suggested by Yonik, I ran a suite of micro-benchmarks to investigate the
following:
* {{pop()}} (bit counting) appears to be implemented in the same way ("Hacker's
Delight" style) in the {{BitUtil}} class and in Sun's standard library
({{Long.bitCount}}, available since 1.5).
* Native intrinsics that should speed up {{bitCount}} significantly have
recently been added to HotSpot.
I ran the code on various VMs and architectures, but of course the results may
vary depending on the setting.
h2. Micro-benchmark code, evaluation
* The micro-benchmarks were written as JUnit tests with a custom set of rules
that repeats each test measuring execution time, gc use, etc.
* There were 5 warmup runs before each test, followed by 15 benchmarked runs.
The results contain overall times, round times and standard deviations where
applicable.
* There were several tests for isolated performance of {{BitUtil.pop()}}, JDK's
{{bitCount}}, {{BitUtil.ntz()}}, {{BitUtil.ntz2()}}, {{BitUtil.ntz3()}} and
JDK's {{numberOfTrailingZeros}}, the test code had the following loop:
{code}
final long [] longs = NtzPopBenchmark.random.bits;
int cnt = 0;
for (int i = longs.length; --i >= 0;)
{
cnt += Long.bitCount(longs[i]);
}
volatileVariable = cnt; // to prevent dead code removal.
{code}
* I also added another version of {{pop()}} based on precomputed bit counts.
This version is called {{pop2}}.
* The input array of long values was initialized to take 200 MB of memory.
There were two different sets: {{random}} (random values) and {{single}}
(a single bit rotated to the right in each long).
* There were tests that called {{BitUtil.pop_xor}} between the two input
bitsets (random, single).
* Additional tests that used iterators and other BitUtil operations showed
similar performance to isolated loops, so I omit them here.
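For readers who want to reproduce the setup, the measurement procedure
described above can be sketched roughly as follows. The class name, the array
size and the nanosecond bookkeeping are illustrative only, not the actual
JUnit rule code:

```java
public class BenchmarkSketch {
    // Written to from the benchmark loop so HotSpot cannot
    // remove the bit-counting work as dead code.
    public static volatile int volatileVariable;

    // Stand-in for the 200 MB input array of the real benchmark.
    static long[] bits = new long[1 << 20];

    // One benchmarked round: count bits over the whole array and
    // return elapsed wall-clock time in nanoseconds.
    static long measureOnce() {
        long start = System.nanoTime();
        int cnt = 0;
        for (int i = bits.length; --i >= 0;) {
            cnt += Long.bitCount(bits[i]);
        }
        volatileVariable = cnt; // prevent dead code removal
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int warmup = 5, rounds = 15;
        for (int i = 0; i < warmup; i++) measureOnce(); // discarded rounds
        double sum = 0, sumSq = 0;
        for (int i = 0; i < rounds; i++) {
            double t = measureOnce() / 1e9;
            sum += t;
            sumSq += t * t;
        }
        double mean = sum / rounds;
        double stddev = Math.sqrt(Math.max(0, sumSq / rounds - mean * mean));
        System.out.printf("round: %.2f [+- %.2f]%n", mean, stddev);
    }
}
```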
h2. Evaluation environment
I tested on three different machines:
* Pentium 4, 32-bit, 3 GHz, 2 GB RAM (Windows)
* AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
* Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)
and on various VMs:
* 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
* 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
* 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
* 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
* BEA JRockit.
* (and other minor versions of the VMs above, depending on the computer).
h2. Results overview
h3. {{pop}}
The times between {{BitUtil}} and JDK were mostly identical. However, on 32-bit
systems, precached {{pop2}} performed
*much* better. Examples:
{noformat}
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 15.61, time.warmup: 4.31,
time.bench: 11.30, round: 0.75 [+- 0.02]
test_POP_JDK_single : 15/20 rounds, time.total: 15.67, time.warmup: 4.31,
time.bench: 11.36, round: 0.76 [+- 0.02]
test_POP_BitUtil_random : 15/20 rounds, time.total: 15.55, time.warmup: 4.33,
time.bench: 11.22, round: 0.75 [+- 0.01]
test_POP_BitUtil_single : 15/20 rounds, time.total: 15.55, time.warmup: 4.31,
time.bench: 11.23, round: 0.75 [+- 0.01]
test_POP2_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.75,
time.bench: 4.94, round: 0.33 [+- 0.00]
test_POP2_single : 15/20 rounds, time.total: 4.66, time.warmup: 1.22,
time.bench: 3.44, round: 0.23 [+- 0.01]
{noformat}
Note the difference between random and single distributions -- most probably
due to more cache hits when referring to the
lookup table. Other VMs on this 32-bit machine:
{noformat}
# 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 20.67, time.warmup: 5.19,
time.bench: 15.48, round: 1.03 [+- 0.01]
test_POP_JDK_single : 15/20 rounds, time.total: 22.70, time.warmup: 5.63,
time.bench: 17.08, round: 1.14 [+- 0.01]
test_POP_BitUtil_random : 15/20 rounds, time.total: 22.69, time.warmup: 5.63,
time.bench: 17.06, round: 1.14 [+- 0.01]
test_POP_BitUtil_single : 15/20 rounds, time.total: 20.67, time.warmup: 5.19,
time.bench: 15.48, round: 1.03 [+- 0.01]
test_POP2_random : 15/20 rounds, time.total: 6.30, time.warmup: 1.63,
time.bench: 4.67, round: 0.31 [+- 0.01]
test_POP2_single : 15/20 rounds, time.total: 4.33, time.warmup: 1.16,
time.bench: 3.17, round: 0.21 [+- 0.01]
# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 15.28, time.warmup: 4.25,
time.bench: 11.03, round: 0.74 [+- 0.03]
test_POP_JDK_single : 15/20 rounds, time.total: 15.16, time.warmup: 4.20,
time.bench: 10.95, round: 0.73 [+- 0.01]
test_POP_BitUtil_random : 15/20 rounds, time.total: 15.12, time.warmup: 4.20,
time.bench: 10.92, round: 0.73 [+- 0.01]
test_POP_BitUtil_single : 15/20 rounds, time.total: 15.13, time.warmup: 4.25,
time.bench: 10.88, round: 0.73 [+- 0.01]
test_POP2_random : 15/20 rounds, time.total: 6.78, time.warmup: 1.72,
time.bench: 5.06, round: 0.34 [+- 0.01]
test_POP2_single : 15/20 rounds, time.total: 4.72, time.warmup: 1.20,
time.bench: 3.52, round: 0.23 [+- 0.02]
{noformat}
On 64-bit machines, the results are nearly equal, with pop2 performing slightly
worse on SUN's 1.6 compared to JDK and BitUtil
(but this difference is really tiny and not present on all VMs; see IBM J9 and
SUN's 1.5 below).
{noformat}
# 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81,
time.bench: 2.46, round: 0.16 [+- 0.00]
test_POP_JDK_single : 15/20 rounds, time.total: 3.11, time.warmup: 0.76,
time.bench: 2.34, round: 0.16 [+- 0.02]
test_POP_BitUtil_random : 15/20 rounds, time.total: 3.27, time.warmup: 0.81,
time.bench: 2.46, round: 0.16 [+- 0.00]
test_POP_BitUtil_single : 15/20 rounds, time.total: 3.03, time.warmup: 0.77,
time.bench: 2.26, round: 0.15 [+- 0.00]
test_POP2_random : 15/20 rounds, time.total: 3.63, time.warmup: 0.93,
time.bench: 2.70, round: 0.18 [+- 0.00]
test_POP2_single : 15/20 rounds, time.total: 3.51, time.warmup: 0.89,
time.bench: 2.62, round: 0.17 [+- 0.00]
# 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
test_POP_JDK_random : 15/20 rounds, time.total: 4.80, time.warmup: 1.24,
time.bench: 3.57, round: 0.24 [+- 0.01]
test_POP_JDK_single : 15/20 rounds, time.total: 5.00, time.warmup: 1.44,
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP_BitUtil_random : 15/20 rounds, time.total: 4.81, time.warmup: 1.24,
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP_BitUtil_single : 15/20 rounds, time.total: 4.75, time.warmup: 1.19,
time.bench: 3.56, round: 0.24 [+- 0.01]
test_POP2_random : 15/20 rounds, time.total: 3.65, time.warmup: 0.90,
time.bench: 2.76, round: 0.18 [+- 0.00]
test_POP2_single : 15/20 rounds, time.total: 3.82, time.warmup: 0.93,
time.bench: 2.89, round: 0.19 [+- 0.01]
# 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems
Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 3.72, time.warmup: 0.94,
time.bench: 2.78, round: 0.19 [+- 0.00]
test_POP_JDK_single : 15/20 rounds, time.total: 5.96, time.warmup: 1.40,
time.bench: 4.56, round: 0.30 [+- 0.00]
test_POP_BitUtil_random : 15/20 rounds, time.total: 6.16, time.warmup: 1.43,
time.bench: 4.73, round: 0.31 [+- 0.00]
test_POP_BitUtil_single : 15/20 rounds, time.total: 3.62, time.warmup: 0.92,
time.bench: 2.70, round: 0.18 [+- 0.00]
test_POP2_random : 15/20 rounds, time.total: 3.70, time.warmup: 0.96,
time.bench: 2.74, round: 0.18 [+- 0.00]
test_POP2_single : 15/20 rounds, time.total: 3.57, time.warmup: 0.93,
time.bench: 2.65, round: 0.18 [+- 0.00]
{noformat}
The other 64-bit machine (quad-core):
{noformat}
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_POP_JDK_random : 15/20 rounds, time.total: 2.46, time.warmup: 0.62,
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP_JDK_single : 15/20 rounds, time.total: 2.49, time.warmup: 0.62,
time.bench: 1.87, round: 0.12 [+- 0.01]
test_POP_BitUtil_random : 15/20 rounds, time.total: 2.47, time.warmup: 0.62,
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP_BitUtil_single : 15/20 rounds, time.total: 2.46, time.warmup: 0.62,
time.bench: 1.84, round: 0.12 [+- 0.00]
test_POP2_random : 15/20 rounds, time.total: 2.82, time.warmup: 0.71,
time.bench: 2.11, round: 0.14 [+- 0.00]
test_POP2_single : 15/20 rounds, time.total: 2.15, time.warmup: 0.55,
time.bench: 1.61, round: 0.11 [+- 0.00]
{noformat}
I then replaced {{BitUtil.pop}} with {{BitUtil.pop2}} in bit-counting methods
like xor/and/or. The results are intriguing.
On 32-bit systems, there is a measurable gain, as here:
{noformat}
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
test_pop_xor : 15/20 rounds, time.total: 9.78, time.warmup: 2.59,
time.bench: 7.19, round: 0.48 [+- 0.01]
test_pop2_hd_xor : 15/20 rounds, time.total: 8.27, time.warmup: 2.22,
time.bench: 6.05, round: 0.40 [+- 0.01]
# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
test_pop_xor : 15/20 rounds, time.total: 9.89, time.warmup: 2.59,
time.bench: 7.30, round: 0.49 [+- 0.02]
test_pop2_hd_xor : 15/20 rounds, time.total: 8.20, time.warmup: 2.24,
time.bench: 5.97, round: 0.40 [+- 0.01]
{noformat}
On 64-bit systems, where 64-bit values can be manipulated directly in
registers, there was nearly no speedup, or even a small performance penalty,
as here:
{noformat}
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_pop_xor : 15/20 rounds, time.total: 1.76, time.warmup: 0.49,
time.bench: 1.27, round: 0.09 [+- 0.00]
test_pop2_hd_xor : 15/20 rounds, time.total: 2.06, time.warmup: 0.55,
time.bench: 1.51, round: 0.10 [+- 0.00]
{noformat}
I'm guessing that referencing memory on these fast processors is slower than
manipulating registers.
h3. {{ntz}}
On JVMs prior to 1.7, the {{ntz}} version from Lucene was much faster in my
tests than the one from the JDK, but it also has greater variance depending
on the distribution of the input bits (compare the same routine for
{{random}} and {{single}} below).
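For context, the table-free routine being measured is in the spirit of the
following binary-search count of trailing zeros (a "Hacker's Delight" style
sketch, not a verbatim copy of any of the {{BitUtil.ntz*}} variants). Its
data-dependent branches are one plausible reason for the variance between the
two input distributions:

```java
public class NtzSketch {
    // Binary-search count of trailing zeros: halve the search
    // width at each step, accumulating shifted-out zero bits.
    public static int ntz(long x) {
        if (x == 0) return 64;
        int n = 0;
        if ((x & 0xFFFFFFFFL) == 0) { n += 32; x >>>= 32; }
        if ((x & 0xFFFFL) == 0)     { n += 16; x >>>= 16; }
        if ((x & 0xFFL) == 0)       { n += 8;  x >>>= 8;  }
        if ((x & 0xFL) == 0)        { n += 4;  x >>>= 4;  }
        if ((x & 0x3L) == 0)        { n += 2;  x >>>= 2;  }
        if ((x & 0x1L) == 0)        { n += 1; }
        return n;
    }
}
```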
{noformat}
# 32-bit system;
# 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
test_NTZ_JDK_random : 15/20 rounds, time.total: 6.69, time.warmup: 1.73,
time.bench: 4.95, round: 0.33 [+- 0.01]
test_NTZ_JDK_single : 15/20 rounds, time.total: 7.59, time.warmup: 1.94,
time.bench: 5.66, round: 0.38 [+- 0.01]
test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.72, time.warmup: 0.73,
time.bench: 1.98, round: 0.13 [+- 0.02]
test_NTZ_BitUtil_single : 15/20 rounds, time.total: 5.28, time.warmup: 1.34,
time.bench: 3.94, round: 0.26 [+- 0.02]
test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.06, time.warmup: 0.81,
time.bench: 2.25, round: 0.15 [+- 0.01]
test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 5.36, time.warmup: 1.34,
time.bench: 4.02, round: 0.27 [+- 0.01]
test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.80, time.warmup: 1.48,
time.bench: 4.31, round: 0.29 [+- 0.01]
test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 6.98, time.warmup: 1.81,
time.bench: 5.17, round: 0.34 [+- 0.01]
# 64-bit Athlon
# 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
test_NTZ_JDK_random : 15/20 rounds, time.total: 4.59, time.warmup: 1.16,
time.bench: 3.44, round: 0.23 [+- 0.00]
test_NTZ_JDK_single : 15/20 rounds, time.total: 6.64, time.warmup: 1.59,
time.bench: 5.04, round: 0.34 [+- 0.01]
test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.53,
time.bench: 1.56, round: 0.10 [+- 0.00]
test_NTZ_BitUtil_single : 15/20 rounds, time.total: 3.87, time.warmup: 0.98,
time.bench: 2.90, round: 0.19 [+- 0.00]
test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 2.09, time.warmup: 0.52,
time.bench: 1.57, round: 0.10 [+- 0.00]
test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 3.31, time.warmup: 0.84,
time.bench: 2.47, round: 0.16 [+- 0.00]
test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 3.31, time.warmup: 0.83,
time.bench: 2.48, round: 0.17 [+- 0.00]
test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 5.71, time.warmup: 1.39,
time.bench: 4.32, round: 0.29 [+- 0.00]
{noformat}
But then comes the 1.7 HotSpot and things change radically: on 32-bit systems
the JDK's version is much faster for nearly-empty {{long}} values:
{noformat}
# 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
test_NTZ_JDK_random : 15/20 rounds, time.total: 1.97, time.warmup: 0.61,
time.bench: 1.36, round: 0.09 [+- 0.01]
test_NTZ_JDK_single : 15/20 rounds, time.total: 2.53, time.warmup: 0.77,
time.bench: 1.77, round: 0.12 [+- 0.01]
test_NTZ_BitUtil_random : 15/20 rounds, time.total: 2.36, time.warmup: 0.66,
time.bench: 1.70, round: 0.11 [+- 0.01]
test_NTZ_BitUtil_single : 15/20 rounds, time.total: 4.50, time.warmup: 1.19,
time.bench: 3.31, round: 0.22 [+- 0.01]
test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 3.08, time.warmup: 0.81,
time.bench: 2.27, round: 0.15 [+- 0.01]
test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 4.97, time.warmup: 1.28,
time.bench: 3.69, round: 0.25 [+- 0.01]
test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 5.78, time.warmup: 1.48,
time.bench: 4.30, round: 0.29 [+- 0.01]
test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 7.77, time.warmup: 1.91,
time.bench: 5.86, round: 0.39 [+- 0.01]
{noformat}
On the 64-bit quad core:
{noformat}
# 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.,
test_NTZ_JDK_random : 15/20 rounds, time.total: 3.92, time.warmup: 0.97,
time.bench: 2.94, round: 0.20 [+- 0.00]
test_NTZ_JDK_single : 15/20 rounds, time.total: 3.80, time.warmup: 0.97,
time.bench: 2.82, round: 0.19 [+- 0.00]
test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.96, time.warmup: 0.25,
time.bench: 0.71, round: 0.05 [+- 0.00]
test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.74, time.warmup: 0.69,
time.bench: 2.04, round: 0.14 [+- 0.00]
test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.22, time.warmup: 0.31,
time.bench: 0.91, round: 0.06 [+- 0.00]
test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.18, time.warmup: 0.56,
time.bench: 1.62, round: 0.11 [+- 0.00]
test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.76, time.warmup: 0.71,
time.bench: 2.06, round: 0.14 [+- 0.00]
test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 3.47, time.warmup: 0.91,
time.bench: 2.56, round: 0.17 [+- 0.01]
{noformat}
And then comes 1.7; compare the JDK's implementation with anything else
(especially the {{time.bench}} for the {{single}} input data). It looks like
this is hardware-accelerated.
{noformat}
# -server -Xbatch -Xmx1024m
# 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
test_NTZ_JDK_random : 15/20 rounds, time.total: 0.79, time.warmup: 0.21,
time.bench: 0.58, round: 0.04 [+- 0.00]
test_NTZ_JDK_single : 15/20 rounds, time.total: 0.75, time.warmup: 0.20,
time.bench: 0.55, round: 0.04 [+- 0.00]
test_NTZ_BitUtil_random : 15/20 rounds, time.total: 0.98, time.warmup: 0.25,
time.bench: 0.72, round: 0.05 [+- 0.00]
test_NTZ_BitUtil_single : 15/20 rounds, time.total: 2.61, time.warmup: 0.66,
time.bench: 1.95, round: 0.13 [+- 0.00]
test_NTZ2_BitUtil_random : 15/20 rounds, time.total: 1.30, time.warmup: 0.33,
time.bench: 0.97, round: 0.06 [+- 0.00]
test_NTZ2_BitUtil_single : 15/20 rounds, time.total: 2.48, time.warmup: 0.61,
time.bench: 1.88, round: 0.13 [+- 0.00]
test_NTZ3_BitUtil_random : 15/20 rounds, time.total: 2.81, time.warmup: 0.70,
time.bench: 2.11, round: 0.14 [+- 0.00]
test_NTZ3_BitUtil_single : 15/20 rounds, time.total: 4.07, time.warmup: 1.02,
time.bench: 3.05, round: 0.20 [+- 0.00]
{noformat}
h2. Conclusions
It seems that any change introduced to these routines will hurt somebody in
some configuration, so it's really hard to make choices. I would definitely
opt for the precached {{pop2}} version on 32-bit systems, as it seems to be
always faster than, or as fast as, the other bit-counting options.
{{pop2}} looks like this:
{code}
private static final byte [] bcounts = new byte [0x10000];
static {
    // Precomputed bit counts for all 16-bit values.
    for (int i = 0x10000; --i >= 0;)
        bcounts[i] = (byte) Integer.bitCount(i);
}

public static int pop2(long v) {
    int t;
    return bcounts[(t = (int) v) & 0xffff]
         + bcounts[t >>> 16]
         + bcounts[(t = (int) (v >>> 32)) & 0xffff]
         + bcounts[t >>> 16];
}
{code}
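Plugged into a bit-counting method, {{pop2}} is used along these lines
({{popXor}} below is my sketch of the xor-then-count loop, not the exact
{{BitUtil.pop_xor}} source):

```java
public class PopXorSketch {
    private static final byte[] bcounts = new byte[0x10000];
    static {
        // Precomputed bit counts for all 16-bit values.
        for (int i = 0x10000; --i >= 0;)
            bcounts[i] = (byte) Integer.bitCount(i);
    }

    // Table-based popcount: four 16-bit lookups per long.
    public static int pop2(long v) {
        int t;
        return bcounts[(t = (int) v) & 0xffff]
             + bcounts[t >>> 16]
             + bcounts[(t = (int) (v >>> 32)) & 0xffff]
             + bcounts[t >>> 16];
    }

    // Count the bits that differ between the first wordCount words
    // of the two bit sets.
    public static long popXor(long[] a, long[] b, int wordCount) {
        long popCount = 0;
        for (int i = wordCount; --i >= 0;) {
            popCount += pop2(a[i] ^ b[i]);
        }
        return popCount;
    }
}
```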
As for the hardware-accelerated {{ntz}}: if one can detect 1.7, using the
JDK's version speeds things up significantly. But I have not checked how this
detection would affect speed if done at run-time (I assume a final static
flag wouldn't cause any performance penalty), and it is definitely not worth
the replacement for folks with older VMs.
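The kind of one-time detection I have in mind could look like the sketch
below. The version-string test and the constant name are illustrative only;
since the flag is a compile-time constant for HotSpot, the untaken branch
should be removed entirely by the JIT:

```java
public class NtzDispatch {
    // Resolved once at class load time; a static final boolean is a
    // compile-time constant for HotSpot, so the dead branch below
    // should be eliminated after JIT compilation.
    private static final boolean USE_JDK_NTZ =
        System.getProperty("java.version", "").compareTo("1.7") >= 0;

    public static int ntz(long x) {
        if (USE_JDK_NTZ) {
            // Intrinsified (hardware bsf) on 1.7 HotSpot.
            return Long.numberOfTrailingZeros(x);
        }
        // Fallback: table-free binary search for older VMs.
        if (x == 0) return 64;
        int n = 0;
        if ((x & 0xFFFFFFFFL) == 0) { n += 32; x >>>= 32; }
        if ((x & 0xFFFFL) == 0)     { n += 16; x >>>= 16; }
        if ((x & 0xFFL) == 0)       { n += 8;  x >>>= 8;  }
        if ((x & 0xFL) == 0)        { n += 4;  x >>>= 4;  }
        if ((x & 0x3L) == 0)        { n += 2;  x >>>= 2;  }
        if ((x & 0x1L) == 0)        { n += 1; }
        return n;
    }
}
```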
h2. Raw results data
I will attach the raw results to the issue if you want to draw your own
conclusions. I didn't have access to a SPARC machine or to any machine with
the newest Intel processors.