Hi folks,
TLDR: there might be a faster way to compare byte buffers than ByteBufferUtils.
I thought it may be of interest to others if I share my findings about
row key comparison performance. Row key comparison can represent a
significant part of a RegionServer's CPU cycles. In CPU profiles taken
at my company, 2% of CPU cycles is typical, but some extreme cases go
past 20%. If there is a way to speed this up, it would be great. I'm
also aware that the current row key comparator, which uses Java's
Unsafe, is holding back parts of the HBase build on Java 8.
I did some micro-benchmarks, focused on comparing direct ByteBuffers.
I tested HBase's ByteBufferUtils against the JDK's
ByteBuffer.mismatch() method, and against the incubating Vector API.
ByteBufferUtils compares 8 bytes per loop iteration, but both of these
alternative methods can compare more bytes per iteration.
I ran my benchmark on two Amazon Web Services servers. One was type
i8g.large, which has two Neoverse V2 arm64 cores. The other was
i7ie.large, which has two Emerald Rapids x86-64 cores. Both of these
CPUs are about the latest and greatest that you can get in their
respective architectures. My JVM was Temurin-25+36. The benchmark code
is pasted below, which is adapted from a test in the OpenJDK source
tree.
Scores to follow. The scores are the number of byte array comparisons
done per second. Higher is better. I'm just showing results for
buffers of size 257, which is long enough to benefit from
vectorization, but not very long, and I think represents a realistic
row key length. The first half of each buffer being compared always
contains matching bytes, and the second halves do not.
Benchmark Mode Cnt Score Error Units
(arm64)
byteBufferMismatch thrpt 25 86,281,436.335 ± 890778.433 ops/s
# no JVM intrinsic, compares 8 bytes per loop in Java, but C2 may
vectorize it further, not sure
byteBufferUtils thrpt 25 132,360,665.737 ± 765164.120 ops/s
# compares 8 bytes per loop, confirmed that C2 does not vectorize it
further
vector thrpt 25 81,494,739.828 ± 2139303.751 ops/s
# uses NEON registers, compares 16 bytes per loop
(x86-64)
byteBufferMismatch thrpt 25 69,825,596.046 ± 1136949.216 ops/s
# JVM intrinsic, uses AVX512 registers, compares 64 bytes per loop
byteBufferUtils thrpt 25 20,096,803.307 ± 785102.258 ops/s
# compares 8 bytes per loop
vector thrpt 25 69,006,936.897 ± 617447.660 ops/s
# I assume uses AVX512 registers, compares 64 bytes per loop
On x86, ByteBuffer.mismatch() and the Vector API perform the best. On
arm64, ByteBufferUtils performs the best. Every arm64 score is better
than every x86 score, which I guess says good things about the
Neoverse processors. The Vector API doesn't beat
ByteBuffer.mismatch(), so no need to consider that further. I assume
that the most popular architecture for running HBase is x86, so x86
users might appreciate a switch to ByteBuffer.mismatch(), which has
been available since Java 11. Selfishly, this would be bad for my
company, which primarily uses arm64. I want to provide this
information in case it's useful to spark a discussion.
Charles Connell
Hubspot Inc
Benchmark code:
/*
* Copyright (c) 2021, 2024, Oracle and/or its affiliates. All rights reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 only, as
* published by the Free Software Foundation.
*
* This code is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* version 2 for more details (a copy is included in the LICENSE file that
* accompanied this code).
*
* You should have received a copy of the GNU General Public License version
* 2 along with this work; if not, write to the Free Software Foundation,
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*
* Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
* or visit www.oracle.com if you need additional information or have any
* questions.
*
*/
package cconnell;
import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.random.RandomGenerator;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import org.apache.hadoop.hbase.util.ByteBufferUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.RunnerException;
@State(Scope.Benchmark)
@Fork(jvmArgs = {"--add-modules=jdk.incubator.vector",
"-Dsun.misc.unsafe.memory.access=allow"})
public class ByteArrayCompareBenchmark {
public static void main(String[] args) throws RunnerException, IOException {
org.openjdk.jmh.Main.main(args);
}
@Param({ "10", "257" })
private int size;
private ByteBuffer buffer1;
private ByteBuffer buffer2;
private MemorySegment memorySegment1;
private MemorySegment memorySegment2;
private static final VectorSpecies<Byte> BYTE_SPECIES_PREFERRED =
ByteVector.SPECIES_PREFERRED;
@Setup
public void setup() {
RandomGenerator random = RandomGenerator.getDefault();
int common = (int) (0.5 * size);
byte[] commonBytes = new byte[common];
random.nextBytes(commonBytes);
buffer1 = ByteBuffer.allocateDirect(common + size);
buffer1.put(commonBytes);
byte[] arr = new byte[size];
random.nextBytes(arr);
buffer1.put(arr);
buffer1.flip();
memorySegment1 = MemorySegment.ofBuffer(buffer1);
buffer2 = ByteBuffer.allocateDirect(common + size);
buffer2.put(commonBytes);
arr = new byte[size];
random.nextBytes(arr);
buffer2.put(arr);
buffer2.flip();
memorySegment2 = MemorySegment.ofBuffer(buffer2);
}
@Benchmark
public int byteBufferMismatch() {
int index = buffer1.mismatch(buffer2);
if (index >= 0) {
return (buffer1.get(index) & 255) - (buffer2.get(index) & 255);
} else {
return 0;
}
}
@Benchmark
public int byteBufferUtils() {
return ByteBufferUtils.compareTo(
buffer1,
0,
buffer1.limit(),
buffer2,
0,
buffer2.limit()
);
}
@Benchmark
public int vector() {
int length = Math.min(buffer1.limit(), buffer2.limit());
int offset = 0;
for (
;
offset < BYTE_SPECIES_PREFERRED.loopBound(length);
offset += BYTE_SPECIES_PREFERRED.length()
) {
ByteVector vector1 = ByteVector.fromMemorySegment(
BYTE_SPECIES_PREFERRED,
memorySegment1,
offset,
ByteOrder.LITTLE_ENDIAN
);
ByteVector vector2 = ByteVector.fromMemorySegment(
BYTE_SPECIES_PREFERRED,
memorySegment2,
offset,
ByteOrder.LITTLE_ENDIAN
);
VectorMask<Byte> mask = vector1.compare(VectorOperators.NE, vector2);
if (mask.anyTrue()) {
int index = offset + mask.firstTrue();
return (buffer1.get(index) & 255) - (buffer2.get(index) & 255);
}
}
// process the tail
int mismatch = -1;
for (int i = offset; i < length; ++i) {
if (buffer1.get(i) != buffer2.get(i)) {
mismatch = i;
break;
}
}
if (mismatch >= 0) {
return (buffer1.get(mismatch) & 255) - (buffer2.get(mismatch) & 255);
} else {
return 0;
}
}
}