You're right, managed to convince myself here as well. Will change it to <=.
An unfortunate side-effect seems to be that in a benchmark with 1024
long array it seems like the performance drops when the tail loop is not
utilized.
As you can see below there is a slight drop when the data is perfectly
aligned in the array causing neither alignment or tail loop to be utilized.
Not going to dig into why right now, but an interesting effect.
@Benchmark
@OperationsPerInvocation(1024)
public void byteArray_1K() {
checksum.update(unalignedByteArray_1k, offset, 1024);
}
Benchmark (offset) Mode Samples Score
Score error Units
o.c.CRC32CUnaligned.byteArray_1K 0 thrpt 5 983198.048
654.467 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 1 thrpt 5 1013136.714
3801.537 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 2 thrpt 5 1006899.265
10998.612 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 3 thrpt 5 1014419.366
3745.875 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 4 thrpt 5 1007777.937
1585.989 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 5 thrpt 5 1011249.564
852.333 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 6 thrpt 5 1013622.327
4075.303 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 7 thrpt 5 1008137.616
1672.262 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 8 thrpt 5 985838.295
448.735 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 9 thrpt 5 1005935.356
4250.381 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 10 thrpt 5 995912.149
2447.272 ops/ms
With just using < then the performance is equal across all alignments.
Benchmark (offset) Mode Samples Score
Score error Units
o.c.CRC32CUnaligned.byteArray_1K 0 thrpt 5 1003094.419
2656.518 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 1 thrpt 5 1009641.996
2519.614 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 2 thrpt 5 1009169.464
12987.786 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 3 thrpt 5 1006869.144
1617.988 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 4 thrpt 5 1005490.644
1234.283 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 5 thrpt 5 1007468.090
327.541 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 6 thrpt 5 1014316.244
1174.920 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 7 thrpt 5 1008378.002
1440.734 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 8 thrpt 5 1003253.408
10786.210 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 9 thrpt 5 1000950.362
250.617 ops/ms
o.c.CRC32CUnaligned.byteArray_1K 10 thrpt 5 1003783.897
11333.132 ops/ms
//Staffan
On 10/21/2014 10:56 PM, Peter Levart wrote:
On 10/21/2014 11:34 PM, Staffan Friberg wrote:
I believe it must be <, as it is in the tail loop as well, because
end is (off+len or limit) so end is exclusive, similar to
subString(begin,end).
Makes sense?
//Staffan
On 10/21/2014 01:46 PM, Peter Levart wrote:
Sorry Staffan, another nit...
212 for (; off < (end - Long.BYTES); off += Long.BYTES) {
and
286 for (; off < (end - Long.BYTES); off += Long.BYTES) {
I think you could change "off < (end - Long.BYTES)" to "off <= (end
- Long.BYTES)". Am I right?
The tail loop has < :
319 for (; off < end; off++) {
...but it could be written as:
for (; off <= (end - Byte.BYTES); off += Byte.BYTES) { ...
;-)
In other words, when off == end - Long.BYTES, you can still read
Long.BYTES starting at 'off' .
Regards, Peter
Regards, Peter
On 10/21/2014 10:30 PM, Peter Levart wrote:
On 10/21/2014 08:49 PM, Staffan Friberg wrote:
Hi,
Got an offline comment that the package.html should be update as
well to cover CRC-32C.
Otherwise there are no code changes in this new webrev.
http://cr.openjdk.java.net/~sfriberg/JDK-6321472/webrev.04
Hi Staffan,
198 if (end - off >= 8 &&
Unsafe.ARRAY_BOOLEAN_INDEX_SCALE == 1) {
ARRAY_BOOLEAN_INDEX_SCALE -> ARRAY_BYTE_INDEX_SCALE
Otherwise looks good now.
Regards, Peter
P.S.
I think (by looking at DirectByteBuffer.asIntBuffer()
implementation) that when doing 32 bit (4 byte) reads using Unsafe,
the address only has to be aligned to 4 bytes (8 is necessary
alignment for 64 bit reads). So updateDirectByteBuffer could make
this alignment on 4 bytes as it's only using 32 bit reads (with
additional check on ADDRESS_SIZE, you could do that for updateBytes
too).
You don't get much out of it, so you decide if it's worth
complication.
//Staffan
On 10/21/2014 10:28 AM, Staffan Friberg wrote:
Hi Peter,
Thanks for the comments..
217 if (Unsafe.ADDRESS_SIZE == 4) {
218 // On 32 bit platforms read two ints
instead of a single 64bit long
When you're reading from byte[] using Unsafe (updateBytes), you
have the option of reading 64bit values on 64bit platforms. When
you're reading from DirectByteBuffer memory
(updateDirectByteBuffer), you're only using 32bit reads.
I will add a comment in the code for this decision. The reason is
that read a long results in slightly worse performance in this
case, in updateBytes it is faster. I was able to get it to run
slightly faster by working directly with the address instead of
always adding address + off, but this makes things worse in the
32bit case since all calculation will now be using long
variables. So using the getInt as in the current code feels like
the best solution as it strikes the best balance between 32 and
64bit. Below is how updateByteBuffer looked with the rewrite I
mentioned.
ong address = ((DirectBuffer) buffer).address();
crc = updateDirectByteBuffer(crc, address + pos, address + limit);
private static int updateDirectByteBuffer(int crc, long adr,
long end) {
// Do only byte reads for arrays so short they can't be
aligned
if (end - adr >= 8) {
// align on 8 bytes
int alignLength = (8 - (int) (adr & 0x7)) & 0x7;
for (long alignEnd = adr + alignLength; adr <
alignEnd; adr++) {
crc = (crc >>> 8)
^ byteTable[(crc ^ UNSAFE.getByte(adr)) &
0xFF];
}
if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
crc = Integer.reverseBytes(crc);
}
// slicing-by-8
for (; adr < (end - Long.BYTES); adr += Long.BYTES) {
int firstHalf;
int secondHalf;
if (Unsafe.ADDRESS_SIZE == 4) {
// On 32 bit platforms read two ints instead
of a single 64bit long
firstHalf = UNSAFE.getInt(adr);
secondHalf = UNSAFE.getInt(adr + Integer.BYTES);
} else {
long value = UNSAFE.getLong(adr);
if (ByteOrder.nativeOrder() ==
ByteOrder.LITTLE_ENDIAN) {
firstHalf = (int) value;
secondHalf = (int) (value >>> 32);
} else { // ByteOrder.BIG_ENDIAN
firstHalf = (int) (value >>> 32);
secondHalf = (int) value;
}
}
crc ^= firstHalf;
if (ByteOrder.nativeOrder() ==
ByteOrder.LITTLE_ENDIAN) {
crc = byteTable7[crc & 0xFF]
^ byteTable6[(crc >>> 8) & 0xFF]
^ byteTable5[(crc >>> 16) & 0xFF]
^ byteTable4[crc >>> 24]
^ byteTable3[secondHalf & 0xFF]
^ byteTable2[(secondHalf >>> 8) & 0xFF]
^ byteTable1[(secondHalf >>> 16) & 0xFF]
^ byteTable0[secondHalf >>> 24];
} else { // ByteOrder.BIG_ENDIAN
crc = byteTable0[secondHalf & 0xFF]
^ byteTable1[(secondHalf >>> 8) & 0xFF]
^ byteTable2[(secondHalf >>> 16) & 0xFF]
^ byteTable3[secondHalf >>> 24]
^ byteTable4[crc & 0xFF]
^ byteTable5[(crc >>> 8) & 0xFF]
^ byteTable6[(crc >>> 16) & 0xFF]
^ byteTable7[crc >>> 24];
}
}
if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
crc = Integer.reverseBytes(crc);
}
}
// Tail
for (; adr < end; adr++) {
crc = (crc >>> 8)
^ byteTable[(crc ^ UNSAFE.getByte(adr)) & 0xFF];
}
return crc;
}
Also, in updateBytes, the usage of
Unsafe.ARRAY_INT_INDEX_SCALE/ARRAY_LONG_INDEX_SCALE to index a
byte array sounds a little scary. To be ultra portable you could
check that ARRAY_BYTE_INDEX_SCALE == 1 first and refuse to use
Unsafe for byte arrays if it is not 1. Then use
Integer.BYTES/Long.BYTES to manipulate 'offsets' instead. In
updateDirectByteBuffer it would be more appropriate to use
Integer.BYTES/Long.BYTES too.
Good idea. Added a check in the initial if statement and it will
get automatically optimized away.
225 firstHalf = (int) (value & 0xFFFFFFFF);
226 secondHalf = (int) (value >>> 32);
227 } else { // ByteOrder.BIG_ENDIAN
228 firstHalf = (int) (value >>> 32);
229 secondHalf = (int) (value &
0xFFFFFFFF);
firstHalf = (int) value; // this is equivalent for line 225
secondHalf = (int) value; // this is equivalent for line 229
Done.
Here is the latest webrev,
http://cr.openjdk.java.net/~sfriberg/JDK-6321472/webrev.03
Cheers,
Staffan