I accidentally hit reply instead of reply all.

> > Shouldn't that be (i & 3) != 0?
> > An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.
>
> The idea really is that offset of 1 doesn't enter the loop, thus the
> main slicing-by-4 loop is misaligned. I don't know why it makes a
> difference and I'm no longer even sure why I decided to try it. You can
> try different (i & 3) != { 0, 1, 2, 3 } combinations.

I misunderstood your intent. I thought you were trying to get the for
loop onto 4-byte alignment.

I updated the benchmark to test with offsets [0,1,2] and with the length
also reduced by an additional [0,1,2]. This should provide a good mix of
inputs that need alignment at the beginning and leave extra bytes at
the end.

So far I have only tested on JDK 11 on 64-bit Windows, but the fairly
clear winner is:

    public void update(byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;
        if (len > 3) {
            // Head: cases fall through on purpose.
            switch (i & 3) {
                case 3:
                    crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
                case 2:
                    crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
                case 1:
                    crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
            }
            // Main slicing-by-4 loop.
            for (int j = end - 3; i < j; i += 4) {
                final int tmp = (int) crc;
                crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
                      TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)] ^
                      (crc >>> 32) ^
                      TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)] ^
                      TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
            }
        }
        // Tail: the remaining 0 to 3 bytes, again with fall-through.
        switch ((end - i) & 3) {
            case 3:
                crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
            case 2:
                crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
            case 1:
                crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
        }
    }

Brett
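
P.S. For anyone who wants to confirm that the switch-based variant above gives the same values as a plain byte-at-a-time loop, here is a rough standalone check. It is only a sketch and not the benchmark itself. The class name Crc64SliceCheck, the method names, the random seed, and the buffer size are all made up for illustration. The table construction assumes the reflected ECMA-182 polynomial used for CRC64 in .xz files, and it sweeps more offsets and lengths than the [0,1,2] combinations mentioned above.

```java
import java.util.Random;

final class Crc64SliceCheck {
    // Assumption: reflected ECMA-182 polynomial, as used by the .xz CRC64.
    private static final long POLY = 0xC96C5795D7870F42L;
    private static final long[][] TABLE = new long[4][256];

    static {
        // TABLE[0] is the ordinary bytewise table; TABLE[s] advances
        // each TABLE[s - 1] entry by eight more bit steps.
        for (int s = 0; s < 4; s++) {
            for (int b = 0; b < 256; b++) {
                long r = (s == 0) ? b : TABLE[s - 1][b];
                for (int k = 0; k < 8; k++) {
                    r = ((r & 1) != 0) ? (r >>> 1) ^ POLY : r >>> 1;
                }
                TABLE[s][b] = r;
            }
        }
    }

    // One bytewise step on the raw (uninverted) CRC state.
    private static long step(long crc, byte b) {
        return TABLE[0][(b ^ (int) crc) & 0xFF] ^ (crc >>> 8);
    }

    // Reference implementation: one table lookup per byte.
    static long bytewise(long crc, byte[] buf, int off, int len) {
        for (int i = off, end = off + len; i < end; i++) {
            crc = step(crc, buf[i]);
        }
        return crc;
    }

    // Same structure as the update() in the message above, written as a
    // pure function of the CRC state so the two paths can be compared.
    static long sliced(long crc, byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;
        if (len > 3) {
            switch (i & 3) { // cases fall through on purpose
                case 3: crc = step(crc, buf[i++]);
                case 2: crc = step(crc, buf[i++]);
                case 1: crc = step(crc, buf[i++]);
            }
            for (int j = end - 3; i < j; i += 4) {
                final int tmp = (int) crc;
                crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)]
                        ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)]
                        ^ (crc >>> 32)
                        ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)]
                        ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
            }
        }
        switch ((end - i) & 3) { // cases fall through on purpose
            case 3: crc = step(crc, buf[i++]);
            case 2: crc = step(crc, buf[i++]);
            case 1: crc = step(crc, buf[i++]);
        }
        return crc;
    }

    // Counts (offset, length) pairs where the two paths disagree.
    static int mismatches() {
        byte[] data = new byte[128];
        new Random(42).nextBytes(data);
        int bad = 0;
        for (int off = 0; off < 8; off++) {
            for (int len = 0; off + len <= data.length; len++) {
                if (sliced(-1L, data, off, len) != bytewise(-1L, data, off, len)) {
                    bad++;
                }
            }
        }
        return bad;
    }

    // Finalized CRC64 of a whole buffer: initial state and final XOR are all ones.
    static long crc64(byte[] buf) {
        return ~sliced(-1L, buf, 0, buf.length);
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + mismatches());
    }
}
```

A nonzero count from mismatches() would point at a problem in the head or tail switch for some offset and length pair. It only checks correctness, so it doesn't help explain the speed difference.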