Hello! I need to make a new release in the near future so that a minor problem can be fixed in .7z support in Apache Commons Compress. I thought I could include simpler and safer changes from your long list of patches and the CRC64 improvement might be such.
On 2021-01-21 Brett Okken wrote: > Here is a slice by 4 implementation. It goes byte by byte to easily be > compatible with older jdks. Performance wise, it is pretty comparable > to the java port of Adler's stackoverflow implementation: > > Benchmark Mode Cnt Score Error Units > Hash64Benchmark.adler avgt 5 6850.172 ± 251.528 ns/op > Hash64Benchmark.crc64 avgt 5 16347.986 ± 53.702 ns/op > Hash64Benchmark.slice4 avgt 5 6842.010 ± 393.149 ns/op Thank you! I played around a bit. Seems that the code is *really* sensitive to tiny changes. It's possible that it depends on the computer and such things too; I only tried on one machine. I timed decompression of gigabyte of null bytes using XZDecDemo and OpenJDK 15 on x86-64. This isn't very accurate but it's enough to sort them: Original 6.8 s Modified original 6.2 s Your slicing-by-4 5.8 s Modified slicing-by-4 5.6 s Misaligned slicing-by-4 5.2 s xz -t 3.6 s Modified original: --- a/src/org/tukaani/xz/check/CRC64.java +++ b/src/org/tukaani/xz/check/CRC64.java @@ -38,7 +38,8 @@ public class CRC64 extends Check { int end = off + len; while (off < end) - crc = crcTable[(buf[off++] ^ (int)crc) & 0xFF] ^ (crc >>> 8); + crc = crcTable[(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)] + ^ (crc >>> 8); } public byte[] finish() { Modified slicing-by-4: public void update(byte[] buf, int off, int len) { final int end = off + len; int i = off; for (int end4 = end - 3; i < end4; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } while (i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); } Misaligned slicing-by-4 adds an extra while-loop to the beginning: public void update(byte[] buf, int off, int len) { final int end = off + len; int i = off; while ((i & 3) != 1 && i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); for (int end4 = end - 3; i < end4; i += 4) { final int tmp = (int)crc; crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^ TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0XFF)] ^ (crc >>> 32) ^ TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0XFF)] ^ TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0XFF)]; } while (i < end) crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)] ^ (crc >>> 8); } If I change the buffer size from 8192 to 8191 in XZDecDemo.java, then "Modified slicing-by-4" somehow becomes as fast as the "Misaligned slicing-by-4". On the surface it sounds weird because the buffer still has the same alignment, it's just one byte smaller at the end. The same thing happens too if the buffer size is kept at 8192 but first byte isn't used (making the beginning of the buffer misaligned). Moving the "(crc32 >> 32)" to a different position in the xor sequence can affect things too... it's almost spooky. ;-) It would be nice if you could compare these too and suggest what should be committed. Maybe you can figure out an even better version. Different CPU or 32-bit Java or other things may give quite different results. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode