I assume you accidentally didn't post to the list so I'm quoting your email in full.

On 2021-02-02 Brett Okken wrote:
> > while ((i & 3) != 1 && i < end)
>
> Shouldn't that be (i & 3) != 0?
> An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.

The idea really is that offset of 1 doesn't enter the loop, thus the
main slicing-by-4 loop is misaligned. I don't know why it makes a
difference and I'm no longer even sure why I decided to try it. You can
try different (i & 3) != { 0, 1, 2, 3 } combinations.

> > If I change the buffer size from 8192 to 8191 in XZDecDemo.java,
> > then "Modified slicing-by-4" somehow becomes as fast as the
> > "Misaligned slicing-by-4". On the surface it sounds weird because
> > the buffer still has the same alignment, it's just one byte smaller
> > at the end.
>
> My guess is that this has to do with how many while loops need to be
> executed/optimized.
> Making it one byte smaller guarantees one of the additional while
> loops actually has to execute. Depending on the initial offset,
> potentially both need to execute.

Maybe you are right, but the confusing thing is that those while-loops
are supposedly slower than the for-loop. :-)

> > It would be nice if you could compare these too and suggest what
> > should be committed. Maybe you can figure out an even better
> > version. Different CPU or 32-bit Java or other things may give
> > quite different results.
>
> Truncating the crc to an int 1 time in the loop seems like a clear
> winner. I will play with this in my benchmark.
> My benchmark is calculating the crc64 of 8k of random bytes. I will
> change it to include misaligned read as well.

Thanks.

-- 
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
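
P.S. To see which starting offsets enter the head loop for each
(i & 3) != { 0, 1, 2, 3 } variant, something like the following rough
sketch may help. It only counts the bytes that the byte-at-a-time head
loop would consume before the 4-byte main loop starts; it does not
compute a CRC. The names HeadLoopDemo and headBytes are made up for
illustration and are not part of XZ for Java.

```java
class HeadLoopDemo {
    // Counts how many bytes a head loop of the form
    //     while ((i & 3) != target && i < end) i++;
    // steps over before the main 4-bytes-per-iteration loop would begin.
    // "target" is the hypothetical value compared against (i & 3).
    static int headBytes(int off, int end, int target) {
        int i = off;
        while ((i & 3) != target && i < end) {
            i++;
        }
        return i - off;
    }

    public static void main(String[] args) {
        int end = 8192; // same size as the buffer discussed above
        for (int target = 0; target < 4; target++) {
            StringBuilder sb = new StringBuilder("(i & 3) != " + target + ":");
            for (int off = 0; off < 4; off++) {
                sb.append("  off=").append(off)
                  .append(" head=").append(headBytes(off, end, target));
            }
            System.out.println(sb);
        }
    }
}
```

With target 1, an offset of 0 takes one step through the head loop and
an offset of 1 takes none, so the main loop then starts at an index
that is not a multiple of four.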