I've been measuring the performance after this patch, and as you might expect it's always much better with UseUnalignedAccesses.

However, we can sometimes get performance regressions, albeit in some fairly contrived cases. I have a test which repeatedly loads a {long,int,short} at some random offset in a ByteBuffer, XORs some random value into it, and stores the result back in the same place. The ByteBuffer is 1k long, so it fits nicely into the L1 cache.

The old algorithm always loads and stores a long as 8 separate bytes. The new algorithm does an N-way branch, always loading and storing subwords according to their natural alignment. So, if the address is random and the size is long, it will access 8 bytes 50% of the time, 4 shorts 25% of the time, 2 ints 12.5% of the time, and 1 long 12.5% of the time. In other words, every random load/store goes through a 4-way branch, and the new algorithm is slightly slower because of branch misprediction:

  old: 2.17 IPC, 0.08% branch-misses, 91,965,281,215 cycles
  new: 1.23 IPC, 6.11% branch-misses, 99,925,255,682 cycles

...but it executes fewer instructions, so we're only talking about a slowdown of some 10%. I think this is the worst case (or something close to the worst case) for the new algorithm, so I think we're OK performance-wise.

John: I'm waiting for an answer to my question here before I submit a webrev for approval.

http://mail.openjdk.java.net/pipermail/panama-dev/2015-March/000099.html

Andrew.
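
P.S. In case the prose above isn't clear, here is a rough sketch of what I mean. None of these names come from the actual patch; it's just the shape of the 4-way alignment dispatch for a long access and the random-offset XOR test loop, written against the public ByteBuffer API with little-endian assumed:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: reproduces the shape of the alignment dispatch and the
// random-offset XOR loop described above, on top of the public
// ByteBuffer API (little-endian assumed throughout).
public class UnalignedAccessSketch {

    // Store an 8-byte value at 'index' using the widest accesses its
    // natural alignment allows: 1 long, 2 ints, 4 shorts or 8 bytes.
    static void putLongDispatch(ByteBuffer bb, int index, long v) {
        if ((index & 7) == 0) {            // 8-byte aligned: 12.5% of random offsets
            bb.putLong(index, v);
        } else if ((index & 3) == 0) {     // 4-byte aligned only: 12.5%
            bb.putInt(index,     (int) v);
            bb.putInt(index + 4, (int) (v >>> 32));
        } else if ((index & 1) == 0) {     // 2-byte aligned only: 25%
            for (int i = 0; i < 4; i++)
                bb.putShort(index + 2 * i, (short) (v >>> (16 * i)));
        } else {                           // odd address: 50%, byte by byte
            for (int i = 0; i < 8; i++)
                bb.put(index + i, (byte) (v >>> (8 * i)));
        }
    }

    // The matching load, reassembling the value little-endian.
    static long getLongDispatch(ByteBuffer bb, int index) {
        if ((index & 7) == 0) {
            return bb.getLong(index);
        } else if ((index & 3) == 0) {
            return (bb.getInt(index) & 0xFFFFFFFFL)
                    | ((long) bb.getInt(index + 4) << 32);
        } else if ((index & 1) == 0) {
            long v = 0;
            for (int i = 0; i < 4; i++)
                v |= (bb.getShort(index + 2 * i) & 0xFFFFL) << (16 * i);
            return v;
        } else {
            long v = 0;
            for (int i = 0; i < 8; i++)
                v |= (bb.get(index + i) & 0xFFL) << (8 * i);
            return v;
        }
    }

    public static void main(String[] args) {
        // 1k buffer so the whole thing stays in L1.
        ByteBuffer bb = ByteBuffer.allocate(1024).order(ByteOrder.LITTLE_ENDIAN);
        ThreadLocalRandom rnd = ThreadLocalRandom.current();

        // Inner loop of the test: load a long at a random offset,
        // XOR a random value into it, store it back in the same place.
        for (long i = 0; i < 100_000_000L; i++) {
            int index = rnd.nextInt(1024 - 7);
            long v = getLongDispatch(bb, index) ^ rnd.nextLong();
            putLongDispatch(bb, index, v);
        }

        // Print something so the loop can't be optimized away entirely.
        System.out.println(Long.toHexString(getLongDispatch(bb, 0)));
    }
}

The real code of course lives inside the JDK rather than on top of ByteBuffer, but the per-access 4-way branch, and hence the misprediction cost on random offsets, has the same shape.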
However, we can sometimes get performance regressions, albeit in some fairly contrived cases. I have a test which repeatedly loads a {long,int,short} at some random offset in a ByteBuffer, XORs some random value into it, and stores the result back in the same place. This ByteBuffer is 1k long, so fits nicely into L1 cache. The old algorithm always loads and stores a long as 8 bytes. The new algorithm does an N-way branch, always loading and storing subwords according to their natural alignment. So, if the address is random and the size is long it will access 8 bytes 50% of the time, 4 shorts 25% of the time, 2 ints 12.5% of the time, and 1 long 12.5% of the time. So, for every random load/store we have a 4-way branch. The new algorithm is slightly slower because of branch misprediction. old: 2.17 IPC, 0.08% branch-misses, 91,965,281,215 cycles new: 1.23 IPC, 6.11% branch-misses, 99,925,255,682 cycles ...but it executes fewer instructions so we're only talking about some 10% slowdown. I think this is the worst case (or something close to the worst case) for the new algorithm. So, I think we're OK performance-wise. John: I'm waiting for an answer to my question here before I submit a webrev for approval. http://mail.openjdk.java.net/pipermail/panama-dev/2015-March/000099.html Andrew.