On Feb 24, 2015, at 2:48 PM, Andrew Haley <a...@redhat.com> wrote:
>>
>> I am all for keeping more code in Java if we can. I don't know enough about
>> assembler-based optimizations to determine if it might be possible to do
>> better on certain CPU architectures.
>
> Me either, but I have tested this on the architectures I have, and I
> suspect that C2 optimization is good enough. And we'd have to write
> assembly code for machines we haven't got; something for the future, I
> think.
>
Agreed.

>> One advantage, AFAIU, to intrinsics is that they are not subject to the vagaries
>> of inlining thresholds. It's important that the loops operating over the
>> arrays be compiled efficiently, otherwise performance can drop off a
>> cliff if thresholds are reached within the loop. Perhaps these methods are
>> small enough that it is not an issue? And perhaps that is not a sufficient
>> argument to justify the cost of an intrinsic anyway (and we should really be
>> tweaking the inlining mechanism)?
>
> Maybe so. There are essentially two ways to do this: new compiler
> node types which punt everything to the back end (and therefore
> require back-end authors to write them) or generic expanders, which is
> how many of the existing intrinsics are done. Generic C2 code would,
> I suspect, be worse than writing this in Java because it would be
> lacking profile data.
>
>> With that in mind, is there any need to intrinsify the new methods at all,
>> given that the new Java methods can defer to the older ones based on a
>> constant check? Also, should that be done anyway for the interpreter?
>>
>>
>> private static final boolean IS_UNALIGNED = theUnsafe.unalignedAccess();
>>
>> public void putIntUnaligned(Object o, long offset, int x) {
>>     if (IS_UNALIGNED || (offset & 3) == 0) {
>>         putInt(o, offset, x);
>>     } else if (byteOrder == BIG_ENDIAN) {
>>         putIntB(o, offset, x);
>>     } else {
>>         putIntL(o, offset, x);
>>     }
>> }
>
> Yes. It certainly could be done like this but I think C1 doesn't do
> the optimization to remove the IS_UNALIGNED test, so we'd still want
> the C1 builtins.
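As an aside, my mental model of the byte-wise putIntB/putIntL fallbacks referred
to above is along these lines -- just a sketch of the idea (assuming they sit
next to putIntUnaligned in Unsafe and use the existing single-byte putByte), not
necessarily what is in the webrev:

    private void putIntB(Object o, long offset, int x) {
        // Big-endian fallback: most significant byte at the lowest address,
        // using only single-byte stores so no alignment is required.
        putByte(o, offset,     (byte) (x >>> 24));
        putByte(o, offset + 1, (byte) (x >>> 16));
        putByte(o, offset + 2, (byte) (x >>>  8));
        putByte(o, offset + 3, (byte)  x);
    }

    private void putIntL(Object o, long offset, int x) {
        // Little-endian fallback: least significant byte at the lowest address.
        putByte(o, offset,     (byte)  x);
        putByte(o, offset + 1, (byte) (x >>>  8));
        putByte(o, offset + 2, (byte) (x >>> 16));
        putByte(o, offset + 3, (byte) (x >>> 24));
    }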
Hmm.... if I run the following code on my Mac:

import java.nio.ByteOrder;

public class Y {
    static final boolean X = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN; // Just get a constant from somewhere...

    public static void main(String[] args) {
        Y y = new Y();
        int s = 0;
        for (int i = 0; i < 1000000; i++) {
            s += y.x(i);
        }
        System.out.println(s);
    }

    public int x(int i) {
        if (X || (i & 3) == 0) {
            return i + i;
        } else {
            return Integer.sum(i, i);
        }
    }
}

with the options:

  -XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

then Y.x is compiled to:

  0x000000011155ea80: mov    %eax,-0x14000(%rsp)
  0x000000011155ea87: push   %rbp
  0x000000011155ea88: sub    $0x30,%rsp          ;*getstatic X
                                                 ; - Y::x@0 (line 43)
  0x000000011155ea8c: mov    %rdx,%rax
  0x000000011155ea8f: add    %edx,%eax
  0x000000011155ea91: add    $0x30,%rsp
  0x000000011155ea95: pop    %rbp
  0x000000011155ea96: test   %eax,-0x62ca9c(%rip)  # 0x0000000110f32000
                                                 ; {poll_return}
  0x000000011155ea9c: retq

For tiered compilation I do see code generated with a test, jump and de-opt:

  0x00000001112fe020: mov    %eax,-0x14000(%rsp)
  0x00000001112fe027: push   %rbp
  0x00000001112fe028: sub    $0x30,%rsp
  0x00000001112fe02c: mov    $0x125633860,%rax   ; {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe036: mov    0xdc(%rax),%edi
  0x00000001112fe03c: add    $0x8,%edi
  0x00000001112fe03f: mov    %edi,0xdc(%rax)
  0x00000001112fe045: mov    $0x125633508,%rax   ; {metadata({method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe04f: and    $0x1ff8,%edi
  0x00000001112fe055: cmp    $0x0,%edi
  0x00000001112fe058: je     0x00000001112fe07f  ;*getstatic X
                                                 ; - Y::x@0 (line 43)
  0x00000001112fe05e: mov    $0x125633860,%rax   ; {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe068: incl   0x118(%rax)         ;*ifne
                                                 ; - Y::x@3 (line 43)
  0x00000001112fe06e: mov    %rdx,%rax
  0x00000001112fe071: add    %edx,%eax
  0x00000001112fe073: add    $0x30,%rsp
  0x00000001112fe077: pop    %rbp
  0x00000001112fe078: test   %eax,-0x62f07e(%rip)  # 0x0000000110ccf000
                                                 ; {poll_return}
  0x00000001112fe07e: retq
  0x00000001112fe07f: mov    %rax,0x8(%rsp)
  0x00000001112fe084: movq   $0xffffffffffffffff,(%rsp)
  0x00000001112fe08c: callq  0x0000000110e63c60  ; OopMap{rsi=Oop off=145}
                                                 ;*synchronization entry
                                                 ; - Y::x@-1 (line 43)
                                                 ; {runtime_call}

But the method later gets compiled to the shape of the former code, e.g.:

    471   16       3       Y::x (22 bytes)
    472   17       4       Y::x (22 bytes)
    472   16       3       Y::x (22 bytes)   made not entrant

> Perhaps we could do without the C2 builtins but they
> cost very little, they save C2 a fair amount of work, and they remove
> the vagaries of inlining.

'Tis true, the changes are small.

> I take your point about the interpreter,
> though.
>
>> I see you optimized the unaligned getLong by reading two aligned longs and
>> then bit twiddled. It seems harder to optimize the putLong by straddling an
>> aligned putInt with one to three required putByte.
>
> Sure, that's always a possibility. I have code to do it but it was
> all getting rather complicated for my taste.

I thought it might.
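For reference, the two-aligned-longs read I am referring to is, as I understand
it, roughly the following (little-endian case only; a sketch of the idea, not
the actual code in the webrev):

    // Sketch only: assumes (offset & 7) != 0, i.e. the aligned fast path has
    // already been taken, and that both covering words can be read safely.
    public long getLongUnaligned(Object o, long offset) {
        long base  = offset & ~7L;             // round down to an 8-byte boundary
        int  shift = (int) (offset & 7) << 3;  // byte offset within the word, in bits
        long lo = getLong(o, base);            // aligned read of the first covering word
        long hi = getLong(o, base + 8);        // aligned read of the second covering word
        return (lo >>> shift) | (hi << (64 - shift));
    }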
>>> Also, these methods have the additional benefit that they are always atomic
>>> as long as the data are naturally aligned.
>>
>> We should probably document that in general access is not guaranteed to be
>> atomic, and that it is an implementation detail that it currently is when
>> naturally aligned.
>
> I think that's a good idea. The jcstress tests already come up with a
> warning that the implementation is not atomic; this is not required,
> but a high-quality implementation will be.
>
>>> This does result in rather a lot of code for the methods for all sizes and
>>> endiannesses, but none of it will be used on machines with unaligned
>>> hardware support except in the interpreter. (Perhaps the interpreter too
>>> could have intrinsics?)
>>>
>>> I have changed HeapByteBuffer to use these methods, with a major
>>> performance improvement. I've also provided Unsafe methods to query
>>> endianness and alignment support.
>>
>> If we expose the endianness query via a new method in Unsafe we should reuse
>> that in java.nio.Bits and get rid of the associated static code block.
>
> Sure, I already did that.
>

Locally, I guess? (Just in case I missed something in the current webrev.)

Paul.
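Ps. the kind of simplification I have in mind for java.nio.Bits, assuming the
new Unsafe query ends up looking something like isBigEndian() (the method name
here is a guess on my part) and that Bits keeps its existing unsafe field:

    // Sketch only: replaces the static block in Bits that currently works out
    // the byte order by probing memory at startup.
    private static final ByteOrder byteOrder =
        unsafe.isBigEndian() ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;

    static ByteOrder byteOrder() {
        return byteOrder;
    }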