On Feb 24, 2015, at 2:48 PM, Andrew Haley <a...@redhat.com> wrote:
>> 
>> I am all for keeping more code in Java if we can. I don't know enough about 
>> assembler-based optimizations to determine if it might be possible to do 
>> better on certain CPU architectures.
> 
> Me neither, but I have tested this on the architectures I have, and I
> suspect that C2 optimization is good enough.  And we'd have to write
> assembly code for machines we haven't got; something for the future, I
> think.
> 

Agreed.


>> One advantage, AFAIU, to intrinsics is that they are not subject to the 
>> vagaries of inlining thresholds. It's important that the loops operating 
>> over the arrays be compiled efficiently, otherwise performance can drop off 
>> a cliff if thresholds are reached within the loop. Perhaps these methods are 
>> small enough that it is not an issue? And perhaps that is anyway not a 
>> sufficient argument to justify the cost of an intrinsic (and we should 
>> really be tweaking the inlining mechanism instead)?
> 
> Maybe so.  There are essentially two ways to do this: new compiler
> node types which punt everything to the back end (and therefore
> require back-end authors to write them) or generic expanders, which is
> how many of the existing intrinsics are done.  Generic C2 code would,
> I suspect, be worse than writing this in Java because it would be
> lacking profile data.
> 
>> With that in mind, is there any need to intrinsify the new methods at all, 
>> given that the new Java methods can defer to the older ones based on a 
>> constant check? Also, should that be done anyway for the interpreter?
>> 
>> 
>> private static final boolean IS_UNALIGNED = theUnsafe.unalignedAccess();
>> 
>> public void putIntUnaligned(Object o, long offset, int x) {
>>     if (IS_UNALIGNED || (offset & 3) == 0) {
>>         putInt(o, offset, x);
>>     } else if (byteOrder == BIG_ENDIAN) {
>>         putIntB(o, offset, x);
>>     } else {
>>         putIntL(o, offset, x);
>>     }
>> }
> 
> Yes.  It certainly could be done like this but I think C1 doesn't do
> the optimization to remove the IS_UNALIGNED test, so we'd still want
> the C1 builtins.

Hmm.... if I run the following code on my Mac:

import java.nio.ByteOrder;

public class Y {

    static final boolean X =
            ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN; // Just get a constant from somewhere...

    public static void main(String[] args) {
        Y y = new Y();

        int s = 0;
        for (int i = 0; i < 1000000; i++) {
            s += y.x(i);
        }
        System.out.println(s);
    }

    public int x(int i) {
        if (X || (i & 3) == 0) {
            return i + i;
        }
        else {
            return Integer.sum(i, i);
        }
    }
}

With the options:

  -XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

Then Y.x is compiled to:

  0x000000011155ea80: mov    %eax,-0x14000(%rsp)
  0x000000011155ea87: push   %rbp
  0x000000011155ea88: sub    $0x30,%rsp         ;*getstatic X
                                                ; - Y::x@0 (line 43)

  0x000000011155ea8c: mov    %rdx,%rax
  0x000000011155ea8f: add    %edx,%eax
  0x000000011155ea91: add    $0x30,%rsp
  0x000000011155ea95: pop    %rbp
  0x000000011155ea96: test   %eax,-0x62ca9c(%rip)        # 0x0000000110f32000
                                                ;   {poll_return}
  0x000000011155ea9c: retq   

For tiered compilation I do see code generated with a test, jump, and deopt:

  0x00000001112fe020: mov    %eax,-0x14000(%rsp)
  0x00000001112fe027: push   %rbp
  0x00000001112fe028: sub    $0x30,%rsp
  0x00000001112fe02c: mov    $0x125633860,%rax  ;   {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe036: mov    0xdc(%rax),%edi
  0x00000001112fe03c: add    $0x8,%edi
  0x00000001112fe03f: mov    %edi,0xdc(%rax)
  0x00000001112fe045: mov    $0x125633508,%rax  ;   {metadata({method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe04f: and    $0x1ff8,%edi
  0x00000001112fe055: cmp    $0x0,%edi
  0x00000001112fe058: je     0x00000001112fe07f  ;*getstatic X
                                                ; - Y::x@0 (line 43)

  0x00000001112fe05e: mov    $0x125633860,%rax  ;   {metadata(method data for {method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
  0x00000001112fe068: incl   0x118(%rax)        ;*ifne
                                                ; - Y::x@3 (line 43)

  0x00000001112fe06e: mov    %rdx,%rax
  0x00000001112fe071: add    %edx,%eax
  0x00000001112fe073: add    $0x30,%rsp
  0x00000001112fe077: pop    %rbp
  0x00000001112fe078: test   %eax,-0x62f07e(%rip)        # 0x0000000110ccf000
                                                ;   {poll_return}
  0x00000001112fe07e: retq   
  0x00000001112fe07f: mov    %rax,0x8(%rsp)
  0x00000001112fe084: movq   $0xffffffffffffffff,(%rsp)
  0x00000001112fe08c: callq  0x0000000110e63c60  ; OopMap{rsi=Oop off=145}
                                                ;*synchronization entry
                                                ; - Y::x@-1 (line 43)
                                                ;   {runtime_call}

But the method later gets compiled to the shape of the former code, once the level 3 (C1 + profiling) version is replaced by a level 4 (C2) compile and made not entrant, e.g.:

    471   16       3       Y::x (22 bytes)
    472   17       4       Y::x (22 bytes)
    472   16       3       Y::x (22 bytes)   made not entrant



>  Perhaps we could do without the C2 builtins but they
> cost very little, they save C2 a fair amount of work, and they remove
> the vagaries of inlining.

'Tis true, the changes are small.


>  I take your point about the interpreter,
> though.
> 
>> I see you optimized the unaligned getLong by reading two aligned longs and 
>> then bit twiddling. It seems harder to optimize putLong by straddling an 
>> aligned putInt with one to three required putByte calls.
> 
> Sure, that's always a possibility.  I have code to do it but it was
> all getting rather complicated for my taste.

I thought it might.
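
For reference, the get side amounts to something like the following (a rough
sketch of the technique, not the code in the webrev; getLongAligned stands in
for a plain aligned read, and only the little-endian combination is shown):

public long getLongUnaligned(Object o, long offset) {
    if ((offset & 7) == 0) {
        return getLongAligned(o, offset);     // already aligned: one read
    }
    long base  = offset & ~7L;                // round down to an 8-byte boundary
    int  shift = (int) (offset & 7) << 3;     // misalignment in bits, 8..56
    long lo = getLongAligned(o, base);        // word holding the first bytes
    long hi = getLongAligned(o, base + 8);    // word holding the remaining bytes
    // Note: the aligned reads touch bytes outside offset..offset+7,
    // which is inherent to the straddling technique.
    return (lo >>> shift) | (hi << (64 - shift));
}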


> 
>>> Also, these methods have the additional benefit that they are always atomic 
>>> as long as the data are naturally aligned.
>> 
>> We should probably document that in general access is not guaranteed to be 
>> atomic, and that its being atomic when naturally aligned is an 
>> implementation detail.
> 
> I think that's a good idea.  The jcstress tests already come up with a
> warning that the implementation is not atomic; this is not required,
> but a high-quality implementation will be.
> 
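Something like the following wording, perhaps (a sketch, not proposed
verbatim for the javadoc):

/**
 * ...
 * This access is not guaranteed to be atomic. As an implementation
 * detail it currently is atomic when the address is naturally
 * aligned for the access size, but callers should not rely on that.
 */
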
>>> This does result in rather a lot of code for the methods for all sizes and 
>>> endiannesses, but none of it will be used on machines with unaligned 
>>> hardware support except in the interpreter.  (Perhaps the interpreter too 
>>> could have intrinsics?)
>>> 
>>> I have changed HeapByteBuffer to use these methods, with a major 
>>> performance improvement.  I've also provided Unsafe methods to query 
>>> endianness and alignment support.
>> 
>> If we expose the endianness query via a new method in Unsafe, we should 
>> reuse that in java.nio.Bits and get rid of the associated static code block.
> 
> Sure, I already did that.
> 

Locally, I guess? (Just in case I missed something in the current webrev.)
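
For reference, the kind of simplification I mean in Bits (a sketch; the
query name isBigEndian is my assumption, the webrev may spell it
differently):

private static final ByteOrder BYTE_ORDER =
        unsafe.isBigEndian() ? ByteOrder.BIG_ENDIAN
                             : ByteOrder.LITTLE_ENDIAN;

static ByteOrder byteOrder() {
    return BYTE_ORDER;    // replaces the write-a-long/read-a-byte probe
}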

Paul.
