> On Oct 8, 2015, at 6:18 PM, John Rose <john.r.r...@oracle.com> wrote:
> 
> On Oct 8, 2015, at 12:39 AM, Gil Tene <g...@azul.com> wrote:
>> 
>> On the one hand:
>> 
>> I like the idea of (an optional?) boolean parameter as a means of hinting at 
>> the thing that may terminate the spin. It's probably much more general than 
>> identifying a specific field or address. And it can be used to cover cases 
>> that poll multiple addresses (an or in the boolean) or look at a termination 
>> time. If the JVM can track down the boolean's evaluation to dependencies on 
>> specific memory state changes, it could pass it on to hardware, if such 
>> hardware exists.
> 
> Yep.  And there is a user-mode MWAIT in SPARC M7, today.

Cool. Didn't know that. So now it's SPARC M7 and ARM v8. Both fairly new, but 
the pattern of monitoring a single address (or range) and waiting on a 
potential change to it seems common (and similar to the kernel mode 
MONITOR/MWAIT in x86). Anything similar coming (or already here) in Power or 
MIPS?

> For Intel, Dave Dice wrote this up:
>  https://blogs.oracle.com/dave/entry/monitor_mwait_for_spin_loops

Cool writeup. But with the current need to transition to kernel mode this may 
work for loops that want to idle and save power and are willing to sacrifice 
reaction time to do so. But it is the opposite of what a spinHintLoop() would 
typically be looking to do. On modern x86, for example, adding a pause 
instruction improves the reaction speed of the spin loop (see charts attached 
to JEP), but adding the trapping cost and protection mode transition of a 
system call to do an MWAIT will almost certainly do the opposite.

If/when MONITOR/MWAIT becomes available in user mode, it will join ARM v8 and 
SPARC M7 in a common useful paradigm.

> Also, from a cross-platform POV, a boolean would provide an easy to use 
> "hook" for profiling how often the polling is failing.  Failure frequency is 
> an important input to the tuning of spin loops, isn't it?  Why not feed that 
> info through to the JVM?

I don't follow. Perhaps I'm missing something. Spin loops are "strange" in that 
they tend to not care about how "fast" they spin, but do care about their 
reaction time to a change in the thing(s) they are spinning on. I don't think 
profiling will help here…

E.g. in the example tests for this JEP on Ivy Bridge Xeons, adding an 
intrinsified spinLoopHint() to a simple spin-on-a-volatile-value loop appears to 
reduce the "spin throughput" by a significant ratio (3x-5x for L1-sharing 
threads), but also reduces the reaction time by 35-50%.
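A rough sketch of how one trial of such a measurement might look. This is not the JEP's actual test harness; the class and method names are made up for illustration, and Thread.onSpinWait() (assumes JDK 9 or later) stands in for the proposed spinLoopHint():

```java
final class ReactionTimeSketch {
    private volatile boolean flag;   // the volatile value being spun on
    private long writeTimeNanos;     // made visible to the spinner by the volatile write to 'flag'

    // Runs one trial and returns {reactionNanos, spinCount}.
    // useHint selects whether the loop body issues the spin hint.
    long[] measureOnce(boolean useHint) {
        flag = false;
        Thread writer = new Thread(() -> {
            try {
                Thread.sleep(5);                    // let the spinner settle into its loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            writeTimeNanos = System.nanoTime();     // timestamp taken just before the store
            flag = true;
        });
        writer.start();

        long spins = 0;
        while (!flag) {
            if (useHint) {
                Thread.onSpinWait();                // assumption: JDK 9+ stand-in for spinLoopHint()
            }
            spins++;
        }
        long reactionNanos = System.nanoTime() - writeTimeNanos;
        return new long[] { reactionNanos, spins };
    }
}
```

A single trial is noisy. Comparing measureOnce(true) against measureOnce(false) over many trials would show the spin count and the reaction time as two separate quantities, which is the distinction being made here.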

> ...
>> and if/when it does, I'm not sure the semantics of passing the boolean 
>> through are enough to cover the actual way to use such hardware when it 
>> becomes available.
> 
> The alternative is to have the JIT pattern-match for loop control around the 
> call to Thread.yield. That is obviously less robust than having the user 
> thread the poll condition bit through the poll primitive.

I don't think that's the alternative. The alternative(s) I suggest require no 
analysis by the JIT:

The main means of spin loop hinting I am suggesting is a simple no args hint. 
[Folks seem to be converging on using Thread as the home for this stuff, so 
I'll use that]:

E.g.:
while (!done) {
        Thread.spinLoopHint();
}
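A self-contained version of the same loop, as a sketch only. The spinLoopHint() here is a local stand-in for the proposed method; it assumes a JDK 9 or later runtime and delegates to Thread.onSpinWait(). The class name and the AtomicBoolean flag are illustrative choices:

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SpinHintSketch {
    // Hypothetical stand-in for the proposed Thread.spinLoopHint().
    // Assumption: JDK 9+, where Thread.onSpinWait() is available.
    static void spinLoopHint() {
        Thread.onSpinWait();
    }

    // Spins until 'done' becomes true; returns how many times the loop body ran.
    static long spinUntil(AtomicBoolean done) {
        long spins = 0;
        while (!done.get()) {   // volatile read on every iteration
            spinLoopHint();
            spins++;
        }
        return spins;
    }

    // Starts a thread that sets the flag after a short delay, then spins on it.
    static long demo() {
        AtomicBoolean done = new AtomicBoolean(false);
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.set(true);
        });
        setter.start();
        return spinUntil(done);
    }
}
```

The hint takes no arguments and returns nothing, so the loop condition stays ordinary Java and nothing needs to be inferred about what is being polled.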

The second form I'm suggesting (mostly in reaction to discussion on this 
thread) directly captures the notion that a single address is being monitored:

E.g. 

volatile boolean done;
static final Field doneField = …;
...
// ugly method name I'm not married to...
Thread.spinExecuteWhileTrue( () -> !done, doneField, this );

or a slightly more complicated one: 

Thread.spinExecuteWhileTrue( () -> { count++; return !done; }, doneField, this );

[These Thread.spinExecuteWhileTrue() examples will execute the BooleanSupplier 
each time, but will only watch the specified field for changes in the spin 
loop, potentially pausing the loop until a change in the field is detected, but 
will not pause indefinitely. This can be implemented with a MONITOR/MWAIT, 
WFE/SEVL, or by just using a PAUSE instruction and not watching the field at 
all.]
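A minimal sketch of the last option, the fallback that just uses a spin hint and does not watch the field. The method is written as a static helper rather than a real Thread method, the class names are hypothetical, and Thread.onSpinWait() (assumes JDK 9 or later) plays the role of PAUSE:

```java
import java.lang.reflect.Field;
import java.util.function.BooleanSupplier;

final class SpinExecuteSketch {
    // Hypothetical fallback for the proposed Thread.spinExecuteWhileTrue().
    // 'watched' and 'target' name the location a MWAIT/WFE-capable
    // implementation could monitor; this fallback accepts and ignores them.
    static void spinExecuteWhileTrue(BooleanSupplier body, Field watched, Object target) {
        while (body.getAsBoolean()) {   // the supplier runs on every iteration
            Thread.onSpinWait();        // assumption: JDK 9+; a plain spin hint
        }
    }
}

final class DoneWaiter {
    volatile boolean done;
    long count;
    static final Field DONE_FIELD = lookupDoneField();

    private static Field lookupDoneField() {
        try {
            return DoneWaiter.class.getDeclaredField("done");
        } catch (NoSuchFieldException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Spins, counting iterations, until another thread sets 'done'.
    void awaitDone() {
        SpinExecuteSketch.spinExecuteWhileTrue(
                () -> { count++; return !done; }, DONE_FIELD, this);
    }
}
```

Since the caller states the watched field explicitly, an implementation that can monitor an address has no guessing to do, and one that cannot simply degrades to the plain hint.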

(for Java 9, a varhandle variant of the above reflection based model is 
probably more appropriate. I spelled this with the reflection form for 
readability by pre-varhandles-speakers).
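For comparison, a sketch of what the varhandle spelling might look like, assuming the java.lang.invoke.VarHandle API as it exists in JDK 9 and later. The spinExecuteWhileTrue() signature and the class name are hypothetical, and the body is again only the plain spin-hint fallback:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.function.BooleanSupplier;

final class VarHandleSpinSketch {
    volatile boolean done;

    // Handle naming the single field to be watched.
    static final VarHandle DONE;
    static {
        try {
            DONE = MethodHandles.lookup()
                    .findVarHandle(VarHandleSpinSketch.class, "done", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical VarHandle-based signature; this fallback ignores
    // 'watched' and 'target' and only issues a spin hint (JDK 9+).
    static void spinExecuteWhileTrue(BooleanSupplier body, VarHandle watched, Object target) {
        while (body.getAsBoolean()) {
            Thread.onSpinWait();
        }
    }

    // Spins until another thread sets 'done', reading it through the handle.
    void awaitDone() {
        spinExecuteWhileTrue(() -> !(boolean) DONE.getVolatile(this), DONE, this);
    }
}
```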

Neither of these forms requires any specific JIT matching or exploration. We 
know the first form is fairly robust on architectures that support stuff like 
PAUSE. The second form will probably be robust both on architectures that 
support MWAIT or WFE and on those that support PAUSE (those just won't watch 
anything).

On how this differs from a single boolean parameter: My notion (in the example 
above) of a single poll variable would be one that specifically designates the 
poll variable as a field (or maybe array index as an option), rather than 
provide a boolean parameter that is potentially evaluated based on data read 
from more than one memory location.

The issue is that while it's an easy fit if the boolean is computed based on 
evaluating a single address, it becomes fragile if multiple addresses are 
involved and the hardware can only watch one (which is the current trend for 
ARM v8, SPARC M7, and a potential MONITOR/MWAIT x86). It would be "hard" for a 
JIT to figure out which of the addresses read to compute the boolean should be 
watched in the spin. And getting it wrong can have potentially surprising 
consequences (not just lack of benefit, but terribly slow execution due to 
waiting for something that is not going to be externally modified and timing 
out each time before spinning).

e.g. these probably look good to a programmer:

while (!pollSpinExit(done1 || done2 || (count++ > limit))) {
}

And it could translate to the following rough mixed pseudo code:

        SEVL
loop:
        WFE
        ldaxrh %done1, [done1]
        if (!(%done1 || done2 || (count++ > limit))) goto loop
        …

But it could also be translated to:

        SEVL
loop:
        WFE     
        ldaxrh %done2, [done2]
        if (!(done1 || %done2 || (count++ > limit))) goto loop
        …

(or a third option that decides to watch count instead).

None of these are "right". And there is nothing in the semantics that suggests 
which one to expect.

You could fall back and say that you would only get the benefit if there is 
exactly one address used in deriving the boolean, but this would probably make 
it hard to code to and maintain. A form that forces you to specify the polling 
parameter would be less generic in expression, but also less fragile to 
program to, IMO.

> 
>> It is probably premature to design a generic way to provide addresses and/or 
>> state to this "spin until something interesting changes" stuff without 
>> looking at working examples. A single watched address API is much more 
>> likely to fit current implementations without being fragile.
>> 
>> ARM v8's WFE is probably the most real user-mode-accessible thing for this 
>> right now (MWAIT isn't real yet, as it's not accessible from user mode). We 
>> can look at an example of how a spinloop needs to coordinate the use of WFE, 
>> SEVL, and the evaluation of memory location with load exclusive operations 
>> here: http://lxr.free-electrons.com/source/arch/arm64/include/asm/spinlock.h 
>> . The tricky part is that the SEVL needs to immediately precede the loop 
>> (and all accesses that need to be watched by the WFE), but can't be part of 
>> the loop (if it were in the loop the WFE would always trigger immediately). But 
>> the code in the spinning loop can only track a single address (the 
>> exclusive tag in load exclusive applies only to the most recent address 
>> used), so it would be wrong to allow generic code in the spin (it would have 
>> to be code that watches exactly one address). 
>> 
>> My suspicion is that the "right" way to capture the various ways a spin loop 
>> would need to interact with WFE logic will be different than tracking things 
>> that can generically affect the value of a boolean. E.g. the evaluation of 
>> the boolean could be based on multiple addresses, and since it's not clear 
>> (in the API) that this is a problem, the benefits derived would be fragile.
> 
> Having the JIT explore nearby loop structure for memory references is even 
> more fragile.

Agreed. Which is why I'm not suggesting it.

> If we can agree that (a) there are advantages to profiling the boolean 
> parameter for all platforms, and (b) the single-poll-variable case is likely 
> to be optimizable sooner *with* a parameter than *without*, maybe this is 
> enough to tip the scales towards boolean parameter.

I guess that's where we differ: I don't see a benefit in profiling the spin 
loop, so we disagree on (a). And hence (b) is not relevant…

Maybe I'm mis-reading what you mean by "profiling" and "optimizing" above?

> The idea would be that programmers would take a little extra thought when 
> using yield(Z)Z, and get paid immediately from good profiling.  They would 
> get paid again later if and when platforms analyze data dependencies on the Z.
> 
> If there's no initial payoff, then, yes, it is hard asking programmers to 
> expend extra thought that only benefits on some platforms.

Whatever the choices end up being, we could provide multiple signatures or 
APIs. E.g. I think that the no-args spinLoopHint() is the de-facto spinning 
model for x86 and Power (and has been for over a decade for everything outside 
of Java). So it's a safe bet and a natural form. The 
spin-execute-something-while-watching-a-single-address model is *probably* a 
good fit for some relatively young but very useful hardware capabilities, and 
can probably be captured in a long-lasting API as well.

More complicated boolean-derived-from-pretty-much-anything or multi-address 
watching schemes are IMO too early to evaluate. E.g. they could potentially 
leverage some just-around-the-corner (or recently arrived) features like TSX 
and NCAS schemes, but since there is relatively little experience with using 
such things for spinning (outside of Java), it is probably premature to 
solidify a Java API for them.

BTW, even with user-mode MWAIT and cousins, and with the watch-a-single-address 
API forms, we may be looking at two separate motivations, and may want to 
consider a hint of which one is intended. E.g. one of spinLoopHint()'s main 
drivers is latency improvement, and the other is power reduction (with 
potential speed benefits or just power savings benefits). It appears that on 
x86 a PAUSE provides both, so there is no choice needed there. But MWAIT may be 
much more of a power-centric approach that sacrifices latency, and that may be 
OK for some and un-OK for others. We may want to have API variants that allow a 
hint about whether power-reduction or latency-reduction is the preferred driver.
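One possible shape for such a variant, purely as a sketch. The SpinIntent enum, the method name, and the behavior chosen for each intent are all made up for illustration. In particular, LockSupport.parkNanos() is only a stand-in for a power-oriented wait such as MWAIT, and Thread.onSpinWait() assumes JDK 9 or later:

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

final class SpinIntentSketch {
    // Hypothetical hint describing what the caller cares about most.
    enum SpinIntent { LATENCY, POWER }

    // Runs 'body' repeatedly while it returns true, waiting between
    // iterations in a way chosen by the caller's stated intent.
    static void spinWhileTrue(BooleanSupplier body, SpinIntent intent) {
        while (body.getAsBoolean()) {
            if (intent == SpinIntent.LATENCY) {
                // Stay hot and react as quickly as possible (PAUSE-like).
                Thread.onSpinWait();
            } else {
                // Stand-in for a power-saving wait; timed, so it never
                // pauses indefinitely, but reaction time is sacrificed.
                LockSupport.parkNanos(1_000L);
            }
        }
    }
}
```

On a platform where one instruction serves both goals, as PAUSE appears to on x86, an implementation could ignore the intent and treat both values the same way.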

> 
> — John
> 
