2014-05-19 18:05 GMT-04:00 erik quanstrom <quans...@quanstro.net>:
> On Mon May 19 17:02:57 EDT 2014, devon.od...@gmail.com wrote:
>> So you seem to be worried that N processors in a tight loop of LOCK
>> XADD could have a single processor winning every time. This isn't a
>> problem because
>> locked instructions have total order. Section 8.2.3.8:
>>
>> "The memory-ordering model ensures that all processors agree on a
>> single execution order of all locked instructions, including those
>> that are larger than 8 bytes or are not naturally aligned."
>
> i don't think this solves any problems.  given thread 0-n all executing
> LOCK instructions, here's a valid ordering:
>
> 0       1       2               n
> lock    stall   stall   ...     stall
> lock    stall   stall   ...     stall
> ...                     ...
> lock    stall   stall   ...     stall
>
> i'm not sure if the LOCK really changes the situation.  any old exclusive
> cacheline access should do?

It is an ordering, but I don't think it's a valid one: your ellipses
suggest an unbounded execution time (given the context of the
discussion), and the protocol can't possibly negotiate execution for
more instructions than it has space for in its pipeline. Moreover, the
pipeline cannot be filled with LOCK-prefixed instructions: the
processor also has to schedule instruction fetch, and it pipelines
μops, not whole instructions, anyway; part of the execution cycle is
decomposing an instruction into its μops. At some point, that
processor is not going to be executing a LOCK instruction, it is going
to be executing some other μop (like decoding the next LOCK-prefixed
instruction it wants to execute), and that won't be done under any
synchronization. When this happens, other processors will execute
their LOCK-prefixed instructions.

The only way I could think to try to force this execution history was
to unroll a loop of LOCK-prefixed instructions. In a tight loop, a
program I wrote to do LOCK XADD 10 billion times per thread (across 4
threads on my 4-core system) finished with a standard deviation in
cycle count of around 1%. When I unroll the loop enough to fill the
pipeline, the stddev actually decreases (to about 0.5%), which leads
me to believe that the processor actively mitigates that sort of
instruction "attack" for highly concurrent workloads.

So either way, you're still bounded. Eventually p0 has to go do
something that isn't a LOCK-prefixed instruction, like decode the next
one. I don't know how to get the execution order you suggest. You'd
have to manage to fill the pipeline on the processor while starving
the pipeline on the others and preventing them from executing any
further instructions. Instruction load and decode stages are shared,
so I really don't see how you'd manage this without using PAUSE
strategically. You'd have to con the processor into executing that
order. At that point, just use a mutex :)

--dho

> the documentation appears not to cover this completely.
>
> - erik
>
