> I can't think of any reason it should be implemented in that way as
> long as the cache protocol has a total order (which it must given that
> the μops that generate the cache coherency protocol traffic have a
> total order), a state transition from X to E can be done in a bounded
> number of cycles.

my understanding is that in this context a total order only means that
different processors see the same order of transitions.  it says nothing
about fairness.

> The read function will try to find a value for addr in cache, then
> from memory. If the LOCK-prefixed instruction's decomposed read μop
> results in this behavior, a RFO miss can and will happen multiple
> times. This will stall the pipeline for multiple memory lookups. You
> can detect this with pipeline stall performance counters that will be
> measurably (with significance) higher on the starved threads.
> Otherwise, the pipeline stall counter should closely match the RFO
> miss and cache miss counters.

yes.

> For ainc() specifically, unless it was inlined (which ISTR the Plan 9
> C compilers don't do, but you'd know that way better than me), I can't
> imagine that screwing things up. The MOV's can't be LOCK-prepended
> anyway (nor do they deal with memory), and this gives other processors
> time to do cache coherency traffic.

it doesn't matter if this is hard to trigger.  if it is possible under
any circumstances, with any protocol-adhering implementation, then the
assertion that amd64 lock is wait-free is false.

- erik
