> I can't think of any reason it should be implemented in that way as
> long as the cache protocol has a total order (which it must given that
> the μops that generate the cache coherency protocol traffic have a
> total order), a state transition from X to E can be done in a bounded
> number of cycles.
my understanding is that in this context this only means that different
processors see the same order.  it doesn't say anything about fairness.

> The read function will try to find a value for addr in cache, then
> from memory. If the LOCK-prefixed instruction's decomposed read μop
> results in this behavior, a RFO miss can and will happen multiple
> times. This will stall the pipeline for multiple memory lookups. You
> can detect this with pipeline stall performance counters that will be
> measurably (with significance) higher on the starved threads.
> Otherwise, the pipeline stall counter should closely match the RFO
> miss and cache miss counters.

yes.

> For ainc() specifically, unless it was inlined (which ISTR the Plan 9
> C compilers don't do, but you'd know that way better than me), I can't
> imagine that screwing things up. The MOV's can't be LOCK-prepended
> anyway (nor do they deal with memory), and this gives other processors
> time to do cache coherency traffic.

it doesn't matter if this is hard to do.  if it is possible under any
circumstances, with any protocol-adhering implementation, then the
assertion that amd64 lock is wait-free is false.

- erik