On Saturday, October 29, 2016, Tom Lee <m...@tomlee.co> wrote:

> Peter's wording after that snippet implied he was trying to remove a
>> volatile read on the reader, so I think Aleksey is right that get() was
>> intended to return the local.
>
>
> I'm not so sure -- you'll notice in the last example he provided he does
> avoid a volatile read in inc() ... and since inc() will be run on a single
> thread and all reads are on the atomic/volatile var, I think it all works
> out. He's basically using `local` as a "thread-local" cache of the atomic
> value on the writer so he doesn't have to read the atomic.
>
Right, but I thought he was trying to avoid a volatile read on the
reader(s) side.  Either way, he should clarify, since there are clearly a
few different interpretations in this thread.

On the inc() side, he could use opaque mode (setOpaque for the store) as
well, to avoid any costly CPU fences on archs that would otherwise emit
them, and also to avoid two separate writes (potentially a non-issue if
they hit the same or adjacent cachelines, depending on microarchitectural
details like write combining, adjacent-cacheline prefetch, etc).  Two
separate writes, particularly one to the atomic instance, can also
exacerbate false sharing.
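For concreteness, here's a hypothetical sketch of that single-field,
opaque-mode variant (class name, field name, and structure are mine, not
Peter's actual code): one long field accessed through a VarHandle, with no
separate "local" cache, since the single writer's read of its own last
store is cheap:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class Counter {
    private static final VarHandle VH;
    static {
        try {
            VH = MethodHandles.lookup()
                    .findVarHandle(Counter.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long value; // written by exactly one thread

    // Single-writer increment: plain read of our own last write,
    // opaque-mode store so the new value eventually becomes visible
    // to readers without a CPU fence on weakly ordered archs.
    public void inc() {
        VH.setOpaque(this, value + 1);
    }

    // Opaque read: the compiler must reload the field on each call
    // (no caching/hoisting), but no happens-before edges are created.
    public long get() {
        return (long) VH.getOpaque(this);
    }
}
```

Correctness still depends on there being exactly one writer thread, and
opaque mode gives readers only eventual visibility with no happens-before,
so it can't be used to safely publish other state.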

>
> Peter, perhaps it's best if you can clarify but assuming that was your
> intent: whether there's a performance change on x86 is another question
> entirely. You're much better off benchmarking, but here's some wild
> speculation for you to consider [and for others to rubbish :)]:
>
> I would expect no obvious benefit on x86 between the `local` code and the
> `lazySet` version assuming this code is running in isolation, no context
> switches and/or there is nothing else invalidating cache lines. Assuming a
> single writer thread, the cache line should never leave the L1 cache on the
> writer thread after the first write. As a result, on all but the first call
> to inc() you should read the current value from the L1 cache. IIRC reads
> from L1 are roughly as fast as a read from a register, so I *think* the
> writer thread's "volatile read" should actually be quite fast.
>
Yes, I believe so as well - that was roughly my point about the volatile
read in get(), except get() callers will take cache misses.  The writer
will keep the cacheline in the modified (or owned) state, so the readers
are the ones taking the misses as the line keeps being modified.  In that
respect it's really no different from plain stores.


> Vitaly, whatever his intent the other details in your email were
> interesting all the same -- I'm not especially familiar with the newer C++
> memory ordering stuff. One question:
>
> The main "issue" here is readers will take cache misses and coherence
>> traffic as their cacheline will be invalidated when the writer stores the
>> new value.
>
>
> If I'm reading what you're saying RE: getOpaque correctly, it seems like
> the cache line invalidation performance hit would be unavoidable except
> on more relaxed archs [than x86], right?
>
Yes, I believe the readers cannot avoid missing in the cache, because
their copy of the line is constantly invalidated by the writer.  Depending
on the coherence protocol and its implementation details, the readers may
simply be invalidated, or the writer may broadcast changes to the reader
cores directly (skipping RAM).  The performance implications will also
depend on whether the reader(s) and writer are on the same socket or on
different sockets.

Picture the interaction here with plain stores and loads (i.e. no atomics,
no volatile), and assume the compiler doesn't hoist any reads - the same
coherence transactions will occur no matter what.

As for getOpaque, my understanding is that it's basically a compiler fence
("program order" semantics) - the compiler must actually read the memory,
and it cannot reorder loads and stores across the access.  This is similar
to a compiler seeing a call to a method it cannot inline and knows nothing
about: it cannot reorder memory operations across the call because it
can't prove the method doesn't modify memory.  Hence the term "opaque".
However, getOpaque doesn't emit any CPU fences, and that's the difference
between it and a volatile load on archs with weaker memory models.
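To make the hoisting point concrete, here's a minimal hypothetical example
(class and method names are mine): a reader spinning on getOpaque must
re-issue the load on every iteration, whereas a plain read could legally
be hoisted out of the loop, turning it into an infinite loop:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OpaqueSpin {
    private static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(OpaqueSpin.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private int flag;

    // getOpaque acts as a compiler fence: the load cannot be cached
    // across iterations, so the writer's store eventually terminates
    // the loop.  No CPU fence is emitted, unlike a volatile read on
    // weakly ordered archs.
    public void awaitSet() {
        while ((int) FLAG.getOpaque(this) == 0) {
            Thread.onSpinWait();
        }
    }

    public void set() {
        FLAG.setOpaque(this, 1);
    }
}
```

With a plain field read in awaitSet() instead, the JIT would be free to
read the flag once and spin forever, which is exactly the pitfall Aleksey
describes below with the `while (c.get() < 100);` loop.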

There's one thing I still can't get someone at Oracle to clarify, which is
whether getOpaque guarantees atomicity of the read (i.e. no torn 64-bit
loads).  I believe it does, but I don't think it's been explicitly stated
thus far.

>
> Cheers,
> Tom
>
> On Sat, Oct 29, 2016 at 8:33 AM, Vitaly Davidovich <vita...@gmail.com>
> wrote:
>
>>
>>
>> On Saturday, October 29, 2016, Olivier Bourgain <
>> olivierbourgai...@gmail.com> wrote:
>>
>>> I think this deserves a benchmark.
>>>
>>> Le samedi 29 octobre 2016 13:15:35 UTC+2, Aleksey Shipilev a écrit :
>>>>
>>>> On 10/29/2016 10:13 AM, Peter Veentjer wrote:
>>>> > So you get something like this:
>>>> >
>>>> > public class Counter {
>>>> >
>>>> >     private final AtomicLong c = new AtomicLong();
>>>> >
>>>> >     public void inc(){
>>>> >         long newValue = c.get()+1;
>>>> >         c.lazySet(newValue);
>>>> >     }
>>>> >
>>>> >     public long get(){
>>>> >         return c.get();
>>>> >     }
>>>> > }
>>>>
>>>> Does not work. With non-atomic updates, you set yourself for losing a
>>>> virtually unbounded number of updates. See e.g:
>>>> https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#pitfall-no-sync
>>>>
>>>>
>>> I agree that in the general case, this will lose updates, but Peter was
>>> asking for a "single writer performance counter", so I believe his code is
>>> correct.
>>>
>>> > But in this case you still need to do a volatile read to read the
>>>> actual
>>>> > value. In theory one could optimize this to:
>>>> >
>>>> > public class Counter {
>>>> >
>>>> >     private final AtomicLong c = new AtomicLong();
>>>> >     private long local;
>>>> >
>>>> >     public void inc(){
>>>> >         long newValue = local+1;
>>>> >         this.local = newValue;
>>>> >         c.lazySet(newValue);
>>>> >     }
>>>> >
>>>> >     public long get(){
>>>> >         return c.get();
>>>> >     }
>>>> > }
>>>>
>>>> I guess get() was meant to have "return local"? If so, this does not
>>>> work. With no ordering implied for the reads, you set yourself up for
>>>> this trouble:
>>>>
>>>
>>> Local is for the (single) writer, to avoid a volatile read to get the
>>> current counter value. Other threads (readers) need to perform a volatile
>>> read to get the correct value.
>>>
>>>
>>>
>> Peter's wording after that snippet implied he was trying to remove a
>> volatile read on the reader, so I think Aleksey is right that get() was
>> intended to return the local.
>>
>> However, as Aleksey noted, VH.getOpaque would be the right way to do that
>> (memory_order_relaxed pseudo equivalent in C++) to avoid compiler
>> surprises.  Although on x86 I don't think there's a difference with
>> volatile reads - both perform the load, and both serve as compiler fences
>> (getOpaque is doc'd as being in "program order").  On archs where a
>> volatile read involves a CPU fence, then getOpaque should be cheaper.
>>
>> The main "issue" here is readers will take cache misses and coherence
>> traffic as their cacheline will be invalidated when the writer stores the
>> new value.
>>
>>>
>>>> Counter c = ...;
>>>> while (c.get() < 100); // wait for it... forever!
>>>>
>>>> Because optimizers are free from hoisting your read before the loop,
>>>> reading the counter once, and reducing predicated loop into the
>>>> infinite
>>>> one. I would *guess* the fastest (not sanest, mind you) way out of this
>>>> is doing VarHandles:
>>>>
>>>> public class Counter {
>>>>      static final VarHandle VH = <get_VH_to_counter>;
>>>>
>>>>      long counter;
>>>>
>>>>      public void inc(){
>>>>          VH.getAndAdd(this, 1);
>>>>      }
>>>>
>>>>      public long get(){
>>>>          return VH.getOpaque(this);
>>>>      }
>>>> }
>>>>
>>>> Still, removing volatiles from get breaks whatever happens-before
>>>> guarantees for the counter. Which means, for example, you cannot use it
>>>> to "version" the objects:
>>>>
>>>> Counter cnt = new Counter();
>>>>
>>>> Thread 1:
>>>>   // assume cnt is zero
>>>>   x = 1;
>>>>   cnt.inc();
>>>>
>>>> Thread 2:
>>>>   if (cnt.get() == 1) {
>>>>     assert (x == 1); // oops, fails.
>>>>   }
>>>>
>>>> Thanks,
>>>> -Aleksey
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>> Sent from my phone
>>
>>
>
>
>
> --
> *Tom Lee */ http://tomlee.co / @tglee <http://twitter.com/tglee>
>
>


-- 
Sent from my phone
