Thanks a lot for your answer and for the confirmation that my understanding
is correct.

On Wed, Feb 5, 2025 at 12:30 PM Aleksey Shipilev <[email protected]>
wrote:

> On 2/3/25 12:06, Peter Veentjer wrote:
> > Imagine the following code:
> >
> > ... lots of writes to the buffer
> > buffer.putInt(a_offset,a_value)  (1)
> > buffer.putRelease(b_offset,b_value) (2)
> > releaseFence() (3)
> > buffer.putInt(c_offset,c_value) (4)
> >
> > Buffer is a chunk of memory that is shared with another process and the
> writes need to be seen in
> > order. So when 'b' is seen, 'a' should be seen. And when 'c' is seen,
> 'b' should be seen. There is
> > no other synchronization.
> >
> > All offsets are guaranteed to be naturally aligned. All the putInts are
> plain puts (using Unsafe).
> >
> > The putRelease (2) will ensure that 'a' is seen before 'b', and it will
> ensure atomicity and
> > visibility of 'b' (so the appropriate compiler and memory fences where
> needed).
> >
> > The releaseFence (3) will ensure that b is seen before c.
>
> Looks to me this fence can be replaced with releasing store of "c":
>
>   buffer.putInt(a_offset,a_value)
>   buffer.putRelease(b_offset,b_value)
>   buffer.putRelease(c_offset,c_value)
>
> My preference is almost always to avoid the explicit fences if you can
> control the memory ordering
> of the actual accesses. Using putRelease instead of an explicit fence also
> forces you to think about the
> symmetries: should all loads of "c" be performed with getAcquire to match
> the putRelease?
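
A minimal, single-threaded sketch of this suggestion using `java.lang.invoke.VarHandle` (the buffer size and the offsets 0/4/8 are made up for illustration; in the original scenario the buffer would be memory shared with another process):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ReleaseChain {
    // View the ByteBuffer as ints; non-plain access modes require
    // naturally aligned offsets, which the direct buffer gives us here.
    static final VarHandle INT =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    static String run() {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64);
        int A = 0, B = 4, C = 8;  // hypothetical a_offset, b_offset, c_offset

        // Writer side: plain store of 'a', then release stores of 'b' and 'c'.
        // Each release store orders all preceding writes before itself,
        // replacing the explicit releaseFence().
        INT.set(buffer, A, 1);
        INT.setRelease(buffer, B, 2);  // 'a' visible no later than 'b'
        INT.setRelease(buffer, C, 3);  // 'b' visible no later than 'c'

        // Reader side: acquire loads to mirror the release stores.
        int c = (int) INT.getAcquire(buffer, C);
        int b = (int) INT.getAcquire(buffer, B);
        int a = (int) INT.get(buffer, A);
        return a + " " + b + " " + c;
    }

    public static void main(String[] args) {
        System.out.println(ReleaseChain.run());
    }
}
```

Being single-threaded, this only shows the API shape, not the ordering guarantee itself; the release/acquire pairing is what matters once a second thread or process reads the buffer.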
>
> > My question is about (4). Since it is a plain store, the compiler can do
> a ton of trickery including
> > the delay of visibility of (4). Is my understanding correct and is there
> anything else that could go
> > wrong?
>
> The common wisdom is indeed "let's put non-plain memory access mode, so
> the access is hopefully more
> prompt", but I have not seen any of these effects thoroughly quantified
> beyond "let's forbid the
> compiler to yank our access out of the loop". Maybe I have not looked hard
> enough.
>
> I suspect the delays introduced by the compiler moving code around in
> sequential code streams are on a
> scale where they do not matter all that much for end-to-end latency. The
> only (?) place where code
> movement impact could be multiplied to a macro-effect is when the memory
> ops shift in/out/around the
> loops. I would not be overly concerned about latency impact of reordering
> within the short straight
> code stream.
>
> You can try to measure it with producer-consumer / ping-pong style
> benchmarks: put more memory ops
> around (4), turn on instruction scheduler randomizers (-XX:+StressLCM
> should be useful here, maybe
> -XX:+StressGCM), see if there is an impact. I suspect the effect is too
> fine-grained to be
> accurately measured with direct timing measurements, so you'll need to get
> creative about how to measure
> "promptness".
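
For reference, StressLCM and StressGCM are diagnostic HotSpot flags, so they need to be unlocked first. A sketch of such an invocation (the jar and benchmark names are hypothetical placeholders for your own ping-pong benchmark):

```shell
# StressLCM randomizes local code motion, StressGCM randomizes global
# code motion in C2; both are diagnostic flags and must be unlocked.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:+StressLCM -XX:+StressGCM \
     -jar benchmarks.jar PingPongBench
```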
>
> > What would be the lowest memory access mode that would resolve this
> problem? My guess is that the
> > last putInt, should be a putIntOpaque.
>
> Yes, in current HotSpot, opaque would effectively pin the access in place,
> so it would be exposed to
> hardware in the order closer to original source code order. Then it is up
> to hardware to see when to
> perform the store. But as I said above, I'll be surprised if it actually
> matters.
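
In VarHandle terms, the original four-step sequence with the last plain put replaced by an opaque store might look like the following sketch (offsets and values hypothetical; single-threaded, so it demonstrates the access modes rather than the concurrency):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OpaqueStore {
    static final VarHandle INT =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    static String run() {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64);
        int A = 0, B = 4, C = 8;  // hypothetical a_offset, b_offset, c_offset

        INT.set(buffer, A, 1);         // (1) plain store of 'a'
        INT.setRelease(buffer, B, 2);  // (2) release store of 'b'
        VarHandle.releaseFence();      // (3) orders 'b' before 'c'
        INT.setOpaque(buffer, C, 3);   // (4) opaque: the compiler cannot elide
                                       //     or sink this store, pinning it
                                       //     close to source order

        int c = (int) INT.getOpaque(buffer, C);
        int b = (int) INT.get(buffer, B);
        int a = (int) INT.get(buffer, A);
        return a + " " + b + " " + c;
    }

    public static void main(String[] args) {
        System.out.println(OpaqueStore.run());
    }
}
```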
>
> Thanks,
> -Aleksey
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To view this discussion, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/CAGuAWdAsWprk9BK46iJdZ_w1wPBcM4OCkDgCLTAP98B4VCPscw%40mail.gmail.com.