On 9 November 2015 at 10:39, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia.com> wrote:
>
>
>
> From: EXT Nicolas Morey-Chaisemartin [mailto:nmo...@kalray.eu]
> Sent: Thursday, November 05, 2015 6:23 PM
> To: Savolainen, Petri (Nokia - FI/Espoo); Bill Fischofer
> Cc: LNG ODP Mailman List
> Subject: Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add 
> odp_sync_loads
>
>
> On 11/05/2015 04:54 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> The majority of platforms are cache coherent; for a non-coherent platform you'd
> need to define some additional API calls. This is documenting the current
> assumption that the application does this …

Depending on how you signal the consumer of the shared data, you might
need different barriers. E.g. if you signal the consumer by writing to
some device register which generates an interrupt to the consumer, on
ARM you need a DSB; a DMB is normally not enough, and I doubt
__atomic_thread_fence() generates a DSB instruction. So a generic
mechanism that can be used regardless of the signalling method would
require the strongest barrier, and even that might still not be enough
for the actual scenarios these calls are intended for.
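
To make the difference concrete, something like the sketch below (my own
illustration, not a proposed API; only __atomic_thread_fence() is a real
builtin, the inline asm and the function name are made up):

    /* A "generic" store sync that must also order normal stores against a
     * subsequent MMIO write raising an interrupt would need the strongest
     * barrier on ARM. */
    static inline void sync_stores_strongest(void)
    {
    #if defined(__aarch64__)
            __asm__ volatile("dsb sy" ::: "memory");  /* full-system DSB */
    #else
            __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* typically only a DMB-class barrier */
    #endif
    }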

If there is no synchronisation between producer and consumer, you have
a data race and the program behaviour is undefined. The only
work-around is to make *all* shared data atomically read and written,
variable by variable. But then you couldn't atomically read a set of
variables (if that is important).
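
To illustrate what "variable by variable" means in practice (plain GCC
__atomic builtins, not an ODP API):

    static int foo;
    static int bar;

    static void producer(void)
    {
            /* Each store is individually race-free. */
            __atomic_store_n(&foo, 1, __ATOMIC_RELAXED);
            __atomic_store_n(&bar, 2, __ATOMIC_RELAXED);
    }

    static void consumer(int *f, int *b)
    {
            /* No data race per variable, but no guarantee that foo and bar
             * are read from the same point in time (no atomic snapshot). */
            *f = __atomic_load_n(&foo, __ATOMIC_RELAXED);
            *b = __atomic_load_n(&bar, __ATOMIC_RELAXED);
    }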

>
> // data in shared memory
> int foo;
>
> Thread A                                                      Thread B
>
> ev = schedule();
> msg = event_to_message(ev);
>
> if (msg == UPDATE_FOO) {
>     foo++;
>     msg = FOO_UPDATED;
>     queue_enq(msg);

queue_enq() here is a release operation with release ordering: all stores
before queue_enq() must be observable when the msg event can be observed
(e.g. by dequeueing it explicitly or from odp_schedule()). It also
guarantees that loads before queue_enq() have completed, so that a load
that misses in the cache will not read data that was updated after
queue_enq().
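
In C11/GCC-builtin terms, the semantics I mean are roughly this (a sketch
only, not the actual ODP queue implementation; ring, tail and RING_SIZE are
invented names, single producer assumed):

    /* conceptually inside queue_enq(msg): */
    uint32_t t = tail;                                 /* only the producer writes tail  */
    ring[t % RING_SIZE] = msg;                         /* store the event itself         */
    __atomic_store_n(&tail, t + 1, __ATOMIC_RELEASE);  /* release: all earlier loads and
                                                          stores (e.g. foo++) are ordered
                                                          before the new tail is visible */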

> }
>
>                                                                  ev = schedule();

schedule() is an acquire operation: it guarantees that the following loads
(and stores, though those are seldom speculated) will not be issued before
the event has been seen and properly "acquired".
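
Mirroring the release side above, roughly (again only a sketch with invented
names, not the ODP implementation):

    /* conceptually inside schedule()/dequeue: */
    uint32_t t = __atomic_load_n(&tail, __ATOMIC_ACQUIRE);  /* acquire: later loads of
                                                               ring[] and foo cannot be
                                                               satisfied before this    */
    if (t != head)
            msg = ring[head++ % RING_SIZE];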

>
>                                                                  msg = event_to_message(ev);
>
>                                                                  if (msg == FOO_UPDATED) {
>                                                                      printf("foo is now %d\n", foo);
>                                                                  }
>
>
> … instead of this …
>
> Thread A                                                      Thread B
>
> ev = schedule();
> msg = event_to_message(ev);
>
> if (msg == UPDATE_FOO) {
>     // Make sure that load of “foo” is not moved before schedule()
>     odp_sync_loads();

odp_sync_acquire() would be a better name.

>
>     foo++;
>     // Make sure that store of “foo” is not moved after queue_enq()
>    odp_sync_stores();

odp_sync_release() would be a better name.

I am somewhat OK with odp_sync_acquire()/odp_sync_release() (they should
also take a vector of memory segments, per the suggestion below) as long as
it is required that producer and consumer use the same ODP calls.

>     msg = FOO_UPDATED;
>     queue_enq(msg);
> }
>
>                                                                  ev = schedule();
>                                                                  msg = event_to_message(ev);
>
>                                                                  if (msg == FOO_UPDATED) {
>                                                                      // Make sure that load of "foo" is not moved before schedule()
>                                                                      odp_sync_loads();
>                                                                      printf("foo is now %d\n", foo);
>                                                                  }
>
>
> As you can see, it would be a lot of waste if the application must always
> sync stores and loads explicitly.
> I see your point.
> I re-read carefully the documentation and maybe we can work around it.
> The sync_stores documentation explicitly states that all store operations are
> globally visible, so the cache needs to be flushed.

If the Kalray HW isn't cache coherent between CPU's, neither old-style
(volatile + barriers) nor C11-style shared memory programming will work.
Probably some platform-specific code is necessary in order to share data
between CPU's and to properly synchronise the accesses. Perhaps the acquire
and release operations should take a vector of addresses and sizes,
specifying which data is being shared, so as to minimise cache flushing
(in the producer) and invalidation (in the consumer).
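
Roughly something like this hypothetical prototype (all names invented for
illustration; this is not an existing ODP API):

    typedef struct {
            void  *addr;   /* start of a shared region */
            size_t len;    /* length in bytes          */
    } odp_sync_seg_t;

    /* Producer: order/flush only the listed segments before publishing.    */
    void odp_sync_release(const odp_sync_seg_t segs[], int num_segs);

    /* Consumer: order/invalidate only the listed segments after acquiring. */
    void odp_sync_acquire(const odp_sync_seg_t segs[], int num_segs);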

>
> The load part only states that previous memory operations are complete and
> cannot be reordered around it. It does not explicitly state that the cache
> needs to be up to date. We currently implement "read memory barrier"
> internally as DINVAL + FENCE, but that is not necessarily required.
>
> If my understanding of the function fits with yours, I'm good with these
> patches. If not, we need to discuss things some more.
>
> Nicolas
>
>
>
> The above highlights how it should *not* be used (== not needed with
> ready-made ODP sync mechanisms). It’s intended to be used with non-ODP sync
> mechanisms like the one below. Store and load syncs are usually a pair
> (sync_stores -> sync_loads).

If either or both of the producer and consumer cannot use some common ODP
mechanism, the mechanism they actually use must be well defined (e.g. must
use DMB or DSB on ARM) so that both sides pair properly. So having one side
use e.g. odp_sync_stores() (without knowing its actual implementation) while
the other side uses some other, unspecified mechanism (not odp_sync_loads())
is a potential cause of problems. It is difficult for me to see a generic
implementation (in ODP) that can always pair with whatever undefined
mechanism is used by the peer. I think the code in an ODP application that
synchronises with a non-ODP producer or consumer must use the specific
mechanism agreed on, and ODP cannot provide any generic function that will
work regardless.

>
>
> -Petri
>
>
> // data and flag in shared memory
> volatile int flag = 0;

int flag = 0; is enough

>
> int foo = 0;
>
> Thread A                         Thread B
>
> foo = 0xbeef;
>
> // Make sure that store
> // of "foo" is visible before
> // store of "flag"
> odp_sync_stores();
> flag = 1;

Must be implemented as:
__atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
All loads and stores before this flag write (which releases the shared data)
must be observable before the store to flag itself is observable.


>
>                                  while (flag == 0)

while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
No load or store after the load of the flag may become observable before the
flag has been "acquired". Speculative stores are not so common, but loads can
be reordered, which could end up reading the (stale) shared data before the
flag has been set.

>                                      spin_and_wait();
>
>                                  // Make sure that load of
>                                  // "foo" does not happen before
>                                  // we see the flag changing
>                                  odp_sync_loads();
>
>                                  printf("foo is now %d\n", foo);
>
>
>
>
>
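
For reference, here is the whole flag example expressed with the GCC builtins
mentioned above, as one self-contained sketch of what I mean (not ODP code):

    #include <stdio.h>

    /* data and flag in shared memory */
    static int foo = 0;
    static int flag = 0;   /* plain int is enough, no volatile needed */

    /* Thread A */
    static void producer(void)
    {
            foo = 0xbeef;
            /* Release store: the store to foo is observable before flag == 1. */
            __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
    }

    /* Thread B */
    static void consumer(void)
    {
            /* Acquire load: the later load of foo cannot be satisfied with a
             * value older than what the producer published before the flag. */
            while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
                    ; /* spin_and_wait() */

            printf("foo is now %x\n", foo);
    }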
>
_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp
