Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-26 Thread Savolainen, Petri (Nokia - FI/Espoo)
Ping. This improves the current sync spec and implementation.

> -Original Message-
> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of EXT
> Petri Savolainen
> Sent: Monday, October 26, 2015 6:08 PM
> To: lng-odp@lists.linaro.org
> Subject: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add
> odp_sync_loads
> 
> Updated odp_sync_stores() specification and added odp_sync_loads
> to pair it. Used GCC __atomic_thread_fence to implement both of
> those.
> 
> Signed-off-by: Petri Savolainen 
> ---
>  include/odp/api/sync.h                    | 53 ++++++++++++++--------------
>  platform/linux-generic/include/odp/sync.h | 18 +++
>  2 files changed, 43 insertions(+), 28 deletions(-)
> 
> diff --git a/include/odp/api/sync.h b/include/odp/api/sync.h
> index 6477e74..8456f68 100644
> --- a/include/odp/api/sync.h
> +++ b/include/odp/api/sync.h
> @@ -23,37 +23,34 @@ extern "C" {
>   */
> 
>  /**
> - * Synchronise stores
> + * Synchronize stores
>   *
> - * Ensures that all CPU store operations that precede the odp_sync_stores()
> - * call are globally visible before any store operation that follows it.
> + * This call implements a write memory barrier between threads. It ensures that
> + * all (non-atomic or relaxed atomic) stores (from the calling thread) that
> + * precede this call are globally visible before any store operation that
> + * follows it. It prevents stores moving from before the call to after it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, unlocks, queue enqueues)
> + * include write barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_loads()
>   */
> -static inline void odp_sync_stores(void)
> -{
> -#if defined __x86_64__ || defined __i386__
> -
> - __asm__  __volatile__ ("sfence\n" : : : "memory");
> -
> -#elif defined(__arm__)
> -#if __ARM_ARCH == 6
> - __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
> - : : "r" (0) : "memory");
> -#elif __ARM_ARCH >= 7 || defined __aarch64__
> -
> - __asm__ __volatile__ ("dmb st" : : : "memory");
> -#else
> - __asm__ __volatile__ ("" : : : "memory");
> -#endif
> -
> -#elif defined __OCTEON__
> -
> - __asm__  __volatile__ ("syncws\n" : : : "memory");
> -
> -#else
> - __sync_synchronize();
> -#endif
> -}
> +void odp_sync_stores(void);
> 
> +/**
> + * Synchronize loads
> + *
> + * This call implements a read memory barrier. It ensures that all (non-atomic
> + * or relaxed atomic) loads that precede this call happen before any load
> + * operation that follows it. It prevents loads moving from after the call to
> + * before it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> + * include read barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_stores()
> + */
> +void odp_sync_loads(void);
> 
>  /**
>   * @}
> diff --git a/platform/linux-generic/include/odp/sync.h b/platform/linux-generic/include/odp/sync.h
> index bc73083..09b3939 100644
> --- a/platform/linux-generic/include/odp/sync.h
> +++ b/platform/linux-generic/include/odp/sync.h
> @@ -17,6 +17,24 @@
>  extern "C" {
>  #endif
> 
> +/** @ingroup odp_barrier
> + *  @{
> + */
> +
> +static inline void odp_sync_stores(void)
> +{
> + __atomic_thread_fence(__ATOMIC_RELEASE);
> +}
> +
> +static inline void odp_sync_loads(void)
> +{
> + __atomic_thread_fence(__ATOMIC_ACQUIRE);
> +}
> +
> +/**
> + * @}
> + */
> +
>  #include 
> 
>  #ifdef __cplusplus
> --
> 2.6.2
> 
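For reference, a roughly equivalent C11 formulation of the linux-generic
implementation above would be the following (an illustrative sketch only,
assuming <stdatomic.h> is available; the patch itself uses the GCC builtins):

#include <stdatomic.h>

static inline void odp_sync_stores(void)
{
	/* same ordering as __atomic_thread_fence(__ATOMIC_RELEASE) */
	atomic_thread_fence(memory_order_release);
}

static inline void odp_sync_loads(void)
{
	/* same ordering as __atomic_thread_fence(__ATOMIC_ACQUIRE) */
	atomic_thread_fence(memory_order_acquire);
}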


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-26 Thread Ola Liljedahl
On 12 November 2015 at 11:03, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia.com> wrote:
>> > if (msg == UPDATE_FOO) {
>> > // Make sure that load of “foo” is not moved before schedule()
>> > odp_sync_loads();
>>
>> odp_sync_acquire() would be a better name.
>>
>> >
>> > foo++;
>> > // Make sure that store of “foo” is not moved after queue_enq()
>> >odp_sync_stores();
>>
>> odp_sync_release() would be a better name.
>>
>> I am somewhat OK with odp_sync_acquire/odp_sync_release (should also
>> take vector of memory segments per suggestion below) as long as it is
>> required that producer and consumer use the same ODP calls.
>
> Naming could be acq/rel instead of loads/stores, if those are considered
well-established terms. The problem with the acq/rel names is that this
call does not do the acq or rel operation. Also, the synchronization
functionality is a bit different:
>
> "While an atomic store-release operation prevents all preceding writes
from moving past the *store-release*, an atomic_thread_fence with
memory_order_release ordering prevents *all preceding writes from moving
past all subsequent stores*".
>
> Thread fence synchronizes between *all stores* before and *all stores*
after the fence - thus the name odp_sync_stores() fits better.
>
> Store-release synchronizes stores before the operation with the
operation, but not with *all other* stores following the operation.
Yes, release ordering is associated with a specific store-release
instruction, which is not explicit when using atomic_thread_fence. The
reason why I still thought odp_sync_release is a better name is that it
also ought to prevent loads from before the sync operation from moving down
past the fence. If loads are not ordered, you could be reading data written
by another CPU after the store fence. I don't know how likely this is in
practice; it depends on actual CPU implementations.
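For illustration, the two forms being contrasted here look roughly as follows
(a sketch using GCC __atomic builtins, not code from the patch):

int data, flag;

void fence_form(void)
{
	data = 42;                                    /* plain store              */
	__atomic_thread_fence(__ATOMIC_RELEASE);      /* orders ALL prior stores  */
	__atomic_store_n(&flag, 1, __ATOMIC_RELAXED); /* against ALL later stores */
}

void store_release_form(void)
{
	data = 42;
	__atomic_store_n(&flag, 1, __ATOMIC_RELEASE); /* orders prior accesses only
	                                                 against this one store   */
}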

The GCC inline asm uses a memory clobber that prevents the compiler
from moving loads, but it does not prevent the HW from completing a load
after the store fence, which could happen if you load something (perhaps for
later use) before the fence and miss in the cache.

The definition ought to be clear that loads before the fence are also
ordered (have completed) before all stores after the fence, and the
implementation(s) need to provide this behaviour. ARM DMB without the ST
option orders both loads and stores. And as I wrote, if synchronisation
between producer and consumer is not done through (coherent) shared memory
but through some other mechanism (e.g. a device write which generates an
interrupt), DSB is needed instead of DMB.
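On ARMv7, for example, the barrier variants discussed above differ roughly
like this (an illustrative sketch, not part of the patch; it only compiles
for ARM):

static inline void armv7_barrier_examples(void)
{
	__asm__ __volatile__("dmb st" : : : "memory"); /* orders stores only, as the
	                                                  old odp_sync_stores() did */
	__asm__ __volatile__("dmb"    : : : "memory"); /* orders loads and stores   */
	__asm__ __volatile__("dsb"    : : : "memory"); /* additionally waits for
	                                                  completion, e.g. before a
	                                                  device register write     */
}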

>
>
>
>>
>> > msg = FOO_UPDATED;
>> > queue_enq(msg);
>> > }
>> >
>> >  ev =
>> schedule();
>> >  msg
>> = event_to_message(ev);
>> >
>> >  if
>> (msg == FOO_UPDATED) {
>> >
>> // Make sure that load of “foo” is not moved before schedule()
>> >
>> odp_sync_loads();
>> >
>> printf(“foo is now %\n”, foo);
>> >  }
>> >
>> >
>> > As you can see, it would be a lot of waste if the application must always
>> sync stores and loads explicitly.
>> > I see your point.
>> > I re-read carefully the documentation and maybe we can work around
>> it.
>> > The sync_stores explicitly states that all store operations are
>> globally visible. So cache needs to be flushed.
>>
>> If the Kalray HW isn't cache coherent between CPU's, neither old-style
>> (volatile + barriers) or C11-style shared memory programming will
>> work. Probably some platform specific code is necessary in order to
>> share data between CPU's and to properly synchronise the access.
>> Perhaps the acquire and release operations should take a vector of
>> addresses and sizes, specifying which data is being shared so to
>> minimise cache flushing (in the producer) and invalidation (in the
>> consumer).
>
> Vector / non-coherent versions would be another set of functions. These two
should be just regular memory fences for coherent memory systems.
>
>
>>
>> >
>> > The load part only states that previous memory operations are
>> complete, and cannot be reordered around it.
>> > It does not explicitly state that the cache needs to be up to date.
>> We internally use "read memory barrier" as DINVAL + FENCE,
>> but it is not necessarily the case.
>> >
>> > If my understanding of the function fits with yours, I'm good with
>> these patches. If not, we need to discuss things some more.
>> >
>> > Nicolas
>> >
>> >
>> >
>> > The above highlights how it should *not* be used (== not needed with
>> ready-made ODP sync mechanisms). It’s intended to be used with non-ODP
>> sync mechanisms like the one below. Store and load syncs are usually a
>> pair (sync_stores -> sync_loads).
>>
>> If either or both 

Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-12 Thread Savolainen, Petri (Nokia - FI/Espoo)
> > if (msg == UPDATE_FOO) {
> > // Make sure that load of “foo” is not moved before schedule()
> > odp_sync_loads();
> 
> odp_sync_acquire() would be a better name.
> 
> >
> > foo++;
> > // Make sure that store of “foo” is not moved after queue_enq()
> >odp_sync_stores();
> 
> odp_sync_release() would be a better name.
> 
> I am somewhat OK with odp_sync_acquire/odp_sync_release (should also
> take vector of memory segments per suggestion below) as long as it is
> required that producer and consumer use the same ODP calls.

Naming could be acq/rel instead of loads/stores, if those are considered
well-established terms. The problem with the acq/rel names is that this call does
not do the acq or rel operation. Also, the synchronization functionality is a
bit different:

"While an atomic store-release operation prevents all preceding writes from 
moving past the *store-release*, an atomic_thread_fence with 
memory_order_release ordering prevents *all preceding writes from moving past 
all subsequent stores*".

Thread fence synchronizes between *all stores* before and *all stores* after 
the fence - thus the name odp_sync_stores() fits better.

Store-release synchronizes stores before the operation with the operation, but 
not with *all other* stores following the operation.



> 
> > msg = FOO_UPDATED;
> > queue_enq(msg);
> > }
> >
> >  ev =
> schedule();
> >  msg
> = event_to_message(ev);
> >
> >  if
> (msg == FOO_UPDATED) {
> >
> // Make sure that load of “foo” is not moved before schedule()
> >
> odp_sync_loads();
> >
> printf(“foo is now %\n”, foo);
> >  }
> >
> >
> > As you can see, it would be a lot of waste if the application must always
> sync stores and loads explicitly.
> > I see your point.
> > I re-read carefully the documentation and maybe we can work around
> it.
> > The sync_stores explicitly states that all store operations are
> globally visible. So cache needs to be flushed.
> 
> If the Kalray HW isn't cache coherent between CPU's, neither old-style
> (volatile + barriers) or C11-style shared memory programming will
> work. Probably some platform specific code is necessary in order to
> share data between CPU's and to properly synchronise the access.
> Perhaps the acquire and release operations should take a vector of
> addresses and sizes, specifying which data is being shared so to
> minimise cache flushing (in the producer) and invalidation (in the
> consumer).

Vector / non-coherent versions would be another set of functions. These two should
be just regular memory fences for coherent memory systems.


> 
> >
> > The load part only states that previous memory operations are
> complete, and cannot be reordered around it.
> > It does not explicitly state that the cache needs to be up to date.
> We internally use "read memory barrier" as DINVAL + FENCE
> but it is not necessarily the case.
> >
> > If my understanding of the function fits with yours, I'm good with
> these patches. If not, we need to discuss things some more.
> >
> > Nicolas
> >
> >
> >
> > The above highlights how it should *not* be used (== not needed with
> ready-made ODP sync mechanisms). It’s intended to be used with non-ODP
> sync mechanisms like the one below. Store and load syncs are usually a
> pair (sync_stores -> sync_loads).
> 
> If either or both of the producer and consumer cannot use some common
> ODP mechanism, the mechanism they actually use must be well defined
> (e.g. must use DMB or DSB on ARM) so that both sides pair properly. So
> for one side to use e.g. odp_sync_stores() (without knowing the actual
> implementation) and the other side use whatever mechanism (not
> odp_sync_loads) is a potential cause of problems. It is difficult for
> me to see a generic implementation (in ODP) that can always pair with
> whatever undefined mechanism is used by the peer. I think the code in
> an ODP application that synchronises with a non-ODP producer or
> consumer must use the specific mechanism agreed on and ODP cannot
> provide any generic function that will work regardless.

It should be enough that we specify this to function the same way as C11
atomic_thread_fence. That spec is what the other side needs in order to understand
what to do. This is what my patch defines in practice.

Did you see the actual text in the patch? It's clipped out in this thread.
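For example, a non-ODP consumer could pair with an ODP producer that calls
odp_sync_stores() like this (my own illustration, not from the patch; assumes
a coherent shared "flag" and GCC __atomic builtins on the non-ODP side):

extern int foo;   /* shared data written by the ODP producer                  */
extern int flag;  /* shared flag, set to 1 by the producer after odp_sync_stores() */

void non_odp_consumer(void)
{
	while (__atomic_load_n(&flag, __ATOMIC_RELAXED) == 0)
		;                                /* wait for the flag            */

	__atomic_thread_fence(__ATOMIC_ACQUIRE); /* pairs with the release fence
	                                            inside odp_sync_stores()    */

	/* loads of foo after this point observe the producer's stores */
}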


> 
> >
> >
> > -Petri
> >
> >
> > // data and flag in shared memory
> > volatile int flag = 0;
> 
> int flag = 0; is enough
> 
> >
> > int foo = 0;
> >
> > Thread A Thread B
> >
> > foo = 0xbeef;
> >
> > // Make sure that store
> > // of "foo" is visible before
> > // store of "flag"
> > odp_sync_stores();
> > flag = 1;
> 
> Must be implemented as:
> 

Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-11 Thread Ola Liljedahl
On 9 November 2015 at 10:39, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia.com> wrote:
>
>
>
> From: EXT Nicolas Morey-Chaisemartin [mailto:nmo...@kalray.eu]
> Sent: Thursday, November 05, 2015 6:23 PM
> To: Savolainen, Petri (Nokia - FI/Espoo); Bill Fischofer
> Cc: LNG ODP Mailman List
> Subject: Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add 
> odp_sync_loads
>
>
> On 11/05/2015 04:54 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> The majority of platforms are cache coherent; for a non-coherent platform you'd
> need to define some additional API calls. This is documenting the current
> assumption that the application does this …

Depending on how you signal the consumer of the shared data, you might
need different barriers. E.g. if signalling the consumer by writing to
some device register which generates an interrupt to the consumer, on
ARM you need DSB; DMB is normally not enough. I doubt
__atomic_thread_fence generates a DSB instruction. So defining a
generic mechanism which can be used regardless of the signalling
method would require the strongest barrier. But this might still not
be enough for the actual scenarios which these calls are intended for.

If there is no synchronisation between producer and consumer, you have
a data race and the program behaviour is undefined. The only
work-around is to make *all* shared data atomically read and written,
variable by variable. But then you couldn't atomically read a set of
variables (if that is important).
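That per-variable alternative would look something like this (a sketch using
C11 _Atomic; note that foo and bar cannot be read as one atomic snapshot):

#include <stdatomic.h>

_Atomic int foo;
_Atomic int bar;

void producer(void)
{
	/* each store is individually atomic (sequentially consistent by default) */
	atomic_store(&foo, 1);
	atomic_store(&bar, 2);
}

void consumer(void)
{
	/* each load is race-free, but foo and bar may come from different updates */
	int f = atomic_load(&foo);
	int b = atomic_load(&bar);
	(void)f;
	(void)b;
}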

>
> // data in shared memory
> int foo;
>
> Thread A  Thread B
>
> ev = schedule();
> msg = event_to_message(ev);
>
> if (msg == UPDATE_FOO) {
> foo++;
> msg = FOO_UPDATED;
> queue_enq(msg);

A release operation with release ordering: all stores before queue_enq() are
observable when the msg event can be observed (e.g. by dequeueing it
explicitly or from odp_schedule). It also guarantees that loads before
queue_enq() have completed, so that a load that misses in the cache
will not read data that was updated after queue_enq().

> }
>
>  ev = 
> schedule();

An acquire operation: it guarantees that the following loads (and stores, but
they are seldom speculated) will not be issued before the event has
been seen and properly "acquired".
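As a rough sketch (hypothetical helper names, not actual ODP internals), an
implementation can embed these fences in enqueue/dequeue so that the
application never needs the explicit sync calls:

#include <stdint.h>

typedef uint64_t odp_event_t;           /* simplified stand-in for the real type */

extern void ring_write(odp_event_t ev); /* hypothetical ring helpers             */
extern odp_event_t ring_read(void);

void queue_enq_sketch(odp_event_t ev)
{
	/* all stores by this thread become visible before the event does */
	__atomic_thread_fence(__ATOMIC_RELEASE);
	ring_write(ev);
}

odp_event_t schedule_sketch(void)
{
	odp_event_t ev = ring_read();
	/* later loads observe everything stored before the matching enqueue */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	return ev;
}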

>
>  msg = 
> event_to_message(ev);
>
>  if (msg == 
> FOO_UPDATED) {
>  
> printf(“foo is now %\n”, foo);
>  }
>
>
> … instead of this …
>
> Thread A  Thread B
>
> ev = schedule();
> msg = event_to_message(ev);
>
> if (msg == UPDATE_FOO) {
> // Make sure that load of “foo” is not moved before schedule()
> odp_sync_loads();

odp_sync_acquire() would be a better name.

>
> foo++;
> // Make sure that store of “foo” is not moved after queue_enq()
>odp_sync_stores();

odp_sync_release() would be a better name.

I am somewhat OK with odp_sync_acquire/odp_sync_release (should also
take vector of memory segments per suggestion below) as long as it is
required that producer and consumer use the same ODP calls.

> msg = FOO_UPDATED;
> queue_enq(msg);
> }
>
>  ev = 
> schedule();
>  msg = 
> event_to_message(ev);
>
>  if (msg == 
> FOO_UPDATED) {
>  // Make 
> sure that load of “foo” is not moved before schedule()
>  
> odp_sync_loads();
>  
> printf(“foo is now %\n”, foo);
>  }
>
>
> As you can see, it would be a lot of waste if the application must always sync
> stores and loads explicitly.
> I see your point.
> I re-read carefully the documentation and maybe we can work around it.
> The sync_stores explicitly states that all store operations are globally 
> visible. So cache needs to be flushed.

If the Kalray HW isn't cache coherent between CPU's, neither old-style
(volatile + barriers) or C11-style shared memory programming will
work. Probably some platform specific code is necessary in order to
share data between CPU's and to properly synchronise t

Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-09 Thread Savolainen, Petri (Nokia - FI/Espoo)


From: EXT Nicolas Morey-Chaisemartin [mailto:nmo...@kalray.eu] 
Sent: Thursday, November 05, 2015 6:23 PM
To: Savolainen, Petri (Nokia - FI/Espoo); Bill Fischofer
Cc: LNG ODP Mailman List
Subject: Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add 
odp_sync_loads


On 11/05/2015 04:54 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
The majority of platforms are cache coherent; for a non-coherent platform you'd
need to define some additional API calls. This is documenting the current
assumption that the application does this …
 
// data in shared memory
int foo;
 
Thread A  Thread B
 
ev = schedule();
msg = event_to_message(ev);
 
if (msg == UPDATE_FOO) {
    foo++;
    msg = FOO_UPDATED;
    queue_enq(msg);
}
 
 ev = 
schedule();
 msg = 
event_to_message(ev);
 
 if (msg == 
FOO_UPDATED) {
     
printf(“foo is now %\n”, foo);
 }
 
 
… instead of this …
 
Thread A  Thread B
 
ev = schedule();
msg = event_to_message(ev);
 
if (msg == UPDATE_FOO) {
    // Make sure that load of “foo” is not moved before schedule()
    odp_sync_loads();
    foo++;
    // Make sure that store of “foo” is not moved after queue_enq()
   odp_sync_stores();
    msg = FOO_UPDATED;
    queue_enq(msg);
}
 
 ev = 
schedule();
 msg = 
event_to_message(ev);
 
 if (msg == 
FOO_UPDATED) {
 // Make 
sure that load of “foo” is not moved before schedule()
 
odp_sync_loads();
     
printf(“foo is now %\n”, foo);
 }
 
 
As you can see, it would be a lot of waste if the application must always sync
stores and loads explicitly.
I see your point.
I re-read the documentation carefully and maybe we can work around it.
The sync_stores spec explicitly states that all store operations are globally
visible, so the cache needs to be flushed.
The load part only states that previous memory operations are complete and
cannot be reordered around it.
It does not explicitly state that the cache needs to be up to date. We
internally use "read memory barrier" as DINVAL + FENCE, but it is not
necessarily the case.

If my understanding of the function fits with yours, I'm good with these
patches. If not, we need to discuss things some more.

Nicolas



The above highlights how it should *not* be used (== not needed with ready-made 
ODP sync mechanisms). It’s intended to be used with non-ODP sync mechanisms 
like the one below. Store and load syncs are usually a pair (sync_stores ->
sync_loads).

-Petri


// data and flag in shared memory
volatile int flag = 0;
int foo = 0;

Thread A Thread B

foo = 0xbeef;

// Make sure that store
// of "foo" is visible before 
// store of "flag"
odp_sync_stores();
flag = 1;

 while (flag == 0)
 spin_and_wait();

 // Make sure that load of
 // "foo" does not happen before
 // we see the flag changing
 odp_sync_loads();

 printf(“foo is now %\n”, foo);
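The same pattern with a plain, non-volatile flag accessed through relaxed
atomics (which is what the updated spec's "non-atomic or relaxed atomic"
wording allows) could look like this — an illustrative sketch using GCC
builtins and the usual <odp.h> include:

#include <stdio.h>
#include <odp.h>   /* odp_sync_stores(), odp_sync_loads() */

/* data and flag in coherent shared memory */
int foo = 0;
int flag = 0;      /* plain int, accessed with relaxed atomics */

void thread_a(void)
{
	foo = 0xbeef;
	odp_sync_stores();                            /* order store of foo before flag */
	__atomic_store_n(&flag, 1, __ATOMIC_RELAXED);
}

void thread_b(void)
{
	while (__atomic_load_n(&flag, __ATOMIC_RELAXED) == 0)
		;                                     /* spin and wait                  */
	odp_sync_loads();                             /* order load of foo after flag   */
	printf("foo is now %d\n", foo);
}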





 


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Savolainen, Petri (Nokia - FI/Espoo)
Ping.


> -Original Message-
> From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of
> EXT Petri Savolainen
> Sent: Monday, October 26, 2015 6:08 PM
> To: lng-odp@lists.linaro.org
> Subject: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add
> odp_sync_loads
> 
> Updated odp_sync_stores() specification and added odp_sync_loads
> to pair it. Used GCC __atomic_thread_fence to implement both of
> those.
> 
> Signed-off-by: Petri Savolainen 
> ---
>  include/odp/api/sync.h                    | 53 ++++++++++++++--------------
>  platform/linux-generic/include/odp/sync.h | 18 +++
>  2 files changed, 43 insertions(+), 28 deletions(-)
> 
> diff --git a/include/odp/api/sync.h b/include/odp/api/sync.h
> index 6477e74..8456f68 100644
> --- a/include/odp/api/sync.h
> +++ b/include/odp/api/sync.h
> @@ -23,37 +23,34 @@ extern "C" {
>   */
> 
>  /**
> - * Synchronise stores
> + * Synchronize stores
>   *
> - * Ensures that all CPU store operations that precede the
> odp_sync_stores()
> - * call are globally visible before any store operation that follows
> it.
> + * This call implements a write memory barrier between threads. It
> ensures that
> + * all (non-atomic or relaxed atomic) stores (from the calling thread)
> that
> + * precede this call are globally visible before any store operation
> that
> + * follows it. It prevents stores moving from before the call to after
> it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, unlocks, queue
> enqueues)
> + * include write barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_loads()
>   */
> -static inline void odp_sync_stores(void)
> -{
> -#if defined __x86_64__ || defined __i386__
> -
> - __asm__  __volatile__ ("sfence\n" : : : "memory");
> -
> -#elif defined(__arm__)
> -#if __ARM_ARCH == 6
> - __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
> - : : "r" (0) : "memory");
> -#elif __ARM_ARCH >= 7 || defined __aarch64__
> -
> - __asm__ __volatile__ ("dmb st" : : : "memory");
> -#else
> - __asm__ __volatile__ ("" : : : "memory");
> -#endif
> -
> -#elif defined __OCTEON__
> -
> - __asm__  __volatile__ ("syncws\n" : : : "memory");
> -
> -#else
> - __sync_synchronize();
> -#endif
> -}
> +void odp_sync_stores(void);
> 
> +/**
> + * Synchronize loads
> + *
> + * This call implements a read memory barrier. It ensures that all
> (non-atomic
> + * or relaxed atomic) loads that precede this call happen before any
> load
> + * operation that follows it. It prevents loads moving from after the
> call to
> + * before it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, locks, queue
> dequeues)
> + * include read barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_stores()
> + */
> +void odp_sync_loads(void);
> 
>  /**
>   * @}
> diff --git a/platform/linux-generic/include/odp/sync.h
> b/platform/linux-generic/include/odp/sync.h
> index bc73083..09b3939 100644
> --- a/platform/linux-generic/include/odp/sync.h
> +++ b/platform/linux-generic/include/odp/sync.h
> @@ -17,6 +17,24 @@
>  extern "C" {
>  #endif
> 
> +/** @ingroup odp_barrier
> + *  @{
> + */
> +
> +static inline void odp_sync_stores(void)
> +{
> + __atomic_thread_fence(__ATOMIC_RELEASE);
> +}
> +
> +static inline void odp_sync_loads(void)
> +{
> + __atomic_thread_fence(__ATOMIC_ACQUIRE);
> +}
> +
> +/**
> + * @}
> + */
> +
>  #include 
> 
>  #ifdef __cplusplus
> --
> 2.6.2
> 


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Savolainen, Petri (Nokia - FI/Espoo)
The majority of platforms are cache coherent; for a non-coherent platform you'd
need to define some additional API calls. This is documenting the current
assumption that the application does this …

// data in shared memory
int foo;

Thread A  Thread B

ev = schedule();
msg = event_to_message(ev);

if (msg == UPDATE_FOO) {
foo++;
msg = FOO_UPDATED;
queue_enq(msg);
}

 ev = 
schedule();
 msg = 
event_to_message(ev);

 if (msg == 
FOO_UPDATED) {
 
printf(“foo is now %\n”, foo);
 }


… instead of this …

Thread A  Thread B

ev = schedule();
msg = event_to_message(ev);

if (msg == UPDATE_FOO) {
// Make sure that load of “foo” is not moved before schedule()
odp_sync_loads();
foo++;
// Make sure that store of “foo” is not moved after queue_enq()
   odp_sync_stores();
msg = FOO_UPDATED;
queue_enq(msg);
}

 ev = 
schedule();
 msg = 
event_to_message(ev);

 if (msg == 
FOO_UPDATED) {
 // Make 
sure that load of “foo” is not moved before schedule()
 
odp_sync_loads();
 
printf(“foo is now %\n”, foo);
 }


As you can see, it would be a lot of waste if the application must always sync
stores and loads explicitly.

-Petri


From: EXT Nicolas Morey-Chaisemartin [mailto:nmo...@kalray.eu]
Sent: Thursday, November 05, 2015 5:23 PM
To: Bill Fischofer
Cc: Savolainen, Petri (Nokia - FI/Espoo); LNG ODP Mailman List
Subject: Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add 
odp_sync_loads


On 11/05/2015 04:21 PM, Bill Fischofer wrote:
Perhaps the note could be changed to read:

ODP synchronization mechanisms (e.g., barrier, locks, queue dequeues) are 
guaranteed to be coherent, so this call is not needed when using those.

No need to specify how coherency is achieved on a given platform for these ops.

This sounds much better.



On Thu, Nov 5, 2015 at 9:05 AM, Nicolas Morey-Chaisemartin 
<nmo...@kalray.eu> wrote:
Sorry for the late feedback. I missed this one.

On 10/26/2015 05:07 PM, Petri Savolainen wrote:
> Updated odp_sync_stores() specification and added odp_sync_loads
> to pair it. Used GCC __atomic_thread_fence to implement both of
> those.
>
> Signed-off-by: Petri Savolainen 
> <petri.savolai...@nokia.com>
> ---
...
> +/**
> + * Synchronize loads
> + *
> + * This call implements a read memory barrier. It ensures that all 
> (non-atomic
> + * or relaxed atomic) loads that precede this call happen before any load
> + * operation that follows it. It prevents loads moving from after the call to
> + * before it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> + * include read barrier, so this call is not needed when using those.
> + *
The API here is fine. What bothers me is the footnote about all ODP sync 
mechanisms calling this.
Because the Kalray architecture does not have cache coherency, a read memory
barrier is *very* expensive.
We have to invalidate the complete cache (cheap) but then refill it later.

What our current implementation does is simply invalidate the appropriate 
structure as needed.
Barriers and locks only cause a write memory barrier. Queue dequeue also ensures
that the dequeued struct (packet, timer, buffer, etc.) is up to date with the
other threads and devices.

___
lng-odp mailing list
lng-odp@lists.linaro.org<mailto:lng-odp@lists.linaro.org>
https://lists.linaro.org/mailman/listinfo/lng-odp


___
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Nicolas Morey-Chaisemartin


On 11/05/2015 04:54 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
> The majority of platforms are cache coherent; for a non-coherent platform you'd
> need to define some additional API calls. This is documenting the current
> assumption that the application does this …
>
>  
>
> // data in shared memory
>
> int foo;
>
>  
>
> Thread A  Thread B
>
>  
>
> ev = schedule();
>
> msg = event_to_message(ev);
>
>  
>
> if (msg == UPDATE_FOO) {
>
> foo++;
>
> msg = FOO_UPDATED;
>
> queue_enq(msg);
>
> }
>
>  
>
>  ev = 
> schedule();
>
>  msg = 
> event_to_message(ev);
>
>  
>
>  if (msg == 
> FOO_UPDATED) {
>
>  
> printf(“foo is now %\n”, foo);
>
>  }
>
>  
>
>  
>
> … instead of this …
>
>  
>
> Thread A  Thread B
>
>  
>
> ev = schedule();
>
> msg = event_to_message(ev);
>
>  
>
> if (msg == UPDATE_FOO) {
>
> *// Make sure that load of “foo” is not moved before schedule()*
>
> *odp_sync_loads();*
>
> foo++;
>
> *// Make sure that store of “foo” is not moved after queue_enq()*
>
>*odp_sync_stores();*
>
> msg = FOO_UPDATED;
>
> queue_enq(msg);
>
> }
>
>  
>
>  ev = 
> schedule();
>
>  msg = 
> event_to_message(ev);
>
>  
>
>  if (msg == 
> FOO_UPDATED) {
>
> * // Make 
> sure that load of “foo” is not moved before schedule()*
>
> * 
> odp_sync_loads();*
>
>  
> printf(“foo is now %\n”, foo);
>
>  }
>
>  
>
>  
>
> As you can see, it would be a lot of waste if the application must always sync
> stores and loads explicitly.
>
I see your point.
I re-read the documentation carefully and maybe we can work around it.
The sync_stores spec explicitly states that all store operations are globally
visible, so the cache needs to be flushed.
The load part only states that previous memory operations are complete and
cannot be reordered around it.
It does not explicitly state that the cache needs to be up to date. We
internally use "read memory barrier" as DINVAL + FENCE, but it is not
necessarily the case.

If my understanding of the function fits with yours, I'm good with these
patches. If not, we need to discuss things some more.

Nicolas

 



Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Nicolas Morey-Chaisemartin
Sorry for the late feedback. I missed this one.

On 10/26/2015 05:07 PM, Petri Savolainen wrote:
> Updated odp_sync_stores() specification and added odp_sync_loads
> to pair it. Used GCC __atomic_thread_fence to implement both of
> those.
>
> Signed-off-by: Petri Savolainen 
> ---
...
> +/**
> + * Synchronize loads
> + *
> + * This call implements a read memory barrier. It ensures that all 
> (non-atomic
> + * or relaxed atomic) loads that precede this call happen before any load
> + * operation that follows it. It prevents loads moving from after the call to
> + * before it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> + * include read barrier, so this call is not needed when using those.
> + *
The API here is fine. What bothers me is the footnote about all ODP sync 
mechanisms calling this.
Because the Kalray architecture does not have cache coherency, a read memory
barrier is *very* expensive.
We have to invalidate the complete cache (cheap) but then refill it later.

What our current implementation does is simply invalidate the appropriate 
structure as needed.
Barriers and locks only cause a write memory barrier. Queue dequeue also ensures
that the dequeued struct (packet, timer, buffer, etc.) is up to date with the
other threads and devices.
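A rough sketch of that per-structure approach (hypothetical platform and
internal helpers, not a real API):

#include <stddef.h>
#include <odp.h>   /* odp_event_t, odp_queue_t, ODP_EVENT_INVALID */

/* hypothetical helpers for the sketch */
extern odp_event_t deq_internal(odp_queue_t queue);
extern void *event_addr(odp_event_t ev);
extern size_t event_size(odp_event_t ev);
extern void plat_dcache_inval(void *addr, size_t len);

odp_event_t odp_queue_deq_sketch(odp_queue_t queue)
{
	odp_event_t ev = deq_internal(queue);

	if (ev != ODP_EVENT_INVALID) {
		/* refresh only the dequeued structure instead of issuing a full
		 * read barrier / whole-cache invalidate */
		plat_dcache_inval(event_addr(ev), event_size(ev));
	}

	return ev;
}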



Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Bill Fischofer
Perhaps the note could be changed to read:

ODP synchronization mechanisms (e.g., barrier, locks, queue dequeues) are
guaranteed to be coherent, so this call is not needed when using those.

No need to specify how coherency is achieved on a given platform for these
ops.

On Thu, Nov 5, 2015 at 9:05 AM, Nicolas Morey-Chaisemartin  wrote:

> Sorry for the late feedback. I missed this one
>
> On 10/26/2015 05:07 PM, Petri Savolainen wrote:
> > Updated odp_sync_stores() specification and added odp_sync_loads
> > to pair it. Used GCC __atomic_thread_fence to implement both of
> > those.
> >
> > Signed-off-by: Petri Savolainen 
> > ---
> ...
> > +/**
> > + * Synchronize loads
> > + *
> > + * This call implements a read memory barrier. It ensures that all
> (non-atomic
> > + * or relaxed atomic) loads that precede this call happen before any
> load
> > + * operation that follows it. It prevents loads moving from after the
> call to
> > + * before it.
> > + *
> > + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> > + * include read barrier, so this call is not needed when using those.
> > + *
> The API here is fine. What bothers me is the footnote about all ODP sync
> mechanisms calling this.
> Because Kalray architecture does not have cache coherency, a read memory
> barrier is *very* expensive.
> We have to invalidate the complete cache (cheap) but then refill it later.
>
> What our current implementation does is simply invalidate the appropriate
> structure as needed.
> Barriers and locks only cause a write memory barrier. Queue dequeue also
> ensures that the dequeued struct (packet, timer, buffer, etc.) is up to
> date with the other threads and devices.
>


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Nicolas Morey-Chaisemartin


On 11/05/2015 04:21 PM, Bill Fischofer wrote:
> Perhaps the note could be changed to read:
>
> ODP synchronization mechanisms (e.g., barrier, locks, queue dequeues) are 
> guaranteed to be coherent, so this call is not needed when using those.
>
> No need to specify how coherency is achieved on a given platform for these 
> ops.

This sounds much better.

>
> On Thu, Nov 5, 2015 at 9:05 AM, Nicolas Morey-Chaisemartin wrote:
>
> Sorry for the late feedback. I missed this one
>
> On 10/26/2015 05:07 PM, Petri Savolainen wrote:
> > Updated odp_sync_stores() specification and added odp_sync_loads
> > to pair it. Used GCC __atomic_thread_fence to implement both of
> > those.
> >
> > Signed-off-by: Petri Savolainen  >
> > ---
> ...
> > +/**
> > + * Synchronize loads
> > + *
> > + * This call implements a read memory barrier. It ensures that all 
> (non-atomic
> > + * or relaxed atomic) loads that precede this call happen before any 
> load
> > + * operation that follows it. It prevents loads moving from after the 
> call to
> > + * before it.
> > + *
> > + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> > + * include read barrier, so this call is not needed when using those.
> > + *
> The API here is fine. What bothers me is the footnote about all ODP sync 
> mechanisms calling this.
> Because Kalray architecture does not have cache coherency, a read memory 
> barrier is *very* expensive.
> We have to invalidate the complete cache (cheap) but then refill it later.
>
> What our current implementation does is simply invalidate the appropriate 
> structure as needed.
> Barriers and locks only cause a write memory barrier. Queue dequeue also
> ensures that the dequeued struct (packet, timer, buffer, etc.) is up to date
> with the other threads and devices.
>


Re: [lng-odp] [API-NEXT PATCH] api: sync: update spec and add odp_sync_loads

2015-11-05 Thread Bill Fischofer
On Mon, Oct 26, 2015 at 11:07 AM, Petri Savolainen <
petri.savolai...@nokia.com> wrote:

> Updated odp_sync_stores() specification and added odp_sync_loads
> to pair it. Used GCC __atomic_thread_fence to implement both of
> those.
>
> Signed-off-by: Petri Savolainen 
>

Reviewed-by: Bill Fischofer 


> ---
>  include/odp/api/sync.h                    | 53 ++++++++++++++--------------
>  platform/linux-generic/include/odp/sync.h | 18 +++
>  2 files changed, 43 insertions(+), 28 deletions(-)
>
> diff --git a/include/odp/api/sync.h b/include/odp/api/sync.h
> index 6477e74..8456f68 100644
> --- a/include/odp/api/sync.h
> +++ b/include/odp/api/sync.h
> @@ -23,37 +23,34 @@ extern "C" {
>   */
>
>  /**
> - * Synchronise stores
> + * Synchronize stores
>   *
> - * Ensures that all CPU store operations that precede the
> odp_sync_stores()
> - * call are globally visible before any store operation that follows it.
> + * This call implements a write memory barrier between threads. It
> ensures that
> + * all (non-atomic or relaxed atomic) stores (from the calling thread)
> that
> + * precede this call are globally visible before any store operation that
> + * follows it. It prevents stores moving from before the call to after it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, unlocks, queue enqueues)
> + * include write barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_loads()
>   */
> -static inline void odp_sync_stores(void)
> -{
> -#if defined __x86_64__ || defined __i386__
> -
> -   __asm__  __volatile__ ("sfence\n" : : : "memory");
> -
> -#elif defined(__arm__)
> -#if __ARM_ARCH == 6
> -   __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
> -   : : "r" (0) : "memory");
> -#elif __ARM_ARCH >= 7 || defined __aarch64__
> -
> -   __asm__ __volatile__ ("dmb st" : : : "memory");
> -#else
> -   __asm__ __volatile__ ("" : : : "memory");
> -#endif
> -
> -#elif defined __OCTEON__
> -
> -   __asm__  __volatile__ ("syncws\n" : : : "memory");
> -
> -#else
> -   __sync_synchronize();
> -#endif
> -}
> +void odp_sync_stores(void);
>
> +/**
> + * Synchronize loads
> + *
> + * This call implements a read memory barrier. It ensures that all
> (non-atomic
> + * or relaxed atomic) loads that precede this call happen before any load
> + * operation that follows it. It prevents loads moving from after the
> call to
> + * before it.
> + *
> + * ODP synchronization mechanisms (e.g. barrier, locks, queue dequeues)
> + * include read barrier, so this call is not needed when using those.
> + *
> + * @see odp_sync_stores()
> + */
> +void odp_sync_loads(void);
>
>  /**
>   * @}
> diff --git a/platform/linux-generic/include/odp/sync.h
> b/platform/linux-generic/include/odp/sync.h
> index bc73083..09b3939 100644
> --- a/platform/linux-generic/include/odp/sync.h
> +++ b/platform/linux-generic/include/odp/sync.h
> @@ -17,6 +17,24 @@
>  extern "C" {
>  #endif
>
> +/** @ingroup odp_barrier
> + *  @{
> + */
> +
> +static inline void odp_sync_stores(void)
> +{
> +   __atomic_thread_fence(__ATOMIC_RELEASE);
> +}
> +
> +static inline void odp_sync_loads(void)
> +{
> +   __atomic_thread_fence(__ATOMIC_ACQUIRE);
> +}
> +
> +/**
> + * @}
> + */
> +
>  #include 
>
>  #ifdef __cplusplus
> --
> 2.6.2
>