A full memory barrier would actually work for us, but the cost is unacceptable 
most of the time.

To get full performance out of our cores, our ODP users will need to be aware 
of the non-cache coherency and deal with it manually.
However, what I'm looking for (as a first step at least) is a clean way, 
through the ODP API, to get all the standard sunny-day applications and tests 
running reasonably well.

For that, we can't simply map the odp_mb_* primitives to full memory barriers, 
as that would hurt advanced customers too.
What we discussed with Christophe this morning is somewhere in the middle.

By providing flush/refresh primitives on SHMEM, we can have an implementation 
that provides some coherency on the required shared parts.
I think they could also be used on specific implementations/architectures to 
prefetch data into the cache (when it is coherent).

Nicolas

On 01/14/2016 03:25 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
> In general, this issue is about how to support non-cache-coherent systems. 
> Additional calls specified for non-coherent systems could be added (to shm 
> and potentially elsewhere) but should be optional for applications, since 
> it's quite tricky to (efficiently) ensure cache coherency in SW. Also, 
> non-coherent systems are already in the minority and will be even more so in 
> the future.
>
>  
>
> E.g. a single memory barrier / sync (pair) could do the trick in a coherent 
> system, whereas a non-coherent system would need multiple flush/refresh calls 
> (with different pointers). Each barrier (flush/refresh) would hurt 
> performance (compiler optimizer and OoO HW), but more importantly would be 
> painful to maintain (add one new pointer somewhere in your data structure, 
> forget to flush that one address -> stale data -> crash).
>
>  
>
> -Petri
>
>  
>
>  
>
> *From:*EXT Christophe Milard [mailto:christophe.mil...@linaro.org]
> *Sent:* Thursday, January 14, 2016 12:16 PM
> *To:* Hongbo Zhang; Mike Holmes; Petri Savolainen; Anders Roxell; LNG ODP 
> Mailman List; Nicolas Morey-Chaisemartin
> *Subject:* ODP 226: need for shmem->refresh()?
>
>  
>
> This is regarding ODP 226 (Tests assuming pthreads).
>
> Kalray is facing a problem actually larger than this thread vs process 
> problem. The basic question is: when N processors share the same memory 
> (shmem object), is it acceptable to force a cache update of the whole 
> shmem'd area for the N-1 other processors as soon as one single processor 
> updates any byte in the area? Typically, one processor will write something 
> in the shmem, and another will want to access it. Only the latter really 
> needs to invalidate its cache.
>
> Kalray typically does a cache invalidation on small ODP objects (e.g. 
> atomics), but the cost of doing a cache invalidate everywhere on all 
> processors is too high for shared memory areas: updates on "shmemed" areas 
> are not automatically visible from other CPUs.
>
> They would need something like a "refresh" method on the shmem object, which 
> the processor that really needs the data would call. 
> <shmem_object>.refresh() would invalidate the local cache (of that single 
> core), and possibly initiate a prefetch.
>
> For symmetry, a <shmem_object>.flush() may also be needed, to flush pending 
> writes on a shmem object. Kalray has write-through caching at this stage, so 
> the need is not as pressing. Possibly these methods would map to memory 
> barriers on some implementations, but memory barriers are not local to shmem 
> objects, and invalidating the whole CPU cache for a single shmem object is 
> costly.
>
> It would make sense to have refresh() and flush() act on the whole shmem 
> object. But Kalray pointed out that the price of allocating very small shmem 
> fragments is high (shmem areas have names, handles, and page granularity); 
> allocating a shmem for a single atomic int is not efficient. They would 
> therefore like to see these methods act on sub-areas of the shared memory. 
> (Or do we see a case for shmem_malloc() here?)
>
> Hope this helps understanding the problem...
>
>  
>
> Christophe.                   
>

_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp