Hello,

On Thu, Jul 16, 2020 at 6:58 AM Phil Yang <phil.y...@arm.com> wrote:
>
> Add information about possible optimizations using C11 atomic built-ins.

We are missing a review on this doc update.
Thanks.

-- 
David Marchand

>
> Signed-off-by: Phil Yang <phil.y...@arm.com>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
> ---
>  doc/guides/prog_guide/writing_efficient_code.rst | 59 +++++++++++++++++++++++-
>  1 file changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
> index 849f63e..53a1ca1 100644
> --- a/doc/guides/prog_guide/writing_efficient_code.rst
> +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
>  Locks and Atomic Operations
>  ---------------------------
>
> -Atomic operations imply a lock prefix before the instruction,
> +This section describes some key considerations when using locks and atomic
> +operations in the DPDK environment.
> +
> +Locks
> +~~~~~
> +
> +On x86, atomic operations imply a lock prefix before the instruction,
>  causing the processor's LOCK# signal to be asserted during execution of the following instruction.
>  This has a big impact on performance in a multicore environment.
>
> @@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-lcore variables.
>  Also, some locking techniques are more efficient than others.
>  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
>
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK generic rte_atomic operations are implemented by __sync built-ins. These
> +__sync built-ins result in full barriers on aarch64, which are unnecessary
> +in many use cases. They can be replaced by __atomic built-ins that conform to
> +the C11 memory model and provide finer memory order control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might improve
> +performance for aarch64 machines.
> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory operations
> +does not matter. For example, the packet statistics counters need to be
> +incremented atomically but do not need any particular memory ordering.
> +So, RELAXED memory ordering is sufficient.
> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring memory
> +ordering in the other direction.
> +
> +For example, the memory operations before the spinlock lock are allowed to
> +move to the critical section, but the memory operations in the critical section
> +are not allowed to move above the lock. In this case, the full memory barrier
> +in the compare-and-swap operation can be replaced with ACQUIRE memory order.
> +On the other hand, the memory operations after the spinlock unlock are allowed
> +to move to the critical section, but the memory operations in the critical
> +section are not allowed to move below the unlock. So the full barrier in the
> +store operation can use RELEASE memory order.
> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the reader,
> +can be written with RELAXED memory order. However, the guard variable should
> +be written with RELEASE memory order. This ensures that the store to guard
> +variable is observable only after the store to payload is observable.
> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer communicated,
> +can be read with RELAXED memory order. This ensures that, if the store to
> +guard variable is observable, the store to payload is also observable.
> +
>  Coding Considerations
>  ---------------------
>
> --
> 2.7.4
>
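
For readers of the three cases quoted above (atomicity, one-way barrier, reader-writer concurrency), here is a minimal sketch of how each might look with the GCC/Clang __atomic built-ins. It is an illustration under assumptions, not code from the patch or from DPDK: every identifier (rx_pkts, demo_lock, writer_publish and so on) is hypothetical, and the spinlock is deliberately simplistic (no back-off, no pause hint).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; none come from the patch or DPDK. */

/* Atomicity: a statistics counter needs atomic updates but no ordering. */
static uint64_t rx_pkts;

static inline void stats_inc(void)
{
	__atomic_fetch_add(&rx_pkts, 1, __ATOMIC_RELAXED);
}

static inline uint64_t stats_read(void)
{
	return __atomic_load_n(&rx_pkts, __ATOMIC_RELAXED);
}

/* One-way barrier: a toy spinlock, 0 = unlocked, 1 = locked. */
static int lock_word;

static inline void demo_lock(void)
{
	int expected = 0;

	/* ACQUIRE on success: critical-section accesses cannot move above. */
	while (!__atomic_compare_exchange_n(&lock_word, &expected, 1, false,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		expected = 0; /* a failed CAS overwrote 'expected' */
}

static inline void demo_unlock(void)
{
	/* RELEASE: critical-section accesses cannot move below the store. */
	__atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
}

/* Reader-writer concurrency: a payload published through a guard flag. */
static uint32_t payload;
static int guard;

static inline void writer_publish(uint32_t value)
{
	__atomic_store_n(&payload, value, __ATOMIC_RELAXED);
	/* RELEASE: the payload store is observable before the guard store. */
	__atomic_store_n(&guard, 1, __ATOMIC_RELEASE);
}

static inline bool reader_try_consume(uint32_t *out)
{
	/* ACQUIRE: if the guard is seen set, the payload is visible too. */
	if (__atomic_load_n(&guard, __ATOMIC_ACQUIRE) == 0)
		return false;
	*out = __atomic_load_n(&payload, __ATOMIC_RELAXED);
	return true;
}

/* Single-threaded usage example; it exercises the calls, not the ordering. */
static void single_thread_smoke_test(void)
{
	uint32_t v = 0;

	assert(!reader_try_consume(&v)); /* nothing published yet */
	writer_publish(42);
	assert(reader_try_consume(&v) && v == 42);

	demo_lock();
	demo_unlock();

	stats_inc();
	stats_inc();
	assert(stats_read() == 2);
}
```

The smoke test only checks that the calls behave sensibly on one thread. The ordering guarantees matter when writer and reader run on different lcores, which a single-threaded check cannot demonstrate.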