Hi Maxim,

Nitpicking: Added ISB as ngx_cpu_pause() for aarch64.

Yes, we can make that change.

Could you please clarify what do you mean by "a bug"? An empty ngx_cpu_pause() is certainly not a bug, it's just a lack of a more optimal solution for the particular architecture.

Agree, not a bug. I'm on a team that focuses on performance, so sub-optimal performance is a "bug" to us, but this is not a functional bug. Replacing the word "bug" with "sub-optimal" is more appropriate.
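
For context, here is roughly where ngx_cpu_pause() sits in that lock check code. This is a simplified sketch of the spin loop in src/core/ngx_spinlock.c (trimmed for illustration, not the exact source). With the empty aarch64 definition, the inner delay loop compiles to nothing and the CPU re-checks the lock word back-to-back:

void
ngx_spinlock(ngx_atomic_t *lock, ngx_atomic_int_t value, ngx_uint_t spin)
{
    ngx_uint_t  i, n;

    for ( ;; ) {

        if (*lock == 0 && ngx_atomic_cmp_set(lock, 0, value)) {
            return;
        }

        if (ngx_ncpu > 1) {

            for (n = 1; n < spin; n <<= 1) {

                /* back off: spin longer between each re-check */
                for (i = 0; i < n; i++) {
                    ngx_cpu_pause();    /* currently empty on aarch64 */
                }

                if (*lock == 0 && ngx_atomic_cmp_set(lock, 0, value)) {
                    return;
                }
            }
        }

        ngx_sched_yield();
    }
}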

When a delay like the PAUSE that is there for x86 is added, there is a 2-5% increase in the number of requests/sec Arm CPUs can achieve under high load.

Yes, the test setup is very similar to what's described here (note: the particular instances in the blog aren't what I tested):

https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3

Also, we tested on NGINX Open Source (without JWT), not NGINX Plus as in the blog.


We tested for the max RPS of a 512B file that can be pulled through a reverse proxy. We use a large number of upstreams (8, to be exact), each with a high core count (16+ CPUs). The load generator node is also large (64 CPUs). This ensures the bottleneck is at the reverse proxy. We test small files because large files make the test network-bound, while smaller files make it CPU-bound.

I tested both ISB and YIELD (more on YIELD further below).

Results of these tests are something like this:

ISB uplift from no delay across 3 runs:

- 2 CPU: 1.03 - 1.22%

- 4 CPU: 2.70 - 10.75% (I'm treating the 10.75% as an outlier; dropping it puts the high end of the range at ~5%, which is why I say ~2-5% in the change log. I don't want to overstate the perf improvement.)

- 8 CPU: 1.10 - 2.33%


YIELD uplift from no delay across 3 runs:

- 2 CPU: 0 - 0.51%

- 4 CPU: 0 - 1.41%

- 8 CPU: 1.05 - 2.31%

ISB produced the highest uplift, particularly for a 4 CPU reverse proxy, which is why I submitted with ISB. Still, we see benefit with YIELD too.

Variation comes from tearing down cloud infrastructure and redeploying; results can vary depending on where you land in the data center. I'm intentionally leaving out exactly which HW/cloud I used for this data, but I can say we see similar uplift across a variety of Arm systems.


With respect to ISB vs. YIELD, and the other projects that alternatively use ISB:

Yes, as documented, YIELD is the conceptually right instruction to use here. However, in practice it's a NOP, which produces a shorter delay than ISB; that's why ISB appears to work better. Also, YIELD is intended for SMT systems, which are uncommon on Arm, so it's going to be a NOP on any current Arm system you'll find in the cloud. That said, YIELD produces an uplift in RPS as well, because even a small delay is better than no delay.

I'm 100% good with using YIELD if you want to stay true to what is currently documented; I was going for max perf uplift, which is also why some other projects use ISB. Whether it's YIELD or ISB, a revisit with WFET would be in order in the more distant future (sketch below). For today, either YIELD or ISB works better than nothing (as it currently is). If YIELD is more acceptable, then I can do YIELD.
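
To make the YIELD option concrete, it would be the same one-line define as in the patch below, just with a different mnemonic:

#define ngx_cpu_pause()             __asm__ ("yield" ::: "memory")

And for the record, a WFET-based pause could eventually look something like the sketch below. This is hypothetical and untested: it assumes FEAT_WFxT hardware and an assembler that accepts the ARMv8.7 mnemonic (e.g. -march=armv8.7-a), and the tick budget would need tuning. WFET takes an absolute deadline compared against the generic timer, so the sequence is: read the counter, add a small budget, wait.

#include <stdint.h>

/* Hypothetical WFET pause -- illustration only, not a patch */
static inline void
cpu_pause_wfet(void)
{
    uint64_t  deadline;

    /* read the virtual counter and pick a deadline a few ticks out */
    __asm__ __volatile__ ("mrs %0, cntvct_el0" : "=r" (deadline));
    deadline += 32;    /* budget in counter ticks; would need tuning */

    /* wait for an event or the deadline, whichever comes first */
    __asm__ __volatile__ ("wfet %0" : : "r" (deadline) : "memory");
}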

Projects that previously used YIELD and switched to ISB after noting a performance improvement (I don't think these projects shared data anywhere; we just have to take their word for it):

MongoDB:
https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61

MySQL:
https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108

Jemalloc:
https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436


Could you please clarify reasons for the "memory" clobber here?

Putting in the memory clobber for ISB is redundant because ISB is a barrier itself, but it's probably the GCC-appropriate thing to do. I also like it as a hint for someone not familiar with ISB. ISB pauses the frontend (fetch-decode) to allow the CPU backend (execute-retire) to finish whatever operations are in flight. It's possible that some of those operations are writes to memory; hence we should tell the compiler "this instruction may update memory".
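
To illustrate what the clobber buys at the compiler level, here is a minimal sketch (not nginx code; note that in nginx the lock word is a volatile ngx_atomic_t, so it is re-read on every iteration regardless, and the clobber mainly makes the intent explicit):

/* illustration only: how the "memory" clobber changes what GCC
 * may assume about a plain (non-volatile) flag */

#define pause_no_clobber()  __asm__ __volatile__ ("isb")
#define pause_clobber()     __asm__ __volatile__ ("isb" ::: "memory")

void
wait_for_flag(int *flag)
{
    /* with pause_no_clobber(), GCC may load *flag once, keep it in a
     * register, and spin forever; the "memory" clobber tells the
     * compiler the asm may read or write any memory, forcing a fresh
     * load of *flag on every iteration */
    while (*flag == 0) {
        pause_clobber();
    }
}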


__________________________________________________________________

Hello!

Thank you for the patch.
Some comments and questions below.

On Wed, Dec 06, 2023 at 10:06:57AM -0600, julio.sua...@foss.arm.com wrote:

# HG changeset patch
# User Julio Suarez <julio.sua...@arm.com>
# Date 1701877879 21600
#      Wed Dec 06 09:51:19 2023 -0600
# Node ID 53d289b8676fc678ca90e02ece174300a3631f79
# Parent  f366007dd23a6ce8e8427c1b3042781b618a2ade
Added asm ISB as asm pause for ngx_cpu_pause() for aarch64

Nitpicking:

Added ISB as ngx_cpu_pause() for aarch64.

For aarch64 (A.K.A. Arm64), ngx_cpu_pause() evaluates to empty by the
GCC preprocessor. This results in an empty for loop in the lock
check code in ngx_rwlock.c/ngx_spinlock.c (a bug).
Thus, on Arm CPUs, there is no wait between checks of a lock.

Could you please clarify what do you mean by "a bug"?  An empty
ngx_cpu_pause() is certainly not a bug, it's just a lack of a more
optimal solution for the particular architecture.

When a delay like the PAUSE that is there for x86 is added,
there is a 2-5% increase in the number of requests/sec Arm CPUs
can achieve under high load.

Could you please provide some details on how did you get these
numbers?

Currently, options for a spin lock delay element
are YIELD and ISB. YIELD is assembled into a NOP for most Arm CPUs.
YIELD is for Arm implementations with SMT.
Most Arm implementations are single core - single thread (non-SMT).
Thus, this commit uses ISB (Instruction Barrier Sync) since it
will provide a slightly longer delay than a NOP.

Other projects that implement spin locks have used ISB as
the delay element which has yielded performance gains as well.

Looking through various open source projects, I'm seeing the
following uses on arm64 CPUs:

FreeBSD, sys/arm64/include/cpu.h:

#define cpu_spinwait()          __asm __volatile("yield" ::: "memory")

FreeBSD, lib/libthr/arch/aarch64/include/pthread_md.h:

#define CPU_SPINWAIT

Linux, arch/arm64/include/asm/vdso/processor.h:

static inline void cpu_relax(void)
{
        asm volatile("yield" ::: "memory");
}

The only popular project I was able to find which uses ISB is
Rust:

https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807

From the commit log it looks like it mostly focuses on the delay
introduced by the instruction, ignoring other effects. In
particular, YIELD is expected to be more friendly for various
emulation environments, see Linux commit here:

https://github.com/torvalds/linux/commit/1baa82f48030f38d1895301f1ec93acbcb3d15db

Overall, the YIELD instruction seems to be better suited and
specifically designed for the task in question, as per the
documentation
(https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/YIELD--YIELD-):

: YIELD is a hint instruction. Software with a multithreading
: capability can use a YIELD instruction to indicate to the PE that
: it is performing a task, for example a spin-lock, that could be
: swapped out to improve overall system performance. The PE can use
: this hint to suspend and resume multiple software threads if it
: supports the capability.

Please also note that ngx_cpu_pause() is only used on
multiprocessor systems: the code checks if (ngx_ncpu > 1) before
using it.

Last, ISB is not an architecturally defined solution
for short spin lock delays.
A candidate for a short delay solution is the WFET
instruction (Wait For Event with Timeout).
However, there aren't any Arm implementations in the market that
support this instruction yet. When that occurs,
WFET can be tested as a replacement for ISB. Until then,
ISB will do.

diff -r f366007dd23a -r 53d289b8676f src/os/unix/ngx_atomic.h
--- a/src/os/unix/ngx_atomic.h  Tue Nov 14 15:26:02 2023 +0400
+++ b/src/os/unix/ngx_atomic.h  Wed Dec 06 09:51:19 2023 -0600
@@ -66,6 +66,8 @@
 #if ( __i386__ || __i386 || __amd64__ || __amd64 )
 #define ngx_cpu_pause()             __asm__ ("pause")
+#elif ( __aarch64__ )
+#define ngx_cpu_pause()             __asm__ ("isb" ::: "memory")

Could you please clarify reasons for the "memory" clobber here?

 #else
 #define ngx_cpu_pause()
 #endif

--
Maxim Dounin
http://mdounin.ru/
_______________________________________________
nginx-devel mailing list
nginx-devel@nginx.org
https://mailman.nginx.org/mailman/listinfo/nginx-devel
