Hi, I'm reporting a performance regression (16% in message rate, 3% in latency) when using PSM that occurred between OMPI v1.6.5 and v1.8.1. I would guess it affects other networks too, but I haven't tested. The problem stems from the --enable-smp-locks and --enable-opal-multi-threads options.
--enable-smp-locks defaults to enabled and, on x86, causes a 'lock' prefix to be prepended to the ASM instructions used by the atomic primitives. Disabling it removes the 'lock' prefix. In OMPI 1.6.5, --enable-opal-multi-threads defaulted to disabled. When enabled, OPAL would be compiled with multithreading support, which included compiling in calls to the atomic primitives. Those atomic primitives, in turn, potentially use a lock prefix (controlled by --enable-smp-locks).

SVN r29891 on the trunk changed the above: --enable-opal-multi-threads was removed, and the CPP macros (#if OPAL_ENABLE_MULTI_THREADS) controlling various calls to the atomic primitives were removed, effectively changing the default behavior to multithreading ON for OPAL. This change was then carried to the v1.7 branch in r29944 (Fixes #3983).

We can use --disable-smp-locks to make the performance regression go away for the builds we ship, but we'd very much prefer if performance was good 'out of the box' for people who grab an OMPI tarball and use it with PSM. My question is, what's the best way to do that? It seems obvious to just make --disable-smp-locks the default, but I presume the change was made on purpose, so I'm looking for community feedback.

Thanks,
Andrew