If P=1 and Q=1, you're setting up a 1x1 process grid, which only needs a
single MPI process. Something tells me you have 4 independent HPL jobs
running rather than one job using 4 threads: I believe that without -np,
mpirun launches one copy of the program per available slot, which on a
4-core node gives you 4 separate xhpl ranks. You would need a 2x2 grid
to run HPL across all 4 cores as a pure-MPI job. For HPL, P * Q = the
number of MPI processes being used (one per core in a pure-MPI run).
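For example, a pure-MPI run on a single 4-core node would look
something like this (the HPL.dat excerpt and mpirun line are
illustrative, not tested on your setup):

   1            # of process grids (P x Q)
   2            Ps
   2            Qs

   mpirun -np 4 xhpl

With one rank per core, OMP_NUM_THREADS should be left at 1 so the
OpenMP-threaded OpenBLAS doesn't oversubscribe the cores.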
Prentice
On 8/3/20 4:33 AM, John Duffy via users wrote:
Hi
I’m experimenting with hybrid OpenMPI/OpenMP Linpack benchmarks on my
small cluster, and I’m a bit confused as to how to invoke mpirun.
I have compiled/linked HPL-2.3 with OpenMPI and libopenblas-openmp
using the GCC -fopenmp option on Ubuntu 20.04 64-bit.
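For reference, the relevant Make.<arch> settings are roughly as follows
(abridged and illustrative rather than an exact copy):

   CC        = mpicc
   CCFLAGS   = $(HPL_DEFS) -O3 -fopenmp
   LINKER    = mpicc
   LINKFLAGS = $(CCFLAGS)
   LAlib     = -lopenblas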
With P=1 and Q=1 in HPL.dat, if I use…
mpirun -x OMP_NUM_THREADS=4 xhpl
top reports...
top - 08:03:59 up 1 day, 0 min, 1 user, load average: 2.25, 1.23, 0.88
Tasks: 138 total, 2 running, 136 sleeping, 0 stopped, 0 zombie
%Cpu(s): 77.1 us, 22.2 sy, 0.0 ni, 0.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3793.3 total, 434.0 free, 2814.1 used, 545.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 919.9 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5787 john      20   0 2959408   2.6g   8128 R 354.0 69.1   2:10.43 xhpl
 5789 john      20   0  263352   9960   7440 S  14.2  0.3   0:07.42 xhpl
 5788 john      20   0  263352   9844   7320 S  13.9  0.3   0:07.19 xhpl
 5790 john      20   0  263356   9896   7376 S  13.6  0.3   0:07.17 xhpl
… which seems reasonable, but I don’t understand why there are 4 xhpl
processes.
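A quick way to confirm how many ranks mpirun is actually starting is to
launch a trivial command the same way (a sketch; Open MPI exports
OMPI_COMM_WORLD_RANK/SIZE into each rank's environment):

   mpirun sh -c 'echo rank $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE'

Four lines of output would mean the four xhpl entries above are four
separate MPI ranks, not threads of a single rank.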
In anticipation of adding more nodes, if I use…
mpirun --host node1 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl
top reports...
top - 07:56:27 up 23:52, 1 user, load average: 1.00, 0.98, 0.68
Tasks: 133 total, 2 running, 131 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.1 us, 0.0 sy, 0.0 ni, 74.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3793.3 total, 454.2 free, 2794.5 used, 544.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 939.9 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5770 john      20   0 2868700   2.5g   7668 R  99.7 68.7   5:20.37 xhpl
… a single xhpl process (as expected), but overall CPU utilisation is
only 25%: the xhpl process sits at ~100% of one core while the other 3
cores are idle. It would appear OpenBLAS is not utilising the 4 cores
as expected.
If I then scale it to 2 nodes, with P=1 and Q=2 in HPL.dat...
mpirun --host node1,node2 --map-by ppr:1:node -x OMP_NUM_THREADS=4 xhpl
… similarly, I get a single process on each node, with only 25% CPU
utilisation.
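One thing I have not ruled out is process binding. As I understand it,
Open MPI binds each rank to a core or socket by default, which would
confine all 4 OpenMP threads to the same core; a test along those lines
(untested sketch) would be:

   mpirun --host node1 --map-by ppr:1:node --bind-to none \
          --report-bindings -x OMP_NUM_THREADS=4 xhpl

Here --bind-to none lets the OpenMP threads spread over all the cores,
and --report-bindings prints each rank's binding so the effect is
visible.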
Any advice/suggestions on how to invoke mpirun in a hybrid
OpenMPI/OpenMP setup would be appreciated.
Kind regards
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov