Hello,

I am having trouble getting MPI jobs to run under Slurm. Everything works when 
I specify the number of nodes and only a few tasks per node (e.g. 3 nodes with 
2 or 3 tasks each), or when I specify only the total number of tasks (e.g. 
-n 40). The trouble starts when I ask for something like 3 nodes with 5 or 
more tasks per node; then the job always hangs.
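
To make that concrete, the batch script is essentially the following sketch 
(the binary name ./hello_mpi is a placeholder; the real script and its output 
are in the attached tar):

  #!/bin/bash
  #SBATCH --job-name=mpi-test
  #SBATCH --time=00:05:00

  # This combination works:
  ##SBATCH --nodes=3
  ##SBATCH --ntasks-per-node=2

  # So does specifying only the total:
  ##SBATCH --ntasks=40

  # This combination hangs:
  #SBATCH --nodes=3
  #SBATCH --ntasks-per-node=5

  # --mpi=pmix assumes Slurm was built with PMIx support
  srun --mpi=pmix ./hello_mpi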

I am using:
PMIx 3.1.4
Open MPI 4.0.2
Slurm 19.05.5
UCX 1.7
hwloc 1.11.9
libevent 2.1.8
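
(For reference, these versions can be confirmed with the stock query tools:)

  ompi_info --version   # Open MPI 4.0.2
  srun --version        # slurm 19.05.5
  ucx_info -v           # UCX 1.7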

I have attached the config logs from Open MPI and PMIx, as well as the output 
from running the batch file with srun. For the latest run I wrapped the job in 
a timeout so that the hung step would actually end.
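
The timeout is just a wrapper around the srun line, along these lines (the 
60-second value is arbitrary):

  # kill the step after 60s so the hang doesn't eat the whole allocation
  timeout 60 srun --mpi=pmix ./hello_mpi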

I think the jobs are using the InfiniBand network (assuming Open MPI picks it 
up without my specifying it directly), but I can't say so with confidence; I 
am very new to this.
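
Is something like the following the right way to check? (I pieced these 
together from documentation, so corrections welcome; ./hello_mpi is the same 
placeholder binary as above.)

  # Is an IB HCA present and active? (from infiniband-diags / libibverbs-utils)
  ibv_devinfo | grep -e hca_id -e state

  # Which transports does UCX see?
  ucx_info -d | grep -e Transport -e Device

  # Make Open MPI report which PML it selects (should be ucx for InfiniBand)
  OMPI_MCA_pml_base_verbose=10 srun --mpi=pmix -n 2 ./hello_mpi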

I believe the mistake is on my MPI side, but I can't for the life of me figure 
it out. Any help is appreciated.

I have attached a tar file with a few logs, and a gist of the Open MPI config 
log is here: https://gist.github.com/levidd/48296bd56de7ff6f0995cd9d68fab112

Attachment: debug_info.tar.bz2
