Hello, I am having trouble getting MPI tasks to work under Slurm. Jobs run fine when I specify the number of nodes with only a few tasks per node (e.g. 3 nodes and 2 or 3 tasks each), or when I specify only the total number of tasks (e.g. -n 40). The trouble comes when I divide them up as something like 3 nodes with 5 or more tasks each; then the job always hangs.
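For reference, a minimal batch script that reproduces the hanging layout described above (the job name and the `./mpi_hello` binary are placeholders, not my actual files):

```shell
#!/bin/bash
#SBATCH --job-name=mpi-test        # placeholder job name
#SBATCH --nodes=3                  # 3 nodes...
#SBATCH --ntasks-per-node=5       # ...with 5 tasks each is the layout that hangs
#SBATCH --time=00:05:00            # timeout so the hung job eventually ends

# ./mpi_hello stands in for the actual MPI binary
srun --mpi=pmix ./mpi_hello
```

With `--ntasks-per-node=2` (or with `#SBATCH -n 40` instead of the explicit node/task split) the same script completes normally.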
I am using PMIx 3.1.4, Open MPI 4.0.2, Slurm 19.05.5, UCX 1.7, hwloc 1.11.9, and libevent 2.1.8. I have attached the config logs from Open MPI and PMIx, as well as the output from running a batch file with srun. For the latest run of the batch file I set a timeout so that it would end. I believe I am using the InfiniBand network (assuming it is selected without me specifying it directly), but I can't say so with confidence; I am very new to this. I believe the mistake is on my MPI side, but I can't for the life of me figure it out. Any help is appreciated. I attached a tar file with a few logs, and a gist of the Open MPI config log: https://gist.github.com/levidd/48296bd56de7ff6f0995cd9d68fab112
debug_info.tar.bz2