I am having problems running jobs with ompi-master on one of our clusters after a major software update. Here are the scenarios that work and the one that does not.

1. Everything still seems to work with the 1.8.x series without any issues.
2. With master, I can run single-node, multi-process jobs without any issues.
3. With master, I can run multi-node jobs without any issues, as long as there is only a single MPI process per node.

4. I cannot run multi-node jobs with multiple processes per node; ompi hangs in that scenario. This happens regardless of whether I enforce openib or tcp, so for the sake of simplicity I attach the output for tcp (there is another openib parameter issue that still lingers, but I will report that later).

Here is the output that I receive when setting btl_base_verbose to 30:

---------------snip------------------
gabriel@crill:~> salloc -N 2 -n 4
gabriel@crill:~> mpirun --mca btl tcp,self --mca btl_base_verbose 30 -np 4 ./hello_world
[crill-004:18161] mca: base: components_register: registering btl components
[crill-004:18161] mca: base: components_register: found loaded component self
[crill-004:18161] mca: base: components_register: component self register function successful
[crill-004:18161] mca: base: components_register: found loaded component tcp
[crill-004:18161] mca: base: components_register: component tcp register function successful
[crill-004:18161] mca: base: components_open: opening btl components
[crill-004:18161] mca: base: components_open: found loaded component self
[crill-004:18161] mca: base: components_open: component self open function successful
[crill-004:18161] mca: base: components_open: found loaded component tcp
[crill-004:18161] mca: base: components_open: component tcp open function successful
[crill-004:18161] select: initializing btl component self
[crill-004:18161] select: init of component self returned success
[crill-004:18161] select: initializing btl component tcp
[crill-004:18161] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[crill-004:18161] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-004:18161] select: init of component tcp returned success
[crill-003:18962] mca: base: components_register: registering btl components
[crill-003:18962] mca: base: components_register: found loaded component self
[crill-003:18962] mca: base: components_register: component self register function successful
[crill-003:18962] mca: base: components_register: found loaded component tcp
[crill-003:18962] mca: base: components_register: component tcp register function successful
[crill-003:18962] mca: base: components_open: opening btl components
[crill-003:18962] mca: base: components_open: found loaded component self
[crill-003:18962] mca: base: components_open: component self open function successful
[crill-003:18962] mca: base: components_open: found loaded component tcp
[crill-003:18962] mca: base: components_open: component tcp open function successful
[crill-003:18963] mca: base: components_register: registering btl components
[crill-003:18963] mca: base: components_register: found loaded component self
[crill-003:18963] mca: base: components_register: component self register function successful
[crill-003:18963] mca: base: components_register: found loaded component tcp
[crill-003:18963] mca: base: components_register: component tcp register function successful
[crill-003:18963] mca: base: components_open: opening btl components
[crill-003:18963] mca: base: components_open: found loaded component self
[crill-003:18963] mca: base: components_open: component self open function successful
[crill-003:18963] mca: base: components_open: found loaded component tcp
[crill-003:18963] mca: base: components_open: component tcp open function successful
[crill-003:18964] mca: base: components_register: registering btl components
[crill-003:18964] mca: base: components_register: found loaded component self
[crill-003:18964] mca: base: components_register: component self register function successful
[crill-003:18964] mca: base: components_register: found loaded component tcp
[crill-003:18964] mca: base: components_register: component tcp register function successful
[crill-003:18964] mca: base: components_open: opening btl components
[crill-003:18964] mca: base: components_open: found loaded component self
[crill-003:18964] mca: base: components_open: component self open function successful
[crill-003:18964] mca: base: components_open: found loaded component tcp
[crill-003:18964] mca: base: components_open: component tcp open function successful
[crill-003:18962] select: initializing btl component self
[crill-003:18962] select: init of component self returned success
[crill-003:18962] select: initializing btl component tcp
[crill-003:18962] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[crill-003:18962] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-003:18962] select: init of component tcp returned success
[crill-003:18963] select: initializing btl component self
[crill-003:18963] select: init of component self returned success
[crill-003:18963] select: initializing btl component tcp
[crill-003:18963] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[crill-003:18963] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-003:18963] select: init of component tcp returned success
[crill-003:18964] select: initializing btl component self
[crill-003:18964] select: init of component self returned success
[crill-003:18964] select: initializing btl component tcp
[crill-003:18964] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[crill-003:18964] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-003:18964] select: init of component tcp returned success
[crill-003:18964] mca: bml: Using self btl to [[3417,1],2] on node crill-003
[crill-003:18964] mca: bml: Using tcp btl to [[3417,1],0] on node crill-003
[crill-003:18964] mca: bml: Using tcp btl to [[3417,1],1] on node crill-003
[crill-003:18964] mca: bml: Using tcp btl to [[3417,1],3] on node crill-004
[crill-004:18161] mca: bml: Using self btl to [[3417,1],3] on node crill-004
[crill-004:18161] mca: bml: Using tcp btl to [[3417,1],0] on node crill-003
[crill-004:18161] mca: bml: Using tcp btl to [[3417,1],1] on node crill-003
[crill-004:18161] mca: bml: Using tcp btl to [[3417,1],2] on node crill-003
^C

------------
and then it just hangs.

Does anybody have an idea or suggestion about what to try or look for?

Thanks
Edgar

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
