Re: [OMPI devel] problem running jobs on ompi-master
I applied the patch manually and it did in fact seem to resolve the issue, thanks! I must have done the git clone right before this patch was committed two days back, so I just missed it (redoing it right now as well).

Thanks
Edgar

On 12/26/2014 9:06 AM, Gilles Gouaillardet wrote:

Edgar,

First, make sure your master includes
https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219

If this is not enough, try with --mca coll ^ml

Hope this helps

Gilles.

Edgar Gabriel wrote:

I have some problems running jobs with ompi-master on one of our clusters (after doing a major software update). Here are scenarios that work and don't work.

1. Everything still seems to work with 1.8.x series without any issues
2. With master, I can run without any issues single node, multi-process jobs
3. With master, I can run without any issues multi node jobs, as long as there is only a single MPI process per node
4. I cannot run multi-node jobs with multiple processes per node, ompi hangs for that scenario. This is independent of whether I enforce using openib or tcp, and just for the sake of simplicity I attach the output for tcp (there is another openib parameter issue that still lingers, but I will report that later).
Here is the output that I receive if setting btl_base_verbose

---snip--
gabriel@crill:~> salloc -N 2 -n 4
gabriel@crill:~> mpirun --mca btl tcp,self --mca btl_base_verbose 30 -np 4 ./hello_world
[crill-004:18161] mca: base: components_register: registering btl components
[crill-004:18161] mca: base: components_register: found loaded component self
[crill-004:18161] mca: base: components_register: component self register function successful
[crill-004:18161] mca: base: components_register: found loaded component tcp
[crill-004:18161] mca: base: components_register: component tcp register function successful
[crill-004:18161] mca: base: components_open: opening btl components
[crill-004:18161] mca: base: components_open: found loaded component self
[crill-004:18161] mca: base: components_open: component self open function successful
[crill-004:18161] mca: base: components_open: found loaded component tcp
[crill-004:18161] mca: base: components_open: component tcp open function successful
[crill-004:18161] select: initializing btl component self
[crill-004:18161] select: init of component self returned success
[crill-004:18161] select: initializing btl component tcp
[crill-004:18161] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[crill-004:18161] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-004:18161] select: init of component tcp returned success
[crill-003:18962] mca: base: components_register: registering btl components
[crill-003:18962] mca: base: components_register: found loaded component self
[crill-003:18962] mca: base: components_register: component self register function successful
[crill-003:18962] mca: base: components_register: found loaded component tcp
[crill-003:18962] mca: base: components_register: component tcp register function successful
[crill-003:18962] mca: base: components_open: opening btl components
[crill-003:18962] mca: base: components_open: found loaded component self
[crill-003:18962] mca: base: components_open: component self open function successful
[crill-003:18962] mca: base: components_open: found loaded component tcp
[crill-003:18962] mca: base: components_open: component tcp open function successful
[crill-003:18963] mca: base: components_register: registering btl components
[crill-003:18963] mca: base: components_register: found loaded component self
[crill-003:18963] mca: base: components_register: component self register function successful
[crill-003:18963] mca: base: components_register: found loaded component tcp
[crill-003:18963] mca: base: components_register: component tcp register function successful
[crill-003:18963] mca: base: components_open: opening btl components
[crill-003:18963] mca: base: components_open: found loaded component self
[crill-003:18963] mca: base: components_open: component self open function successful
[crill-003:18963] mca: base: components_open: found loaded component tcp
[crill-003:18963] mca: base: components_open: component tcp open function successful
[crill-003:18964] mca: base: components_register: registering btl components
[crill-003:18964] mca: base: components_register: found loaded component self
[crill-003:18964] mca: base: components_register: component self register function successful
[crill-003:18964] mca: base: components_register: found loaded component tcp
[crill-003:18964] mca: base: components_register: component tcp register function successful
[crill-003:18964] mca: base: components_open: opening btl components
[crill-003:18964] mca: base: components_open: found loaded component self
[crill-003:18964] mca: base: components_open: component self open function successful
[crill-003:18964] mca: base: components_open:
Re: [OMPI devel] problem running jobs on ompi-master
Edgar,

First, make sure your master includes
https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219

If this is not enough, try with --mca coll ^ml

Hope this helps

Gilles.

Edgar Gabriel wrote:
>I have some problems running jobs with ompi-master on one of our
>clusters (after doing a major software update). Here are scenarios that
>work and don't work.
>
>1. Everything still seems to work with 1.8.x series without any issues
>2. With master, I can run without any issues single node, multi-process jobs
>3. With master, I can run without any issues multi node jobs, as long as
>there is only a single MPI process per node
>
>4. I cannot run multi-node jobs with multiple processes per node, ompi
>hangs for that scenario. This is independent of whether I enforce using
>openib or tcp, and just for the sake of simplicity I attach the output
>for tcp (there is another openib parameter issue that still lingers,
>but I will report that later).
>
>Here is the output that I receive if setting btl_base_verbose
>
>---snip--
>gabriel@crill:~> salloc -N 2 -n 4
>gabriel@crill:~> mpirun --mca btl tcp,self --mca btl_base_verbose 30 -np
> 4 ./hello_world
>[crill-004:18161] mca: base: components_register: registering btl components
>[crill-004:18161] mca: base: components_register: found loaded component
>self
>[crill-004:18161] mca: base: components_register: component self
>register function successful
>[crill-004:18161] mca: base: components_register: found loaded component tcp
>[crill-004:18161] mca: base: components_register: component tcp register
>function successful
>[crill-004:18161] mca: base: components_open: opening btl components
>[crill-004:18161] mca: base: components_open: found loaded component self
>[crill-004:18161] mca: base: components_open: component self open
>function successful
>[crill-004:18161] mca: base: components_open: found loaded component tcp
>[crill-004:18161] mca: base: components_open: component tcp open
>function successful
>[crill-004:18161] select: initializing btl component self
>[crill-004:18161] select: init of component self returned success
>[crill-004:18161] select: initializing btl component tcp
>[crill-004:18161] btl: tcp: Searching for exclude address+prefix:
>127.0.0.1 / 8
>[crill-004:18161] btl: tcp: Found match: 127.0.0.1 (lo)
>[crill-004:18161] select: init of component tcp returned success
>[crill-003:18962] mca: base: components_register: registering btl components
>[crill-003:18962] mca: base: components_register: found loaded component
>self
>[crill-003:18962] mca: base: components_register: component self
>register function successful
>[crill-003:18962] mca: base: components_register: found loaded component tcp
>[crill-003:18962] mca: base: components_register: component tcp register
>function successful
>[crill-003:18962] mca: base: components_open: opening btl components
>[crill-003:18962] mca: base: components_open: found loaded component self
>[crill-003:18962] mca: base: components_open: component self open
>function successful
>[crill-003:18962] mca: base: components_open: found loaded component tcp
>[crill-003:18962] mca: base: components_open: component tcp open
>function successful
>[crill-003:18963] mca: base: components_register: registering btl components
>[crill-003:18963] mca: base: components_register: found loaded component
>self
>[crill-003:18963] mca: base: components_register: component self
>register function successful
>[crill-003:18963] mca: base: components_register: found loaded component tcp
>[crill-003:18963] mca: base: components_register: component tcp register
>function successful
>[crill-003:18963] mca: base: components_open: opening btl components
>[crill-003:18963] mca: base: components_open: found loaded component self
>[crill-003:18963] mca: base: components_open: component self open
>function successful
>[crill-003:18963] mca: base: components_open: found loaded component tcp
>[crill-003:18963] mca: base: components_open: component tcp open
>function successful
>[crill-003:18964] mca: base: components_register: registering btl components
>[crill-003:18964] mca: base: components_register: found loaded component
>self
>[crill-003:18964] mca: base: components_register: component self
>register function successful
>[crill-003:18964] mca: base: components_register: found loaded component tcp
>[crill-003:18964] mca: base: components_register: component tcp register
>function successful
>[crill-003:18964] mca: base: components_open: opening btl components
>[crill-003:18964] mca: base: components_open: found loaded component self
>[crill-003:18964] mca: base: components_open: component self open
>function successful
>[crill-003:18964] mca: base: components_open: found loaded component tcp
>[crill-003:18964] mca: base: components_open: component tcp open
>function successful
>[crill-003:18962] select: initializing btl component self
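
The first suggestion in this thread (make sure master includes commit 05af80b3025dbb95bdd4280087450791291d7219) can be checked from the shell. What follows is only a sketch and was not posted by anyone in the thread: has_commit is a hypothetical helper name, and it assumes the Open MPI source tree is a git clone with full history (a shallow clone may report a false negative). It relies on git merge-base --is-ancestor, which exits with status 0 when the given commit is reachable from HEAD.

```shell
# Hypothetical helper (not part of git or Open MPI): succeeds if commit $1
# is an ancestor of HEAD in the git checkout at $2 (default: current directory).
# Any failure, including an unknown commit id, is reported as "not included".
has_commit() {
    git -C "${2:-.}" merge-base --is-ancestor "$1" HEAD 2>/dev/null
}
```

Called with the commit id above and the path to the local clone, the helper should succeed on a master that already contains the fix. If the hang still shows up after that, the second suggestion amounts to rerunning the failing command with the ml collective component excluded, for example: mpirun --mca coll ^ml --mca btl tcp,self -np 4 ./hello_world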