Re: [OMPI devel] problem running jobs on ompi-master

2014-12-26 Thread Edgar Gabriel
I applied the patch manually and it does in fact seem to resolve the issue, 
thanks! I must have done the git clone just before this patch was 
committed two days ago, so I just missed it (I am redoing the clone right now as well).


Thanks
Edgar

On 12/26/2014 9:06 AM, Gilles Gouaillardet wrote:

Edgar,

First, make sure your master includes 
https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219


If this is not enough, try with --mca coll ^ml

Hope this helps

Gilles.

Edgar Gabriel wrote:

I have some problems running jobs with ompi-master on one of our
clusters (after doing a major software update). Here are scenarios that
work and don't work.

1. Everything still seems to work with the 1.8.x series without any issues
2. With master, I can run single-node, multi-process jobs without any issues
3. With master, I can run multi-node jobs without any issues, as long as
there is only a single MPI process per node

4. I cannot run multi-node jobs with multiple processes per node; ompi
hangs in that scenario. This is independent of whether I enforce using
openib or tcp, and just for the sake of simplicity I attach the output
for tcp (there is another openib parameter issue that still lingers,
but I will report that later).

Here is the output that I receive when setting btl_base_verbose:

---snip--
gabriel@crill:~> salloc -N 2 -n 4
gabriel@crill:~> mpirun --mca btl tcp,self --mca btl_base_verbose 30 -np 4 ./hello_world
[crill-004:18161] mca: base: components_register: registering btl components
[crill-004:18161] mca: base: components_register: found loaded component
self
[crill-004:18161] mca: base: components_register: component self
register function successful
[crill-004:18161] mca: base: components_register: found loaded component tcp
[crill-004:18161] mca: base: components_register: component tcp register
function successful
[crill-004:18161] mca: base: components_open: opening btl components
[crill-004:18161] mca: base: components_open: found loaded component self
[crill-004:18161] mca: base: components_open: component self open
function successful
[crill-004:18161] mca: base: components_open: found loaded component tcp
[crill-004:18161] mca: base: components_open: component tcp open
function successful
[crill-004:18161] select: initializing btl component self
[crill-004:18161] select: init of component self returned success
[crill-004:18161] select: initializing btl component tcp
[crill-004:18161] btl: tcp: Searching for exclude address+prefix:
127.0.0.1 / 8
[crill-004:18161] btl: tcp: Found match: 127.0.0.1 (lo)
[crill-004:18161] select: init of component tcp returned success
[crill-003:18962] mca: base: components_register: registering btl components
[crill-003:18962] mca: base: components_register: found loaded component
self
[crill-003:18962] mca: base: components_register: component self
register function successful
[crill-003:18962] mca: base: components_register: found loaded component tcp
[crill-003:18962] mca: base: components_register: component tcp register
function successful
[crill-003:18962] mca: base: components_open: opening btl components
[crill-003:18962] mca: base: components_open: found loaded component self
[crill-003:18962] mca: base: components_open: component self open
function successful
[crill-003:18962] mca: base: components_open: found loaded component tcp
[crill-003:18962] mca: base: components_open: component tcp open
function successful
[crill-003:18963] mca: base: components_register: registering btl components
[crill-003:18963] mca: base: components_register: found loaded component
self
[crill-003:18963] mca: base: components_register: component self
register function successful
[crill-003:18963] mca: base: components_register: found loaded component tcp
[crill-003:18963] mca: base: components_register: component tcp register
function successful
[crill-003:18963] mca: base: components_open: opening btl components
[crill-003:18963] mca: base: components_open: found loaded component self
[crill-003:18963] mca: base: components_open: component self open
function successful
[crill-003:18963] mca: base: components_open: found loaded component tcp
[crill-003:18963] mca: base: components_open: component tcp open
function successful
[crill-003:18964] mca: base: components_register: registering btl components
[crill-003:18964] mca: base: components_register: found loaded component
self
[crill-003:18964] mca: base: components_register: component self
register function successful
[crill-003:18964] mca: base: components_register: found loaded component tcp
[crill-003:18964] mca: base: components_register: component tcp register
function successful
[crill-003:18964] mca: base: components_open: opening btl components
[crill-003:18964] mca: base: components_open: found loaded component self
[crill-003:18964] mca: base: components_open: component self open
function successful
[crill-003:18964] mca: base: components_open: 

Re: [OMPI devel] problem running jobs on ompi-master

2014-12-26 Thread Gilles Gouaillardet
Edgar,

First, make sure your master includes 
https://github.com/open-mpi/ompi/commit/05af80b3025dbb95bdd4280087450791291d7219
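
One possible way to check whether a checkout already contains that commit is sketched below. The commit hash is the one from the link above; the helper name has_fix and the clone path in the usage note are only placeholders, not anything that ships with Open MPI.

```shell
#!/bin/sh
# Commit hash taken from the link above.
FIX=05af80b3025dbb95bdd4280087450791291d7219

# has_fix REPO_DIR (hypothetical helper): succeeds only if HEAD of the
# git checkout in REPO_DIR has $FIX somewhere in its history.
# It fails if the commit is missing or REPO_DIR is not a git checkout.
has_fix() {
    git -C "$1" merge-base --is-ancestor "$FIX" HEAD 2>/dev/null
}
```

For example, assuming the clone lives in $HOME/ompi: `has_fix "$HOME/ompi" && echo "fix present" || echo "fix missing, update the clone"`.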


If this is not enough, try with --mca coll ^ml
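
As a rough sketch, reusing the tcp,self setting and the hello_world binary from the report quoted below, that could look as follows. The leading "^" tells the MCA framework to exclude the listed component; the quotes are only there because some shells treat "^" specially. The environment-variable form is an alternative I am assuming is convenient here, it is not required.

```shell
#!/bin/sh
# Variant 1: exclude the "ml" coll component for a single run.
# Guarded so the snippet is a no-op where mpirun or the binary is absent.
if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_world ]; then
    mpirun --mca coll '^ml' --mca btl tcp,self -np 4 ./hello_world
fi

# Variant 2: set the same MCA parameter through the environment, so that
# later mpirun invocations from this shell pick it up without the flag.
export OMPI_MCA_coll='^ml'
```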

Hope this helps

Gilles. 

Edgar Gabriel wrote:
>I have some problems running jobs with ompi-master on one of our 
>clusters (after doing a major software update). Here are scenarios that 
>work and don't work.
>
>1. Everything still seems to work with the 1.8.x series without any issues
>2. With master, I can run single-node, multi-process jobs without any issues
>3. With master, I can run multi-node jobs without any issues, as long as 
>there is only a single MPI process per node
>
>4. I cannot run multi-node jobs with multiple processes per node; ompi 
>hangs in that scenario. This is independent of whether I enforce using 
>openib or tcp, and just for the sake of simplicity I attach the output 
>for tcp (there is another openib parameter issue that still lingers, 
>but I will report that later).
>
>Here is the output that I receive when setting btl_base_verbose:
>
>---snip--
>gabriel@crill:~> salloc -N 2 -n 4
>gabriel@crill:~> mpirun --mca btl tcp,self --mca btl_base_verbose 30 -np 4 ./hello_world
>[crill-004:18161] mca: base: components_register: registering btl components
>[crill-004:18161] mca: base: components_register: found loaded component 
>self
>[crill-004:18161] mca: base: components_register: component self 
>register function successful
>[crill-004:18161] mca: base: components_register: found loaded component tcp
>[crill-004:18161] mca: base: components_register: component tcp register 
>function successful
>[crill-004:18161] mca: base: components_open: opening btl components
>[crill-004:18161] mca: base: components_open: found loaded component self
>[crill-004:18161] mca: base: components_open: component self open 
>function successful
>[crill-004:18161] mca: base: components_open: found loaded component tcp
>[crill-004:18161] mca: base: components_open: component tcp open 
>function successful
>[crill-004:18161] select: initializing btl component self
>[crill-004:18161] select: init of component self returned success
>[crill-004:18161] select: initializing btl component tcp
>[crill-004:18161] btl: tcp: Searching for exclude address+prefix: 
>127.0.0.1 / 8
>[crill-004:18161] btl: tcp: Found match: 127.0.0.1 (lo)
>[crill-004:18161] select: init of component tcp returned success
>[crill-003:18962] mca: base: components_register: registering btl components
>[crill-003:18962] mca: base: components_register: found loaded component 
>self
>[crill-003:18962] mca: base: components_register: component self 
>register function successful
>[crill-003:18962] mca: base: components_register: found loaded component tcp
>[crill-003:18962] mca: base: components_register: component tcp register 
>function successful
>[crill-003:18962] mca: base: components_open: opening btl components
>[crill-003:18962] mca: base: components_open: found loaded component self
>[crill-003:18962] mca: base: components_open: component self open 
>function successful
>[crill-003:18962] mca: base: components_open: found loaded component tcp
>[crill-003:18962] mca: base: components_open: component tcp open 
>function successful
>[crill-003:18963] mca: base: components_register: registering btl components
>[crill-003:18963] mca: base: components_register: found loaded component 
>self
>[crill-003:18963] mca: base: components_register: component self 
>register function successful
>[crill-003:18963] mca: base: components_register: found loaded component tcp
>[crill-003:18963] mca: base: components_register: component tcp register 
>function successful
>[crill-003:18963] mca: base: components_open: opening btl components
>[crill-003:18963] mca: base: components_open: found loaded component self
>[crill-003:18963] mca: base: components_open: component self open 
>function successful
>[crill-003:18963] mca: base: components_open: found loaded component tcp
>[crill-003:18963] mca: base: components_open: component tcp open 
>function successful
>[crill-003:18964] mca: base: components_register: registering btl components
>[crill-003:18964] mca: base: components_register: found loaded component 
>self
>[crill-003:18964] mca: base: components_register: component self 
>register function successful
>[crill-003:18964] mca: base: components_register: found loaded component tcp
>[crill-003:18964] mca: base: components_register: component tcp register 
>function successful
>[crill-003:18964] mca: base: components_open: opening btl components
>[crill-003:18964] mca: base: components_open: found loaded component self
>[crill-003:18964] mca: base: components_open: component self open 
>function successful
>[crill-003:18964] mca: base: components_open: found loaded component tcp
>[crill-003:18964] mca: base: components_open: component tcp open 
>function successful
>[crill-003:18962] select: initializing btl component self