Re: [OMPI devel] rdmacm and udcm for 2.0.1 and RoCE

2017-01-12 Thread Jeff Squyres (jsquyres)
Did we just recently discuss the openib BTL failover capability and decide that 
it had bit-rotted?

If so, we need to amend our documentation and disable the code.


> On Jan 11, 2017, at 3:11 PM, Dave Turner  wrote:
> 
> 
>  The btl_openib_receive_queues parameters that Howard provided
> fixed our problem with getting 2.0.1 working with RoCE, so thanks for
> all the help.  However, we are seeing segfaults with this when
> configured with --enable-btl-openib-failover.  I've included the 
> configuration below that the package manager uses under Gentoo.
> I also tested this after removing all of the redundant enable/disables,
> and it's definitely the --enable-btl-openib-failover that causes 2.0.1
> on RoCE to segfault.  I can enable debugging and recompile if more
> information is needed.
> 
>  Could someone also explain why these parameters need to
> be set explicitly for RoCE rather than being embedded in the code?
> 
>Dave
> 
> This is the configure line that our package manager generates:
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu
> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
> --localstatedir=/var/lib --disable-dependency-tracking
> --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1
> --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64
> --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace
> --enable-orterun-prefix-by-default --with-hwloc=/usr
> --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx
> --without-cma --with-cuda=/opt/cuda --disable-io-romio
> --disable-heterogeneous --enable-ipv6 --disable-java
> --disable-mpi-java --disable-mpi-thread-multiple --without-verbs
> --without-knem --without-psm --disable-openib-control-hdr-padding
> --disable-openib-connectx-xrc --disable-openib-rdmacm
> --disable-openib-udcm --disable-openib-dynamic-sl
> --disable-btl-openib-failover --without-tm --without-slurm --with-sge
> --enable-openib-connectx-xrc --enable-openib-rdmacm
> --enable-openib-udcm --enable-openib-dynamic-sl
> --enable-btl-openib-failover --with-verbs
> 
> On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard  wrote:
> Hi Dave,
> 
> Sorry for the delayed response.  
> 
> Anyway, you have to use rdmacm for connection management when using RoCE.
> However, with 2.0.1 and later, you have to specify per-peer QP info manually
> on the mpirun command line.
> 
> Could you try rerunning with
> 
> mpirun --mca btl_openib_receive_queues
> P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
> (all the rest of the command line args)
> 
> and see if it then works?
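
For reference, a minimal sketch of how the same setting could be made
persistent instead of being typed on every command line, assuming the usual
per-user MCA parameter file location ($HOME/.openmpi/mca-params.conf):

mkdir -p $HOME/.openmpi
cat >> $HOME/.openmpi/mca-params.conf <<'EOF'
# RoCE receive-queue layout suggested above: one per-peer queue, three shared queues
btl_openib_receive_queues = P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
EOF

With that in place, mpirun picks the value up automatically and only the
application-specific arguments remain on the command line.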
> 
> Howard
> 
> 
> 2017-01-04 16:37 GMT-07:00 Dave Turner :
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:   elf22
>   Local device: mlx4_2
>   Local port:   1
>   CPCs attempted:   rdmacm, udcm
> --------------------------------------------------------------------------
> 
> I posted this to the user list but got no answer so I'm reposting to
> the devel list.
> 
> We recently upgraded to Open MPI 2.0.1.  Everything works fine
> on our QDR connections, but we get the error above for our
> 40 GbE connections running RoCE.  I traced through the code and
> it looks like udcm cannot be used with RoCE.  I've also read that
> there are currently some problems with rdmacm under 2.0.1, which
> would mean 2.0.1 does not currently work on RoCE.  We've tested
> 10.4 using rdmacm and that works fine, so I don't think we have
> anything configured wrong on the RoCE side.
>  Could someone please verify whether it is correct that RoCE
> requires rdmacm only (not udcm), and that rdmacm is currently
> not working?  If so, is it being worked on?
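
One hedged way to narrow this down is to restrict the connection manager
explicitly and raise the BTL verbosity so the CPC selection is visible; the
btl_openib_cpc_include and btl_base_verbose parameters exist in the openib
BTL and BTL base, while ./my_app and the component list are illustrative only:

# force the rdmacm CPC and print the BTL/CPC selection decisions
mpirun -np 2 \
       --mca btl openib,vader,self \
       --mca btl_openib_cpc_include rdmacm \
       --mca btl_base_verbose 10 \
       ./my_app

If rdmacm itself is the problem, this should fail with an explicit CPC error
for the port instead of attempting udcm.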
> 
>  Dave
> 
> 
> -- 
> Work: davetur...@ksu.edu (785) 532-7791
>  2219 Engineering Hall, Manhattan KS  66506
> Home:drdavetur...@gmail.com
>   cell: (785) 770-5929
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Fwd: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-12 Thread Howard Pritchard
Siegmar,

Could you confirm that your test case passes if you use one of the mpirun
arg lists that works for Gilles?  Something simple like

mpirun -np 1 ./spawn_master

?

Howard




2017-01-11 18:27 GMT-07:00 Gilles Gouaillardet :

> Ralph,
>
>
> so it seems the root cause is a kind of incompatibility between the --host
> and the --slot-list options
>
>
> on a single node with two six-core sockets,
> this works:
>
> mpirun -np 1 ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
> mpirun -np 1 --host motomachi --oversubscribe ./spawn_master
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master
>
>
> this does not work
>
> mpirun -np 1 --host motomachi ./spawn_master # not enough slots available,
> aborts with a user-friendly error message
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master #
> various errors: sm_segment_attach() fails, a task crashes,
> and this ends up with the following error message:
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[15519,2],0]) is on host: motomachi
>   Process 2 ([[15519,2],1]) is on host: unknown!
>   BTLs attempted: self tcp
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master #
> same error as above
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master #
> same error as above
>
>
> for the record, the following command surprisingly works
>
> mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self
> ./spawn_master
>
>
>
> Bottom line: my guess is that when the user specifies both the --slot-list
> and the --host options
> *and* no slot counts are given for the hosts, we should default to using
> the number of slots from the slot list
> (e.g. in this case, default to --host motomachi:12 instead of, I guess,
> --host motomachi:1).
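
As an illustration of the same point, a definite slot count can also be given
through a hostfile rather than the host:N syntax; a sketch (untested here, the
hostfile name is illustrative) that expresses the same intent as
--host motomachi:12:

# give the node an explicit slot count, then reference the hostfile
echo "motomachi slots=12" > hosts.txt
mpirun -np 1 --slot-list 0:0-5,1:0-5 --hostfile hosts.txt ./spawn_master

Either form gives the runtime an explicit slot count, which is exactly what
the bare --host motomachi invocation above is missing.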
>
>
> /* fwiw, i made
>
> https://github.com/open-mpi/ompi/pull/2715
>
> but these are not the root cause */
>
>
> Cheers,
>
>
> Gilles
>
>
>
>  Forwarded Message 
> Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3
> on Linux
> Date: Wed, 11 Jan 2017 20:39:02 +0900
> From: Gilles Gouaillardet 
> 
> Reply-To: Open MPI Users 
> 
> To: Open MPI Users  
>
>
> Siegmar,
>
> Your slot list is correct.
> An invalid slot list for your node would be 0:1-7,1:0-7
>
> /* and since the test requires only 5 tasks, that could even work with
> such an invalid list.
> My vm is single socket with 4 cores, so a 0:0-4 slot list results in an
> unfriendly pmix error */
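
To make the socket:core syntax concrete for the two-socket, six-core node
discussed here (cores 0-5 on each socket):

# valid: all six cores on each of the two sockets
mpirun -np 1 --slot-list 0:0-5,1:0-5 ./spawn_master
# invalid on this node: cores 6 and 7 do not exist on a six-core socket
mpirun -np 1 --slot-list 0:1-7,1:0-7 ./spawn_master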
>
> Bottom line, your test is correct, and there is a bug in v2.0.x that I
> will investigate from tomorrow
>
> Cheers,
>
> Gilles
>
> On Wednesday, January 11, 2017, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
>
>> Hi Gilles,
>>
>> thank you very much for your help. What does incorrect slot list
>> mean? My machine has two 6-core processors so that I specified
>> "--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
>> allowed to specify more slots than available, to specify fewer
>> slots than available, or to specify more slots than needed for
>> the processes?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> On 11.01.2017 at 10:04, Gilles Gouaillardet wrote:
>>
>>> Siegmar,
>>>
>>> I was able to reproduce the issue on my vm
>>> (No need for a real heterogeneous cluster here)
>>>
>>> I will keep digging tomorrow.
>>> Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
>>> with a very unfriendly error message.
>>> Right now, the 4th spawned task crashes, so this is a different issue.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> r...@open-mpi.org wrote:
>>> I think there is some relevant discussion here:
>>> https://github.com/open-mpi/ompi/issues/1569
>>>
>>> It looks like Gilles had (at least at one point) a fix for master when
>>> --enable-heterogeneous is used, but I don’t know if that was committed.
>>>
>>> On Jan 9, 2017, at 8:23 AM, Howard Pritchard wrote:

 HI Siegmar,

 You have some config parameters I wasn't trying that may have some
 impact.
 I'll give a try with these parameters.

 This should be enough info for now,

 Thanks,

 Howard


 2017-01-09 0:59 GMT-07:00 Siegmar Gross:

 Hi Howard,

 I use the following commands to build and install the package.
 ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
 Linux machine.

 mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
 cd ope

Re: [OMPI devel] [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-12 Thread r...@open-mpi.org
Fix is pending here: https://github.com/open-mpi/ompi/pull/2730 


> On Jan 12, 2017, at 8:57 AM, Howard Pritchard wrote:
> 
> Siegmar,
> 
> Could you confirm that your test case passes if you use one of the mpirun
> arg lists that works for Gilles?  Something simple like
> 
> mpirun -np 1 ./spawn_master
> 
> ?
> 
> Howard
> 
> 
> 
> 

[OMPI devel] OMPI v1.10.6

2017-01-12 Thread r...@open-mpi.org
Hi folks

It looks like we may have motivation to release 1.10.6 in the near future. 
Please check to see if you have anything that should be included, or is pending 
review.

Thanks
Ralph

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] rdmacm and udcm for 2.0.1 and RoCE

2017-01-12 Thread Jeff Squyres (jsquyres)
I checked:

1. This option existed in v2.0.1, but it no longer exists in the 
soon-to-be-released v2.0.2.
2. Here's where we removed it: https://github.com/open-mpi/ompi/pull/2350

There's no rationale listed on that PR, but the reason is that the code is
stale and no longer works.
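
A quick way to confirm what a given tarball or installation still supports (a
sketch; the grep patterns are illustrative):

# in an unpacked source tree: list the openib-related configure switches
./configure --help | grep -i openib
# against an installed build: list the openib BTL's MCA parameters
ompi_info --param btl openib --level 9 | grep -i failover

In v2.0.2 the failover-related entries should no longer appear.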

Sorry Dave.  :-\


> On Jan 12, 2017, at 6:54 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Did we just recently discuss the openib BTL failover capability and decide 
> that it had bit-rotted?
> 
> If so, we need to amend our documentation and disable the code.
> 
> 

Re: [OMPI devel] [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-12 Thread Siegmar Gross

Hi Howard and Gilles,

thank you very much for your help. All commands that work for
Gilles also work on my machine as expected, and the commands that
don't work on his machine don't work on mine either. The first one
that works with both --slot-list and --host is the following
command, so it seems that the value depends on the number of
processes in the remote group.

loki spawn 122 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:3 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 3

Slave process 0 of 3 running on loki
spawn_slave 0: argv[0]: spawn_slave
Slave process 1 of 3 running on loki
spawn_slave 1: argv[0]: spawn_slave
Slave process 2 of 3 running on loki
spawn_slave 2: argv[0]: spawn_slave
loki spawn 123


Here is the output from the other commands.

loki spawn 112 mpirun -np 1 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 3 of 4 running on loki
Slave process 0 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
loki spawn 113 mpirun -np 1 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 114 mpirun -np 1 --host loki --oversubscribe spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 115 mpirun -np 1 --slot-list 0:0-5,1:0-5 --host loki:12 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 3 of 4 running on loki
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

spawn_slave 2: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
loki spawn 116 mpirun -np 1 --host loki:12 --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
  I create 4 slave processes

Slave process 0 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
Slave process 3 of 4 running on loki
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
Parent process 0: tasks in MPI_COMM_WORLD:1
  tasks in COMM_CHILD_PROCESSES local group:  1
  tasks in COMM_CHILD_PROCESSES remote group: 4

loki spawn 117


Kind regards

Siegmar
