Howard and Edgar,

I fixed a few bugs (r32639 and r32642).

The bug is trivial to reproduce with any MPI hello world program:

mpirun -np 2 --mca btl openib,self hello_world

after setting the MCA param in $HOME/.openmpi/mca-params.conf:

$ cat ~/.openmpi/mca-params.conf
btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
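
Any minimal hello world is enough; for reference, something along these
lines (this listing is just an illustrative sketch, not a specific test
program from this thread):

---snip---
/* hello_world.c -- minimal MPI program, enough to trigger the abort */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();

    return 0;
}
---snip---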

The good news is that the program no longer crashes with a glorious SIGSEGV;
the bad news is that it now (nicely) aborts for an incorrect reason:

--------------------------------------------------------------------------
The Open MPI receive queue configuration for the OpenFabrics devices
on two nodes are incompatible, meaning that MPI processes on two
specific nodes were unable to communicate with each other.  This
generally happens when you are using OpenFabrics devices from
different vendors on the same network.  You should be able to use the
mca_btl_openib_receive_queues MCA parameter to set a uniform receive
queue configuration for all the devices in the MPI job, and therefore
be able to run successfully.

  Local host:       node0
  Local adapter:    mlx4_0 (vendor 0x2c9, part ID 4099)
  Local queues:     S,12288,128,64,32:S,65536,128,64,3

  Remote host:      node0
  Remote adapter:   (vendor 0x2c9, part ID 4099)
  Remote queues:    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

The root cause is that the remote host did not send its receive_queues to
the local host, and hence the local host believes the remote host uses the
default value.

The logic was revamped since v1.8, which is why v1.8 does not have this issue.

I am still thinking about what the right fix should be:
- one option is to send the receive queues
- another option would be to differentiate a value overridden in
  mca-params.conf (which should always be OK) from a value overridden in the .ini
  (where we might want to double check that local and remote values match)
Either way, the check in mca_btl_openib_tune_endpoint has to cope with a
missing remote value; see the sketch below.
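
As a rough illustration of that guard (the helper name, its arguments and
return values below are assumptions made for the sketch, not the actual
trunk code; only the strcmp()-based comparison is taken from the existing
code path):

---snip---
#include <string.h>

/* Sketch only -- not the actual trunk code. */
static int check_receive_queues(const char *local_qps,
                                const char *remote_qps)
{
    /* If the remote side never sent its receive_queues string, do not
     * silently assume it uses the default value: treat the information
     * as missing instead of reporting a (bogus) mismatch or crashing
     * in strcmp(). */
    if (NULL == remote_qps) {
        return 0;
    }

    /* Only complain about incompatible receive queues when both
     * strings are actually present and really differ. */
    if (0 != strcmp(local_qps, remote_qps)) {
        return -1;
    }

    return 0;
}
---snip---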

Cheers,

Gilles

On 2014/08/29 7:02, Pritchard Jr., Howard wrote:
> Hi Edgar,
>
> Could you send me your conf file?  I'll try to reproduce it.
>
> Maybe run with --mca btl_base_verbose 20 or something to
> see what the code that is parsing this field in the conf file
> is finding.
>
>
> Howard
>
>
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Edgar Gabriel
> Sent: Thursday, August 28, 2014 3:40 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] segfault in openib component on trunk
>
> to add another piece of information that I just found, the segfault only 
> occurs if I have a particular mca parameter set in my mca-params.conf file, 
> namely
>
> btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,3
>
> Has the syntax for this parameter changed, or should/can I get rid of it?
>
> Thanks
> Edgar
>
> On 08/28/2014 04:19 PM, Edgar Gabriel wrote:
>> we have recently been having problems running trunk with the openib
>> component enabled on one of our clusters. The problem occurs right in
>> the initialization part; here is the stack right before the segfault:
>>
>> ---snip---
>> (gdb) where
>> #0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40,
>> endpoint=0x7d9660) at btl_openib.c:470
>> #1  0x00007f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, 
>> nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at
>> btl_openib.c:1093
>> #2  0x00007f106316102c in mca_bml_r2_add_procs (nprocs=2, 
>> procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
>> #3  0x00007f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00,
>> nprocs=2) at pml_ob1.c:334
>> #4  0x00007f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, 
>> requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
>> #5  0x00007f1068273a2c in MPI_Init (argc=0x7fff22dd188c,
>> argv=0x7fff22dd1880) at init.c:84
>> #6  0x00000000004008e7 in main (argc=1, argv=0x7fff22dd1da8) at
>> hello_world.c:13
>> ---snip---
>>
>>
>> In line 538 of the file containing the mca_btl_openib_tune_endpoint
>> routine, the strcmp operation fails because recv_qps is a NULL pointer.
>>
>>
>> ---snip---
>>
>> if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {
>>
>> ---snip---
>>
>> Does anybody have an idea of what might be going wrong and how to
>> resolve it? Just to confirm, everything works perfectly with the 1.8
>> series on that very same cluster.
>>
>> Thanks
>> Edgar
