Here is another possibly unhelpful suggestion.  :)  Change:

     char* name[20];
     int maxlen = 20;

To:

     char name[256];
     int maxlen = 256;

Note that "char* name[20]" declares an array of 20 char pointers rather than a 
20-character buffer, so the change above also fixes the type. gethostname() is 
supposed to truncate the hostname it returns if the actual name is longer than 
the length provided (though POSIX leaves it unspecified whether the truncated 
result is null-terminated), but since you have at least one hostname that is 
longer than 20 characters, I'm curious whether that is what's biting you here.
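
In case it helps, here is a minimal standalone sketch of the buffer handling I 
have in mind (just the hostname part, no MPI; the HOST_BUF_LEN name is only 
there for illustration):

     #include <stdio.h>
     #include <unistd.h>              /* gethostname() */

     #define HOST_BUF_LEN 256         /* plain char buffer, not char* [20] */

     int main(void)
     {
         char name[HOST_BUF_LEN];

         if (gethostname(name, sizeof(name)) != 0) {
             perror("gethostname");
             return 1;
         }
         /* gethostname() may not null-terminate on truncation, so force it. */
         name[sizeof(name) - 1] = '\0';

         printf("host %s\n", name);
         return 0;
     }

Using sizeof(name) instead of a separate maxlen variable keeps the length and 
the declaration from drifting apart.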

Brent


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Tuesday, September 27, 2011 6:29 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault on any MPI communication on head node

Hmm.  It's not immediately clear to me what's going wrong here.

I hate to ask, but could you install a debugging version of Open MPI and 
capture a proper stack trace of the segv?

Also, could you try the 1.4.4 rc and see if that magically fixes the problem? 
(I'm about to post a new 1.4.4 rc later this morning, but either the current 
one or the one from later today would be a good data point.)


On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:

> Yep, Fedora Core 14 and OpenMPI 1.4.3
> 
> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>> Are you running the same OS version and Open MPI version between the head 
>> node and regular nodes?
>> 
>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>> 
>>> Hey all,
>>> I've been racking my brains over this for several days and was hoping 
>>> someone could enlighten me. I'll describe only the relevant parts of the 
>>> network/computer systems. There is one head node and a multitude of regular 
>>> nodes. The regular nodes are all identical to each other. If I run an MPI 
>>> program from one of the regular nodes to any other regular nodes, 
>>> everything works. If I include the head node in the hostfile, I get 
>>> segfaults, which I'll paste below along with sample code. The machines are 
>>> all networked via InfiniBand and Ethernet. The issue only arises when MPI 
>>> communication occurs: MPI_Init might succeed, but the segfault always 
>>> occurs on MPI_Barrier or MPI_Send/MPI_Recv. I found a workaround by 
>>> disabling the openib BTL and forcing the TCP traffic to use the ib0 
>>> (InfiniBand) interface (if I don't force it, it goes over Ethernet). This 
>>> command works when the head node is included in the hostfile:
>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0  
>>> -np 2 ./b.out
>>> 
>>> Sample Code:
>>> #include "mpi.h"
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> int main(int argc, char *argv[])
>>> {
>>>     int rank, nprocs;
>>>     char* name[20];
>>>     int maxlen = 20;
>>>     MPI_Init(&argc,&argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>>>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>     gethostname(name,maxlen);
>>>     printf("Hello, world.  I am %d of %d and host %s \n", rank, 
>>> nprocs,name);
>>>     fflush(stdout);
>>>     MPI_Finalize();
>>>     return 0;
>>> 
>>> }
>>> 
>>> Segfault:
>>> [pastec:19917] *** Process received signal ***
>>> [pastec:19917] Signal: Segmentation fault (11)
>>> [pastec:19917] Signal code: Address not mapped (1)
>>> [pastec:19917] Failing at address: 0x8
>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>> [pastec:19917] [ 2] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>> [pastec:19917] [ 3] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>> [pastec:19917] [ 4] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>> [pastec:19917] [ 5] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) 
>>> [0x7eff676670e6]
>>> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) 
>>> [0x7eff6765b273]
>>> [pastec:19917] [ 8] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>> [pastec:19917] [ 9] 
>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) 
>>> [0x3a54c4c94e]
>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>> [pastec:19917] [13] ./b.out() [0x400919]
>>> [pastec:19917] *** End of error message ***
>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] 
>>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 19917 on node 
>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> 
>>> 
>> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


