Hi, I am happy to report that I believe I have finally found the fix for the "No route to host" error!
The solution was to increase the ARP cache size on the head node and to add a few static ARP entries. The cache was overflowing at some point during program execution, leading to the connection disruptions and the error messages. I am still not sure, though, how the program managed to run successfully on certain occasions previously.

I want to thank everyone who helped me with this - particularly Eric and Jeff - for sharing their thoughts and for their time and effort. Thanks a lot, guys.

On a side note, the other issue I noticed - the trivial execution of my helloWorld program with 1 process failing when run in debug mode - is something I have not yet resolved. It will take a bit longer since, as Eric mentioned, I need to upgrade the GCC version, fix the optimization flags, and update all the nodes. I intend to follow up on this and fix it, but I'll be doing it a bit later. I'll update the mailing list once I make any progress.

Again, thanks a lot for your invaluable help.

Regards,
Prasanna.

On 9/15/08 11:08 AM, "users-requ...@open-mpi.org" <users-requ...@open-mpi.org> wrote:

> Message: 1
> Date: Mon, 15 Sep 2008 12:42:50 -0400
> From: Eric Thibodeau <ky...@neuralbs.com>
> Subject: Re: [OMPI users] Need help resolving No route to host error
>         with OpenMPI 1.1.2
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <48ce908a.9080...@neuralbs.com>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
> Simply to keep track of what's going on:
>
> I checked the build environment for openmpi and the system's settings;
> they were built using gcc 3.4.4 with -Os, which is reputed to be unstable
> and problematic with this compiler version. I've asked Prasanna to rebuild
> using -O2, but this could be a bit lengthy since the entire system (or at
> least all the libs openmpi links to) needs to be rebuilt.
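(A note for the archives: on a Gentoo system, the -O2 rebuild Eric describes amounts to something along these lines. The make.conf flags shown are generic examples, not the exact settings from this cluster.)

```shell
# In /etc/make.conf, replace -Os with the safer -O2, e.g.:
#   CFLAGS="-O2 -pipe"
#   CXXFLAGS="${CFLAGS}"

# Then rebuild Open MPI (and, ideally, the libraries it links against):
emerge --oneshot sys-cluster/openmpi
```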
>
> Eric
>
> Eric Thibodeau wrote:
>> Prasanna,
>>
>> Please send me your /etc/make.conf and the contents of
>> /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>
>> You can package this with the following command line:
>>
>> tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/
>>
>> And simply send me the data.tbz file.
>>
>> Thanks,
>>
>> Eric
>>
>> Prasanna Ranganathan wrote:
>>> Hi,
>>>
>>> I did make sure at the beginning that only eth0 was activated on all the
>>> nodes. Nevertheless, I am currently verifying the NIC configuration on all
>>> the nodes and making sure things are as expected.
>>>
>>> While trying different things, I did come across this peculiar error which I
>>> had detailed in one of my previous mails in this thread.
>>>
>>> I am testing the helloWorld program in the following trivial case:
>>>
>>> mpirun -np 1 -host localhost /main/mpiHelloWorld
>>>
>>> which works fine. But
>>>
>>> mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld
>>>
>>> always fails as follows:
>>>
>>> Daemon [0,0,1] checking in as pid 2059 on host localhost
>>> [idx1:02059] [0,0,1] orted: received launch callback
>>> idx1 is node 0 of 1
>>> ranks sum to 0
>>> [idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [idx1:02059] [0,0,1] orted_recv_pls: received exit
>>> [idx1:02059] *** Process received signal ***
>>> [idx1:02059] Signal: Segmentation fault (11)
>>> [idx1:02059] Signal code: (128)
>>> [idx1:02059] Failing at address: (nil)
>>> [idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
>>> [idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2afa8be8e2a2]
>>> [idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2afa8be795ac]
>>> [idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2afa8be7675c]
>>> [idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
>>> [idx1:02059] *** End of error message ***
>>>
>>> The failure happens with more verbose output when using the -d flag.
>>>
>>> Does this point to some bug in OpenMPI, or am I missing something here?
>>>
>>> I have attached the ompi_info output from this node.
>>>
>>> Regards,
>>>
>>> Prasanna.
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
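P.S. For anyone who hits this later: the ARP fix I applied on the head node amounts to something along these lines. The threshold values and the IP/MAC address below are placeholders for illustration, not my exact settings - size the thresholds to comfortably exceed your node count.

```shell
# Raise the kernel neighbour-table (ARP cache) limits on the head node.
# gc_thresh1: below this count, entries are not garbage-collected.
# gc_thresh2: soft limit; collection becomes aggressive above this.
# gc_thresh3: hard maximum; new entries fail once this is reached.
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

# Pin a static ARP entry for a compute node so it can never be evicted
# (repeat per node; IP and MAC here are placeholders).
arp -s 192.168.0.11 00:11:22:33:44:55
```

To make the sysctl settings survive a reboot, the same keys can go in /etc/sysctl.conf.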