Open-MPI Users,


I've been using OpenMPI for a while now and am very pleased with it. I run
it across eight Red Hat Linux nodes (8 cores each) connected by 1 Gbps
Ethernet behind a dedicated switch. After working out some kinks in the
beginning, we've been using it periodically on anywhere from 8 to 64 cores.
We use a finite element program named LS-DYNA. We do not have the source
code for this program; it is compiled to work with OpenMPI 1.4.1 (I use
1.4.2), and we cannot make changes or request code to see how it performs
certain functions.



From time to time, while I am simulating a particular "job" in LS-DYNA, it
will for some reason quit OpenMPI, issuing an MPI_ABORT and stating
that "connect to address xx.xxx.xxx.xxx port xxx: Connection refused; trying
normal rsh (/usr/bin/rsh)." This error comes after the job has been running
for hours, which means that connections to the node it cites had already
been made successfully. The node it names is random and changes from
simulation to simulation. We use SSH to communicate, and the nodes are open
to node-to-node communication on any port.
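
One way to check whether a node really does start refusing connections
partway through a run is to probe it over TCP at intervals while the job is
running. The following is only a minimal sketch, not part of LS-DYNA or
OpenMPI: the function names probe and probe_all are hypothetical, and the
host names and port passed in (for example port 22, assuming sshd listens
there) would have to be replaced with the cluster's real values.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection and classify the result.

    Returns "open" if the connection succeeds, "refused" if the remote
    end actively rejects it, and "unreachable" for any other failure
    (timeout, name lookup failure, no route to host).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"


def probe_all(hosts, port, timeout=3.0):
    """Probe every host once and return a {host: status} mapping."""
    return {host: probe(host, port, timeout) for host in hosts}
```

Calling probe_all on the node list every minute or so and logging the
result with a timestamp would show whether the "Connection refused" in the
abort message matches a moment when the named node was rejecting new
connections, or whether the node still looked reachable at that time.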



Does any user have experience with this error where a connection is
established, and used for several hours, but after a seemingly random period
of time the program dies stating it can't make a connection?



Thanks,

Robert Walters
