I am having a problem with an iterative Fortran 90 MPI job. I am using MPICH2 and launch the job with mpiexec -np 8 executable. I initially ran the job on 8 nodes and it worked fine. After the job crashed several times following some code changes, the MPI program now stops after the first iteration finishes, with the error message below.
I suspect that stray processes are still running on the slave nodes. That is, if the process on the master node was terminated, the processes on the slave nodes might still be running, and I guess that might be causing the problem below. But I have no idea how to clean up those stray processes...
Any suggestions are highly appreciated!
Thanks,
Michelle
[cli_2]: aborting job:
Fatal error in MPI_Alltoall: Other MPI error, error stack:
MPI_Alltoall(826).........................: MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740, rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
MPIR_Alltoall(593)........................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_4]: aborting job:
Fatal error in MPI_Alltoall: Other MPI error, error stack:
MPI_Alltoall(826).........................: MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740, rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
MPIR_Alltoall(593)........................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)
rank 4 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all ranks
exit status of rank 4: return code 1
rank 0 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
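
For reference, here is a minimal sketch of one way the stray processes suspected above could be looked for on each node. It is only an illustration under assumptions, not a confirmed fix: it assumes passwordless ssh to every node, that pgrep and pkill are installed there, and that the leftover processes run under the same user account. The node names, the function name build_cleanup_commands, and the use of "executable" as the program name are all hypothetical. The script only prints the commands; it does not run anything.

```python
import getpass
import os
import shlex


def build_cleanup_commands(hosts, executable, user=None, kill=False):
    """Return one ssh command (as an argument list) per host.

    kill=False: the remote command only lists matching processes (pgrep -l).
    kill=True:  the remote command terminates them (pkill).
    """
    user = user or getpass.getuser()
    # Match on the bare program name. Note: -x compares against the kernel's
    # process name, which Linux truncates to 15 characters.
    name = os.path.basename(executable)
    tool = "pkill" if kill else "pgrep"
    args = [tool, "-u", user, "-x", name]
    if not kill:
        args.insert(1, "-l")  # show PID and process name when only listing
    remote = " ".join(shlex.quote(a) for a in args)
    return [["ssh", host, remote] for host in hosts]


if __name__ == "__main__":
    # Hypothetical node names; replace with the real slave nodes.
    nodes = ["node1", "node2"]
    # Dry run: print the listing commands so they can be inspected first.
    for cmd in build_cleanup_commands(nodes, "./executable"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Listing first (the default) and switching to kill=True only after checking the output seems the safer order, since pkill would terminate every process with that name owned by the user on the node, including any job that is still legitimately running.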
_______________________________________________ Oscar-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/oscar-users
