Hi there,
I am having a problem with an iterative Fortran 90 MPI job. I am using MPICH2, launching with mpiexec -np 8 executable. Initially the job ran fine on 8 nodes. After the job crashed several times while I was changing some code, the MPI program now stops after the first iteration finishes, with the following error message.

I suspect there are stray processes still running on the slave nodes. That is, when the process on the master node was terminated, the processes on the slave nodes might have kept running, and I guess they might be causing the problem below. But I have no idea how to clean up those stray processes...
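For reference, one way to check for and kill leftover copies of the program on each node is to loop over the hosts with ssh and pkill. This is only a sketch: the node names and the binary name below are placeholders, and it assumes passwordless ssh to the slave nodes. The loop echoes each command as a dry run; remove the echo to actually execute it.

```shell
# Placeholder slave hostnames and binary name -- replace with the real ones.
NODES="node1 node2 node3"
BINARY="executable"

for node in $NODES; do
    # pkill -f matches the full command line; -9 forces the kill.
    # Dry run: echo the command instead of running it. Drop "echo" to execute.
    echo ssh "$node" "pkill -9 -f $BINARY"
done
```

Running `ps -ef | grep $BINARY` on each node first is a safer way to confirm which processes are actually stray before sending signal 9.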

Any suggestions are highly appreciated!

Thanks,
Michelle

[cli_2]: aborting job:
Fatal error in MPI_Alltoall: Other MPI error, error stack:
MPI_Alltoall(826).........................: MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740, rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
MPIR_Alltoall(593)........................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_4]: aborting job:
Fatal error in MPI_Alltoall: Other MPI error, error stack:
MPI_Alltoall(826).........................: MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740, rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
MPIR_Alltoall(593)........................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(415):
MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)
rank 4 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all ranks
  exit status of rank 4: return code 1
rank 0 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users