If you run `cexec 'ps -ef | grep mpi'` or something similar (I am not sure what the MPI processes look like on your cluster), you should get an idea of whether there are stalled processes on any of the nodes.
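As a rough sketch of what to look for in that ps output (the process name `executable`, the sample listing, and the PPID-of-1 heuristic are all assumptions here; adjust for your actual job):

```shell
# Sample 'ps -ef'-style output from one compute node (fabricated for
# illustration); columns: user, PID, PPID, command.
sample='root      1     0     init
michelle  4321  1     ./executable
michelle  4322  4321  ./executable'

# A stray MPI rank typically shows up re-parented to init (PPID 1) after
# the launcher that started it died.  Pick out those PIDs:
strays=$(echo "$sample" | awk '$3 == 1 && $4 ~ /executable/ {print $2}')
echo "stray PIDs: $strays"
```

Once you have identified strays, something like `cexec 'pkill -u $USER executable'` (or killing the PIDs individually on each node) should clear them before you resubmit.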
Are you starting the next iteration before you clean up the previous one? Depending on how careful MPICH2 and your program are, you could run out of resources very quickly and get some peculiar results if you aren't, I would think. I assume Torque/SGE (assuming you are using such things) aren't reporting anything strange going on?

On 10/29/06, Michelle Chu <[EMAIL PROTECTED]> wrote:
> Hi, there,
>
> I am now having a problem with an iterative Fortran 90 MPI job. I am using
> MPICH2, calling mpiexec -np 8 executable. I initially ran the job on 8
> nodes and it was working fine. After the job crashed several times after I
> changed some code, the MPI program now stops after the first iteration is
> finished with the following error message.
>
> I am suspicious about stray processes still running on the slave nodes.
> I mean, if the process on the master node was terminated, processes on the
> slave nodes might still be running. I guess that might cause the following
> problem, but I have no idea how to clean up those stray processes...
>
> Any suggestions are highly appreciated!
>
> Thanks,
> Michelle
>
> [cli_2]: aborting job:
> Fatal error in MPI_Alltoall: Other MPI error, error stack:
> MPI_Alltoall(826)........................:
> MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740,
> rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
> MPIR_Alltoall(593).......................:
> MPIC_Sendrecv(161).......................:
> MPIC_Wait(324)...........................:
> MPIDI_CH3_Progress_wait(217).............: an error
> occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(415):
> MPIDU_Socki_handle_read(670).............: connection
> failure (set=0,sock=1,errno=104:Connection reset by peer)
> [cli_4]: aborting job:
> Fatal error in MPI_Alltoall: Other MPI error, error stack:
> MPI_Alltoall(826)........................:
> MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740,
> rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
> MPIR_Alltoall(593).......................:
> MPIC_Sendrecv(161).......................:
> MPIC_Wait(324)...........................:
> MPIDI_CH3_Progress_wait(217).............: an error
> occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(415):
> MPIDU_Socki_handle_read(670).............: connection
> failure (set=0,sock=3,errno=104:Connection reset by peer)
> rank 4 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all
> ranks
> exit status of rank 4: return code 1
> rank 0 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all
> ranks
> exit status of rank 0: killed by signal 9
>
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job
> easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>
> _______________________________________________
> Oscar-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
