If you run `cexec 'ps -ef | grep mpi'` or something similar (I am not sure what the MPI processes look like on your cluster), you should get an idea of whether there are stalled processes on any of the nodes.
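As a rough sketch of what to look for in that ps output (the process name `executable`, the sample listing, and the PPID-of-1 heuristic are all assumptions here; adjust for your actual job):

```shell
# Sample 'ps -ef'-style output from one compute node (fabricated for
# illustration); columns: user, PID, PPID, command.
sample='root      1     0     init
michelle  4321  1     ./executable
michelle  4322  4321  ./executable'

# A stray MPI rank typically shows up re-parented to init (PPID 1) after
# the launcher that started it died.  Pick out those PIDs:
strays=$(echo "$sample" | awk '$3 == 1 && $4 ~ /executable/ {print $2}')
echo "stray PIDs: $strays"
```

Once you have identified strays, something like `cexec 'pkill -u $USER executable'` (or killing the PIDs individually on each node) should clear them before you resubmit.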
Are you starting the next iteration before you clean up the previous one? Depending on how careful MPICH2 and your program are, you could run out of resources very quickly and get some peculiar results if you aren't, I would think. I assume Torque/SGE (assuming you are using such things) aren't reporting anything strange going on?

On 10/29/06, Michelle Chu <[EMAIL PROTECTED]> wrote:
> Hi, there,
>
> I am now having a problem with an iterative Fortran 90 MPI job. I am using
> MPICH2, calling mpiexec -np 8 executable. I initially ran the job on 8
> nodes and it was working fine. After the job crashed several times after I
> changed some code, the MPI program now stops after the first iteration is
> finished with the following error message.
>
> I am suspicious about stray processes still running on the slave nodes.
> I mean, if the process on the master node was terminated, processes on the
> slave nodes might still be running. I guess that might cause the following
> problem, but I have no idea how to clean up those stray processes...
>
> Any suggestions are highly appreciated!
>
> Thanks,
> Michelle
>
> [cli_2]: aborting job:
> Fatal error in MPI_Alltoall: Other MPI error, error stack:
> MPI_Alltoall(826)........................:
> MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740,
> rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
> MPIR_Alltoall(593).......................:
> MPIC_Sendrecv(161).......................:
> MPIC_Wait(324)...........................:
> MPIDI_CH3_Progress_wait(217).............: an error
> occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(415):
> MPIDU_Socki_handle_read(670).............: connection
> failure (set=0,sock=1,errno=104:Connection reset by peer)
> [cli_4]: aborting job:
> Fatal error in MPI_Alltoall: Other MPI error, error stack:
> MPI_Alltoall(826)........................:
> MPI_Alltoall(sbuf=0x910640, scount=40000, MPI_REAL, rbuf=0xace740,
> rcount=40000, MPI_REAL, MPI_COMM_WORLD) failed
> MPIR_Alltoall(593).......................:
> MPIC_Sendrecv(161).......................:
> MPIC_Wait(324)...........................:
> MPIDI_CH3_Progress_wait(217).............: an error
> occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(415):
> MPIDU_Socki_handle_read(670).............: connection
> failure (set=0,sock=3,errno=104:Connection reset by peer)
> rank 4 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all
> ranks
> exit status of rank 4: return code 1
> rank 0 in job 75 xxx.cs.xxx.edu_56716 caused collective abort of all
> ranks
> exit status of rank 0: killed by signal 9
>
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job
> easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>
> _______________________________________________
> Oscar-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
