Signal 9 (SIGKILL) almost certainly means that some external entity killed your MPI job: for example, a resource manager decided that your processes had used too much time, CPU, or memory and terminated them. That is also consistent with what you describe: short jobs complete with no problem, but (presumably) longer jobs get killed with signal 9, as in the output below.
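One quick thing you can check yourself is which resource limits your ranks actually inherit at run time. Below is a minimal sketch (not a definitive diagnostic) that prints the soft CPU-time and address-space limits on each rank via getrlimit(2). Keep in mind that a batch scheduler can also kill jobs by policy, which no rlimit will reveal; the program name check_limits is just an example.

    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    /* Print one soft resource limit as seen by this rank. */
    static void print_limit(int rank, const char *name, int resource)
    {
        struct rlimit rl;
        if (getrlimit(resource, &rl) == 0) {
            if (rl.rlim_cur == RLIM_INFINITY)
                printf("rank %d: %s = unlimited\n", rank, name);
            else
                printf("rank %d: %s = %llu\n", rank, name,
                       (unsigned long long) rl.rlim_cur);
        }
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Limits that commonly matter for long-running jobs:
           CPU seconds and total address space. */
        print_limit(rank, "RLIMIT_CPU (seconds)", RLIMIT_CPU);
        print_limit(rank, "RLIMIT_AS (bytes)", RLIMIT_AS);
        MPI_Finalize();
        return 0;
    }

If you launch it the same way as the failing job (e.g., mpirun -np 6 ./check_limits), it inherits the same environment the real job sees, so the printed limits should match what your ranks run under.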
You might want to check with your system administrator and see if there are any resource limits on user-run applications.

On Jul 22, 2010, at 8:18 PM, Jack Bryan wrote:

> Dear All:
>
> I run a parallel job on 6 nodes of an OpenMPI cluster.
>
> But I got error:
>
> rank 0 in job 82 system.cluster_37948 caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
>
> It seems that there is segmentation fault on node 0.
>
> But, if the program is run for a short time, no problem.
>
> Any help is appreciated.
>
> thanks,
>
> Jack
>
> July 22 2010

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/