Signal 9 more than likely means that some external entity killed your MPI job 
(e.g., a resource manager decided that your process took too much time, CPU, 
memory, or some other resource and killed it).  That would also be consistent 
with what you describe below: short jobs complete with no problem, but 
(presumably) longer jobs get killed -- with signal 9.

You might want to check with your system administrator and see if there are any 
resource limits on user-run applications.
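If you want to see what the process itself is allowed, something like the sketch 
below might help (plain POSIX getrlimit(); whether your cluster actually enforces 
RLIMIT_CPU, or the resource manager kills jobs by some other policy, is just my 
guess).  It prints the CPU-time ceiling the process runs under; running it -- or 
"ulimit -a" -- from inside a batch script shows the limits as the scheduler sees 
them:

    #include <stdio.h>
    #include <sys/resource.h>   /* getrlimit() */

    int main(void)
    {
        struct rlimit lim;

        /* RLIMIT_CPU is the CPU-time ceiling in seconds: the kernel sends
           SIGXCPU at the soft limit and SIGKILL once the hard limit is
           exceeded.  A resource manager may also send SIGKILL directly. */
        if (getrlimit(RLIMIT_CPU, &lim) == 0) {
            if (lim.rlim_max == RLIM_INFINITY)
                printf("CPU time limit: unlimited\n");
            else
                printf("CPU time limit: %lu seconds (hard)\n",
                       (unsigned long) lim.rlim_max);
        }
        return 0;
    }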


On Jul 22, 2010, at 8:18 PM, Jack Bryan wrote:

> Dear All:
> 
> I run a parallel job on 6 nodes of an OpenMPI cluster. 
> 
> But I got error: 
> 
> rank 0 in job 82  system.cluster_37948   caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
> 
> It seems that there is a segmentation fault on node 0. 
> 
> But if the program runs only for a short time, there is no problem.
> 
> Any help is appreciated. 
> 
> thanks,
> 
> Jack
> 
> July 22  2010
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

