Hello,

I have MPICH2 tightly integrated with OGS 2011.11, and in general everything
works well. However, when I submit a moderate number of small MPI jobs
(e.g. 100 jobs, each using two cores), I get intermittent commlib errors
like:

commlib error: got select error (Broken pipe)
executing task of job 138060 failed: failed sending task to 
[email protected]: can't find connection

Sometimes I get "Connection reset by peer" instead of "Broken pipe". I have the
allocation rule set to round robin, so the second process is always spawned on
a remote host. The cluster is small, just four servers (72 cores) on Gigabit
Ethernet. The master spool is on NFS, while the local spool is on a local drive.
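
In case it helps anyone reproduce this, the submission pattern is roughly the
sketch below. The PE name "mpich2", the script name "mpi_job.sh", and the
helper build_qsub_commands are placeholders for illustration only, not the
actual names on my cluster. The sketch only prints the qsub command lines; it
does not submit anything.

```python
import shlex


def build_qsub_commands(n_jobs=100, pe_name="mpich2", slots=2,
                        script="mpi_job.sh"):
    """Return one qsub argument list per job; nothing is submitted here.

    pe_name and script are hypothetical placeholders.
    """
    return [
        # -pe requests the parallel environment with the given slot count,
        # -N sets a distinct job name so failures are easy to match up.
        ["qsub", "-pe", pe_name, str(slots), "-N", f"mpitest_{i}", script]
        for i in range(n_jobs)
    ]


if __name__ == "__main__":
    # Dry run: print the commands so they can be inspected or piped to a shell.
    for cmd in build_qsub_commands():
        print(shlex.join(cmd))
```

Submitting the jobs back to back like this, rather than one at a time, is what
seems to trigger the errors.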

Any advice on how to debug this would be greatly appreciated.

Thanks!
Brendan
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
