>> Hello,
>>
>> I have MPICH2 tightly
>
> Which version? It should work out-of-the-box with SGE.

Version is 1.4, and yes, it does have built-in integration.

>> integrated with OGS 2011.11. Everything is working great in general. I
>> have noticed when I submit a moderate number of small MPI jobs (e.g. 100
>> jobs each using two cores) that I will get intermittent commlib errors like:
>>
>> commlib error: got select error (Broken pipe)
>> executing task of job 138060 failed: failed sending task to
>> [email protected]: can't find connection
>
> This sounds like a network problem unrelated to SGE. Do you use a private
> network inside the cluster, or can you outline the network configuration - do
> you have a dedicated switch for the cluster?

Dedicated switch. One node is elsewhere on the LAN, but I see this error
come up between two nodes on the dedicated switch. None of the nodes show
packet errors.

>> Sometimes I get "Connection reset by peer"
>
> Which startup of slave tasks do you use, i.e.:
>
> $ qconf -sconf
> ...
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
>
> It sounds like an SSH problem given the output you mentioned above, and
> your settings could be different.

I am indeed using SSH with a wrapper script for adding the group ID:

qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/global/bin/rshd-wrapper

>> instead of "Broken pipe". I have the allocation rule set to round robin, so
>> the second process is always spawned on a remote host.
>
> For small jobs I would configure it to run on only one machine - unless they
> create large scratch files.

Yes, but I would like to have a single MPI parallel environment, and in
general round robin is the best option for my setup.

Thanks,
Brendan

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
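For reference, the allocation rule discussed above is set in the parallel
environment configuration, not in the cluster configuration shown by
`qconf -sconf`. A minimal sketch of inspecting and changing it (the PE name
"mpich" is an assumption; substitute your own PE name):

```shell
# Show the parallel environment configuration (PE name "mpich" is assumed):
qconf -sp mpich

# The relevant line in the output is the allocation rule, e.g.:
#   allocation_rule    $round_robin   # spread slots across hosts
# Other built-in values include:
#   allocation_rule    $fill_up       # fill one host before using the next
#   allocation_rule    $pe_slots      # keep all slots of a job on one host

# Open the PE in an editor to change the rule:
qconf -mp mpich
```

With `$pe_slots`, a two-slot job would never spawn its second process on a
remote host, which is why it was suggested for small jobs above - at the cost
of needing a separate PE if larger jobs should still span hosts.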
