>> Hello,
>>
>> I have MPICH2 tightly
>
>Which version? It should work out-of-the-box with SGE.

The version is 1.4, and yes, it does have built-in integration.


>> integrated with OGS 2011.11.  Everything is working great in general.  I 
>> have noticed when I submit a moderate number of small MPI jobs (e.g. 100 
>> jobs each using two cores) that I will get intermittent commlib errors like:
>> commlib error: got select error (Broken pipe)
>> executing task of job 138060 failed: failed sending task to 
>> [email protected]: can't find connection
>
>This sounds like a network problem unrelated to SGE. Do you use a private 
>network inside the cluster or can you outline the network configuration - do 
>you have a dedicated switch for the cluster?

Dedicated switch. One node is elsewhere on the LAN, but I see this error come 
up between two nodes on the dedicated switch. None of the nodes show packet 
errors. 
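(To check, I looked at the per-interface error counters; on Linux these can be read straight from sysfs with something like the loop below. Interface names are of course host-specific.)

```shell
# Print rx/tx error counters for every network interface (Linux sysfs)
for dev in /sys/class/net/*; do
    printf '%s rx_errors=%s tx_errors=%s\n' \
        "$(basename "$dev")" \
        "$(cat "$dev/statistics/rx_errors")" \
        "$(cat "$dev/statistics/tx_errors")"
done
```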

>> Sometimes I get "Connection reset by peer"
>
>Which startup of slave tasks do you use, i.e.:
>
>$ qconf -sconf
>...
>qlogin_command               builtin
>qlogin_daemon                builtin
>rlogin_command               builtin
>rlogin_daemon                builtin
>rsh_command                  builtin
>rsh_daemon                   builtin
>
>It sounds like an SSH problem, given the output you mention above, and your 
>settings could be different.

I am indeed using SSH with a wrapper script for adding the group ID:

qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/global/bin/rshd-wrapper


>> instead of "Broken pipe". I have the allocation rule set to round robin, so 
>> the second process is always spawned on a remote host.
>
>For small jobs I would configure it to run on only one machine - unless they 
>create large scratch files.

Yes, but I would like to keep a single MPI parallel environment, and in general 
round robin is the best option for my setup. 
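(For reference, this is the relevant PE setting; the PE name here is an assumption. An allocation rule of $pe_slots would instead keep each small job on a single host, which is the workaround suggested above:)

```
$ qconf -sp mpi
pe_name            mpi
slots              9999
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
(other fields omitted)
```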

Thanks,
Brendan
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
