Re: [gridengine users] Intermittent commlib errors with MPI jobs

Reuti Thu, 08 Nov 2012 01:42:23 -0800

On 08.11.2012 at 10:32, Brendan Moloney wrote:

>>> Hello,
>>> 
>>> I have MPICH2 tightly
>> 
>> Which version? It should work out-of-the-box with SGE.
> 
> Version is 1.4 and yes it does have built in integration.
> 
> 
>>> integrated with OGS 2011.11.  Everything is working great in general.  I 
>>> have noticed when I submit a moderate number of small MPI jobs (e.g. 100 
>>> jobs each using two cores) that I will get intermittent commlib errors like:
>>> commlib error: got select error (Broken pipe)
>>> executing task of job 138060 failed: failed sending task to 
>>> [email protected]: can't find connection
>> 
>> This sounds like a network problem unrelated to SGE. Do you use a private 
>> network inside the cluster or can you outline the network configuration - do 
>> you have a dedicated switch for the cluster?
> 
> Dedicated switch. One node is elsewhere on the LAN, but I see this error come 
> up between two nodes on the dedicated switch. None of the nodes show packet 
> errors. 
> 
>>> Sometimes I get "Connection reset by peer"


After a long time or instantly? There are some settings in SSH to avoid a 
timeout, in ssh_config or ~/.ssh/config respectively:

Host *
    Compression yes
    ServerAliveInterval 900



>> Which startup of slave tasks do you use, i.e.:
>> 
>> $ qconf -sconf
>> ...
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> 
>> It sounds like an SSH problem, given the output you mentioned above, and your 
>> settings could be different.
> 
> I am indeed using SSH with a wrapper script for adding the group ID:
> 
> qlogin_command               /usr/global/bin/qlogin-wrapper
> qlogin_daemon                /usr/global/bin/rshd-wrapper
> rlogin_command               /usr/bin/ssh
> rlogin_daemon                /usr/global/bin/rshd-wrapper
> rsh_command                  /usr/bin/ssh
> rsh_daemon                   /usr/global/bin/rshd-wrapper

It's also possible to set different methods for each of the three pairs. So, 
rsh_command/rsh_daemon could be set to builtin and the others left as they are. 
Would this be appropriate for your intended setup of X11 forwarding?
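
As a rough sketch of such a mixed setup, assuming the qlogin/rlogin entries 
stay exactly as in your quoted configuration and only the rsh pair changes: 
the helper name startup_methods below is made up for illustration, and it 
only filters a configuration dump read on stdin.

```shell
# Hypothetical helper (not part of SGE): print only the remote-startup
# entries from a configuration dump read on stdin.
startup_methods() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Illustration of the mixed setup: the rsh pair is switched to builtin,
# the qlogin and rlogin pairs keep the SSH wrapper from the quoted config.
startup_methods <<'EOF'
qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  builtin
rsh_daemon                   builtin
EOF
```

On a live cluster the same filter could be fed from `qconf -sconf`, and the 
entries themselves would be changed with `qconf -mconf`.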

-- Reuti


>>> instead of "Broken pipe". I have the allocation rule set to round robin, so 
>>> the second process is always spawned on a remote host.
>> 
>> For small jobs I would configure it to run on only one machine - unless they 
>> create large scratch files.
> 
> Yes, but I would like to have a single MPI parallel environment, and in 
> general round robin is the best option for my setup.
> 
> Thanks,
> Brendan


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
