On 08.11.2012 at 10:32, Brendan Moloney wrote:
>>> Hello,
>>>
>>> I have MPICH2 tightly
>>
>> Which version? It should work out-of-the-box with SGE.
>
> Version is 1.4, and yes, it does have built-in integration.
>
>
>>> integrated with OGS 2011.11. Everything is working great in general. I
>>> have noticed when I submit a moderate number of small MPI jobs (e.g. 100
>>> jobs each using two cores) that I will get intermittent commlib errors like:
>>> commlib error: got select error (Broken pipe)
>>> executing task of job 138060 failed: failed sending task to
>>> [email protected]: can't find connection
>>
>> This sounds like a network problem unrelated to SGE. Do you use a private
>> network inside the cluster or can you outline the network configuration - do
>> you have a dedicated switch for the cluster?
>
> Dedicated switch. One node is elsewhere on the LAN, but I see this error come
> up between two nodes on the dedicated switch. None of the nodes show packet
> errors.
>
>>> Sometimes I get "Connection reset by peer"
After a long time or instantly? There are some settings in ssh to avoid a
timeout, either in ssh_config or in ~/.ssh/config:

Host *
    Compression yes
    ServerAliveInterval 900
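
To check whether a keepalive is already configured somewhere, something like the following could be used. It is only a sketch: the function name alive_interval is made up for illustration, it assumes the whitespace-separated "Keyword value" form, and it just scans the text for ServerAliveInterval lines without evaluating Host or Match blocks the way ssh itself does.

```shell
# Hypothetical helper (not part of OpenSSH or SGE): print every
# ServerAliveInterval value found in ssh client config text on stdin.
# ssh keywords are case-insensitive, hence the tolower().
alive_interval() {
    awk 'tolower($1) == "serveraliveinterval" { print $2 }'
}

# Demo on the snippet above; for a real check, feed it ~/.ssh/config instead.
alive_interval <<'EOF'
Host *
    Compression yes
    ServerAliveInterval 900
EOF
```

No output would mean that no keepalive interval is set in the file that was scanned.
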
>> Which startup of slave tasks do you use, i.e.:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command builtin
>> qlogin_daemon builtin
>> rlogin_command builtin
>> rlogin_daemon builtin
>> rsh_command builtin
>> rsh_daemon builtin
>>
>> It sounds like an SSH problem, given the output you mentioned above, and your
>> settings could be different.
>
> I am indeed using SSH with a wrapper script for adding the group ID:
>
> qlogin_command /usr/global/bin/qlogin-wrapper
> qlogin_daemon /usr/global/bin/rshd-wrapper
> rlogin_command /usr/bin/ssh
> rlogin_daemon /usr/global/bin/rshd-wrapper
> rsh_command /usr/bin/ssh
> rsh_daemon /usr/global/bin/rshd-wrapper
It's also possible to set different methods for each of the three pairs. So,
rsh_command/rsh_daemon could be set to builtin and the others left as they are.
Would this be appropriate for your intended setup of X11 forwarding?
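
As a rough illustration of such a mixed setup, the sketch below lists the three pairs from `qconf -sconf`-style output. The function name startup_methods is hypothetical, and the sample text is simply your configuration with the rsh pair switched to builtin. To actually change the settings you would edit the global configuration with `qconf -mconf`.

```shell
# Hypothetical helper: from `qconf -sconf`-style text on stdin, print the
# six startup settings as name=value, one per line.
startup_methods() {
    awk '$1 ~ /^(qlogin|rlogin|rsh)_(command|daemon)$/ {
        name = $1
        $1 = ""              # drop the name, keep the full value (may contain arguments)
        sub(/^[ \t]+/, "")   # trim the separator left behind
        print name "=" $0
    }'
}

# Sample input: the configuration quoted above with only the rsh pair set to builtin.
startup_methods <<'EOF'
qlogin_command /usr/global/bin/qlogin-wrapper
qlogin_daemon /usr/global/bin/rshd-wrapper
rlogin_command /usr/bin/ssh
rlogin_daemon /usr/global/bin/rshd-wrapper
rsh_command builtin
rsh_daemon builtin
EOF
```

On a live system the same function could read from `qconf -sconf` directly, which makes it easy to compare the global settings against what you expect.
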
-- Reuti
>>> instead of "Broken pipe". I have the allocation rule set to round robin, so
>>> the second process is always spawned on a remote host.
>>
>> For small jobs I would configure it to run on only one machine - unless they
>> create large scratch files.
>
> Yes, but I would like to have a single MPI parallel environment, and in
> general round robin is the best option for my setup.
>
> Thanks,
> Brendan