I suppose it could be the switch.  Is the only way to test this to swap it out 
for a different switch?

Thanks again,
Brendan
________________________________________
From: Reuti [[email protected]]
Sent: Monday, November 12, 2012 4:17 AM
To: Brendan Moloney
Cc: [email protected]
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

On 10.11.2012 at 00:31, Brendan Moloney wrote:

> I spent some time researching this issue in the context of OpenSSH and found
> some mentions of similar problems caused by the initial handshake packet being
> too large
> (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).
> I was dubious that this was my problem, but after manually specifying the
> cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the
> number of submissions I have done now, I would expect to have seen the issue
> several times, so I am fairly sure it is fixed. I will keep an eye on it, of
> course.
>
>>>>> Sometimes I get "Connection reset by peer"
>>
>> After a long time or instantly? There are some settings to avoid a
>> timeout, in ssh_config or ~/.ssh/config:
>>
>> Host *
>>   Compression yes
>>   ServerAliveInterval 900
>
> Seems to happen fast enough that it is not a timeout issue.
>
>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>
>>> qlogin_command               /usr/global/bin/qlogin-wrapper
>>> qlogin_daemon                /usr/global/bin/rshd-wrapper
>>> rlogin_command               /usr/bin/ssh
>>> rlogin_daemon                /usr/global/bin/rshd-wrapper
>>> rsh_command                  /usr/bin/ssh
>>> rsh_daemon                   /usr/global/bin/rshd-wrapper
>
>> It's also possible to set different methods for each of the three pairs. So, 
>> rsh_command/rsh_daemon could be set to builtin and the others left as they 
>> are. Would this be appropriate for your intended setup of X11 forwarding?
>
> So using the builtin option would still allow enforcement of memory/time 
> limits on parallel jobs?

The ones set by SGE - yes.
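
As a rough sketch only, assuming the wrapper paths quoted above stay unchanged
for the qlogin and rlogin pairs, the mixed setup would look something like
this in the cluster configuration (edited with "qconf -mconf"):

    qlogin_command               /usr/global/bin/qlogin-wrapper
    qlogin_daemon                /usr/global/bin/rshd-wrapper
    rlogin_command               /usr/bin/ssh
    rlogin_daemon                /usr/global/bin/rshd-wrapper
    rsh_command                  builtin
    rsh_daemon                   builtin

Only the rsh pair changes here; whether this fits depends on how the X11
forwarding is meant to work in your setup.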

Back to the original problem: could it be a problem in the switch?

-- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
