OK, I will test that out once I can schedule some downtime. I might even be able to get my hands on another switch by then.
I appreciate all the help.
________________________________________
From: Reuti [[email protected]]
Sent: Tuesday, November 13, 2012 3:33 AM
To: Brendan Moloney
Cc: [email protected]
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

On 12.11.2012 at 22:03, Brendan Moloney wrote:

> I suppose it could be the switch. Is the only way to test this to swap it
> out for a different switch?

Are all ports used on the switch? Change the used ports.

-- Reuti

> Thanks again,
> Brendan
> ________________________________________
> From: Reuti [[email protected]]
> Sent: Monday, November 12, 2012 4:17 AM
> To: Brendan Moloney
> Cc: [email protected]
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>
> On 10.11.2012 at 00:31, Brendan Moloney wrote:
>
>> I spent some time researching this issue in the context of OpenSSH and found
>> some mentions of similar problems due to the initial handshake packet being
>> too large
>> (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).
>> I was dubious that this was my problem, but after manually specifying the
>> cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the
>> number of submissions I have done now I would expect to have seen the issue
>> several times, so I am fairly sure it is fixed. Will keep an eye on it of
>> course.
>>
>>>>>> Sometimes I get "Connection reset by peer"
>>>
>>> After a long time or instantly? There are some settings in ssh to avoid a
>>> timeout, in ssh_config or ~/.ssh/config:
>>>
>>> Host *
>>> Compression yes
>>> ServerAliveInterval 900
>>
>> Seems to happen fast enough that it is not a timeout issue.
>>
>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>
>>>> qlogin_command /usr/global/bin/qlogin-wrapper
>>>> qlogin_daemon /usr/global/bin/rshd-wrapper
>>>> rlogin_command /usr/bin/ssh
>>>> rlogin_daemon /usr/global/bin/rshd-wrapper
>>>> rsh_command /usr/bin/ssh
>>>> rsh_daemon /usr/global/bin/rshd-wrapper
>>
>>> It's also possible to set different methods for each of the three pairs.
>>> So, rsh_command/rsh_daemon could be set to builtin and the others left as
>>> they are. Would this be appropriate for your intended setup of X11
>>> forwarding?
>>
>> So using the builtin option would still allow enforcement of memory/time
>> limits on parallel jobs?
>
> The ones set by SGE - yes.
>
> To the original problem: can it be a problem in the switch?
>
> -- Reuti
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

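For reference, the '-c aes256-ctr' workaround mentioned in the thread can also be set in the client configuration instead of on the command line. This is only a sketch under assumptions the thread does not confirm: that the ssh client started for the job actually reads this configuration file, and that restricting the client to a single cipher is acceptable on the cluster. The file location shown is the usual OpenSSH default, not something taken from the thread.

```
# ~/.ssh/config (per user) or the system-wide ssh_config
Host *
    # Equivalent to passing '-c aes256-ctr' on every ssh call:
    # the client offers only this cipher, which should shrink the
    # initial handshake proposal that the linked report blames.
    Ciphers aes256-ctr
```

If only the compute nodes are affected, the "Host *" pattern could be narrowed to those host names so that other connections keep the default cipher list.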