I hate to reply to myself, but I would really appreciate some guidance on how to debug this.
To recap:

- Spawning large numbers of MPI jobs with the "round_robin" allocation rule results in intermittent commlib errors ("failed sending task to execd at node1.ohsu.edu: can't find connection").
- No errors reported by ifconfig or the switch. No TCP/UDP checksum errors are reported when the issue occurs. The error occurs between different pairs of execution hosts, and it occurs when using different ports on the switch.
- Increasing network bandwidth (bonded Ethernet) decreases the occurrence of these errors (and/or increases the number of simultaneous MPI jobs required to generate the error).

For now I work around this by specifying a new parallel environment, "mpi-fill", that uses the "fill_up" allocation rule, and then I use this PE for small MPI jobs. However, I would still like to know the root cause of this issue.

Thanks,
Brendan

________________________________________
From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of Brendan Moloney [molo...@ohsu.edu]
Sent: Tuesday, December 11, 2012 2:08 AM
To: Reuti
Cc: users@gridengine.org
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

Hello again,

I got a chance to run some more tests. I can recreate the problem with different ports on the switch, and I can recreate it between different pairs of nodes. I also used tcpdump to look for bad checksums (while recreating the commlib error) and got nothing. Is it still possible this is a hardware issue? I haven't noticed any other network stability issues (e.g. during the communication between MPI processes).

I also installed a new server (so more load on the network) and was able to recreate the problem more often. I then set up Ethernet bonding to increase the bandwidth from the NFS file server (where both the data being processed and the master spool are located), and the problem started to occur much less often.

Any further help is greatly appreciated.
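For readers following along in the archives: a "fill_up" parallel environment like the "mpi-fill" workaround above can be sketched roughly as below. This is an assumption about the poster's setup, not his actual config; the slot count and start/stop args are placeholders, and control_slaves/job_is_first_task are set as they typically are for tightly integrated MPI. It would be created with `qconf -ap mpi-fill` and then added to the `pe_list` of the relevant queue.

```
pe_name            mpi-fill
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```

With `$fill_up`, slots are packed onto as few hosts as possible, so small MPI jobs generate far fewer inter-host qrsh connections than under `$round_robin` — consistent with the workaround reducing the commlib errors.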
Thanks,
Brendan

________________________________________
From: Reuti [re...@staff.uni-marburg.de]
Sent: Wednesday, November 14, 2012 10:02 AM
To: Brendan Moloney
Cc: users@gridengine.org
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

On 14.11.2012 at 00:56, Brendan Moloney wrote:

> OK, I will test that out once I can schedule some downtime. I might even be able to get my hands on another switch by then.

Depending on your NFS setup you can also change this on-the-fly.

-- Reuti

> I appreciate all the help.
> ________________________________________
> From: Reuti [re...@staff.uni-marburg.de]
> Sent: Tuesday, November 13, 2012 3:33 AM
> To: Brendan Moloney
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>
> On 12.11.2012 at 22:03, Brendan Moloney wrote:
>
>> I suppose it could be the switch. Is the only way to test this to swap it out for a different switch?
>
> Are all ports used on the switch? Change the used ports.
>
> -- Reuti
>
>> Thanks again,
>> Brendan
>> ________________________________________
>> From: Reuti [re...@staff.uni-marburg.de]
>> Sent: Monday, November 12, 2012 4:17 AM
>> To: Brendan Moloney
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>>
>> On 10.11.2012 at 00:31, Brendan Moloney wrote:
>>
>>> I spent some time researching this issue in the context of OpenSSH and found some mentions of similar problems due to the initial handshake packet being too large (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer). I was dubious that this was my problem, but after manually specifying the cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the number of submissions I have done now, I would expect to have seen the issue several times, so I am fairly sure it is fixed. I will keep an eye on it, of course.
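As a side note for the archives: the '-c aes256-ctr' workaround mentioned above can also be made persistent in ~/.ssh/config instead of being passed on every invocation. A minimal sketch, assuming the execution hosts share a "node*" naming pattern (that pattern is an assumption, not from the thread):

```
# ~/.ssh/config -- pin a single cipher for intra-cluster connections
# (the Host pattern is a placeholder; match your own execution hosts)
Host node*
    Ciphers aes256-ctr
```

Pinning `Ciphers` keeps the client's offered-algorithm list short, which shrinks the initial key-exchange packet that the linked serverfault thread suspects of triggering the resets.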
>>>
>>>>>>> Sometimes I get "Connection reset by peer"
>>>>
>>>> After a long time or instantly? There are some settings in ssh to avoid a timeout, in ssh_config resp. ~/.ssh/config:
>>>>
>>>> Host *
>>>>     Compression yes
>>>>     ServerAliveInterval 900
>>>
>>> Seems to happen fast enough that it is not a timeout issue.
>>>
>>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>>
>>>>> qlogin_command    /usr/global/bin/qlogin-wrapper
>>>>> qlogin_daemon     /usr/global/bin/rshd-wrapper
>>>>> rlogin_command    /usr/bin/ssh
>>>>> rlogin_daemon     /usr/global/bin/rshd-wrapper
>>>>> rsh_command       /usr/bin/ssh
>>>>> rsh_daemon        /usr/global/bin/rshd-wrapper
>>>
>>>> It's also possible to set different methods for each of the three pairs. So, rsh_command/rsh_daemon could be set to builtin and the others left as they are. Would this be appropriate for your intended setup of X11 forwarding?
>>>
>>> So using the builtin option would still allow enforcement of memory/time limits on parallel jobs?
>>
>> The ones set by SGE - yes.
>>
>> To the original problem: can it be a problem in the switch?
>>
>> -- Reuti

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
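For reference, the mixed setup Reuti suggests (builtin startup for the rsh pair used by tightly integrated parallel jobs, SSH wrappers kept for the interactive qlogin/rlogin pairs, preserving X11 forwarding there) would look roughly like this in the cluster configuration edited via `qconf -mconf`. This is a sketch of the suggestion, not a tested configuration from the thread:

```
qlogin_command    /usr/global/bin/qlogin-wrapper
qlogin_daemon     /usr/global/bin/rshd-wrapper
rlogin_command    /usr/bin/ssh
rlogin_daemon     /usr/global/bin/rshd-wrapper
rsh_command       builtin
rsh_daemon        builtin
```

With `builtin`, task startup on slave hosts goes through sge_execd's internal mechanism rather than SSH, so SGE's memory/time limits still apply to the slave tasks, as confirmed later in the thread.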