I hate to reply to myself, but would really appreciate some guidance on how to 
debug this.

To recap:

- Spawning large numbers of MPI jobs with the "round_robin" allocation rule 
results in intermittent commlib errors ("failed sending task to execd at 
node1.ohsu.edu: can't find connection"). 

- No errors reported by ifconfig or the switch. No TCP/UDP checksum errors are 
reported when the issue occurs. The error occurs between different pairs of 
execution hosts and it occurs when using different ports on the switch.

- Increasing network bandwidth (bonded ethernet) decreases the occurrence of 
these errors (and/or increases the number of simultaneous MPI jobs required to 
generate the error).


For now I work around this by specifying a new parallel environment "mpi-fill" 
that uses the "fill_up" allocation rule, and then I use this PE for small MPI 
jobs. However, I would still like to know the root cause of this issue.
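For reference, the workaround PE looks roughly like this (the name "mpi-fill" and the slot count are from my setup; everything else is the stock template with only allocation_rule changed):

```shell
# Sketch of the workaround PE; load with: qconf -Ap mpi-fill.conf
# (slot count is site-specific, start/stop args assume a tight integration)
pe_name            mpi-fill
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```

Jobs then request it with "qsub -pe mpi-fill N ...", so small jobs pack onto as few hosts as possible instead of being spread round-robin.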

Thanks,
Brendan
________________________________________
From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of 
Brendan Moloney [molo...@ohsu.edu]
Sent: Tuesday, December 11, 2012 2:08 AM
To: Reuti
Cc: users@gridengine.org
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

Hello again,

I got a chance to run some more tests. I can recreate the problem with 
different ports on the switch, and I can recreate it between different pairs of 
nodes.  I also used tcpdump to look for bad checksums (while recreating the 
commlib error) and got nothing. Is it still possible this is a hardware issue? 
I haven't noticed any other network stability issues (e.g. during the 
communication between MPI processes).
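In case it helps anyone reproduce this, the checksum check was along these lines (the interface name "eth0" is just an example; with -vv tcpdump verifies and flags bad TCP/UDP checksums):

```shell
# Watch for bad checksums while recreating the commlib error
# (run as root on an execution host; interface name is an assumption)
tcpdump -i eth0 -vv 'tcp or udp' 2>/dev/null | grep -i 'incorrect'

# Also check the interface-level error/drop counters
ip -s link show eth0
```

Note that checksum offloading on the NIC can make locally generated packets look "incorrect" in tcpdump; I only looked at received traffic for that reason.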

I also installed a new server (so more load on the network) and was able to 
recreate the problem more often.  I then set up Ethernet bonding to increase 
the bandwidth from the NFS file server (where both the data being processed and 
the master spool are located) and the problem started to occur much less often.
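For completeness, the bonding setup on the file server was along these lines (RHEL-style ifcfg sketch; the interface names, mode, and miimon value here are illustrative, not necessarily exactly what I used):

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative values):
#   DEVICE=bond0
#   BONDING_OPTS="mode=balance-alb miimon=100"
#   IPADDR=10.0.0.10
#   NETMASK=255.255.255.0
#   ONBOOT=yes
# and for each slave NIC (ifcfg-eth0, ifcfg-eth1):
#   DEVICE=eth0
#   MASTER=bond0
#   SLAVE=yes
#   ONBOOT=yes

# Verify the bond came up and both slaves are active:
cat /proc/net/bonding/bond0
```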

Any further help is greatly appreciated.

Thanks,
Brendan
________________________________________
From: Reuti [re...@staff.uni-marburg.de]
Sent: Wednesday, November 14, 2012 10:02 AM
To: Brendan Moloney
Cc: users@gridengine.org
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

Am 14.11.2012 um 00:56 schrieb Brendan Moloney:

> Ok I will test that out once I can schedule some down time.  I might even be 
> able to get my hands on another switch by then.

Depending on your NFS setup you can also change this on-the-fly.

-- Reuti


> I appreciate all the help.
> ________________________________________
> From: Reuti [re...@staff.uni-marburg.de]
> Sent: Tuesday, November 13, 2012 3:33 AM
> To: Brendan Moloney
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>
> Am 12.11.2012 um 22:03 schrieb Brendan Moloney:
>
>> I suppose it could be the switch.  Is the only way to test this to swap it 
>> out for a different switch?
>
> Are all ports used on the switch? Change the used ports.
>
> -- Reuti
>
>
>> Thanks again,
>> Brendan
>> ________________________________________
>> From: Reuti [re...@staff.uni-marburg.de]
>> Sent: Monday, November 12, 2012 4:17 AM
>> To: Brendan Moloney
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>>
>> Am 10.11.2012 um 00:31 schrieb Brendan Moloney:
>>
>>> I spent some time researching this issue in the context of OpenSSH and 
>>> found some mentions of similar problems due to the initial handshake 
>>> package being too large 
>>> (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).
>>>   I was dubious that this was my problem but after manually specifying the 
>>> cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the 
>>> number of submissions I have done now I would expect to have seen the issue 
>>> several times, so I am fairly sure it is fixed.  Will keep an eye on it of 
>>> course.
>>>
>>>>>>> Sometimes I get "Connection reset by peer"
>>>>
>>>> After a long time or instantly? There are some settings in ssh to avoid a 
>>>> timeout, in ssh_config or ~/.ssh/config:
>>>>
>>>> Host *
>>>> Compression yes
>>>> ServerAliveInterval 900
>>>
>>> Seems to happen fast enough that it is not a timeout issue.
>>>
>>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>>
>>>>> qlogin_command               /usr/global/bin/qlogin-wrapper
>>>>> qlogin_daemon                /usr/global/bin/rshd-wrapper
>>>>> rlogin_command               /usr/bin/ssh
>>>>> rlogin_daemon                /usr/global/bin/rshd-wrapper
>>>>> rsh_command                  /usr/bin/ssh
>>>>> rsh_daemon                   /usr/global/bin/rshd-wrapper
>>>
>>>> It's also possible to set different methods for each of the three pairs. 
>>>> So, rsh_command/rsh_daemon could be set to builtin and the others left as 
>>>> they are. Would this be appropriate for your intended setup of X11 
>>>> forwarding?
>>>
>>> So using the builtin option would still allow enforcement of memory/time 
>>> limits on parallel jobs?
>>
>> The ones set by SGE - yes.
>>
>> To the original problem: can it be a problem in the switch?
>>
>> -- Reuti
>>
>
>


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
