Based on http://stackoverflow.com/questions/11576025/how-can-i-limit-the-rate-of-new-outgoing-ssh-connections-when-using-gnu-parallel and on my own needs, I am trying to figure out the best way to start more than 10 simultaneous jobs on remote servers. At my work we have servers with 48 cores, and we would really like to use all of them.
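For concreteness, this is the kind of invocation I have in mind (the host name 'server' and the command 'do_work' are placeholders):

```
# Run up to 48 simultaneous jobs on the 48-core host 'server'.
# GNU Parallel opens one ssh per job slot, so starting up fires
# dozens of ssh logins in quick succession.
parallel -j48 -S server do_work {} ::: task_*.dat
```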
The problem is MaxStartups in OpenSSH, which defaults to 10:30:60 and which we cannot assume the users of GNU Parallel are able to change. With that setting, sshd starts randomly dropping new connections (with 30% probability) once 10 sessions are unauthenticated, and drops all of them at 60. Authentication is part of the login, so you can easily have more than 10 sessions running; but if more than 10 sessions are starting at the same time, some of them will be dropped.

The basic idea for solving this is to insert a delay between each ssh to the same host. Computing the size of that delay is where the black magic comes in. The time to log in depends on network latency and the speed of the remote host, and we can measure the combination of the two by timing a single 'ssh server true' to the remote host. A normal 'ssh server true' takes 0.1 sec for me. Using `tc qdisc add dev eth0 root netem delay 500ms` I can emulate a connection with 500 ms latency; the login then takes 8.2 sec.

Since sshd allows 10 unauthenticated sessions at a time, we can in theory start a new ssh every login_time/10 seconds: every 0.01 sec for the normal server and every 0.82 sec for the 500 ms delayed one, and still stay within the limit. But if anyone else logged in at the same time, one of the connections could fail. So to play it safe(r) I would prefer we only use 50% of the estimated capacity: at most one new ssh every 0.02 sec (normal) and every 1.64 sec (500 ms delayed). Jobs that run for only a few seconds will feel this, but it will make no difference for longer running jobs. A sketch of the measurement and the delay computation follows below.
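Here is a minimal sketch of that measurement and computation, assuming the default MaxStartups of 10 unauthenticated sessions and a placeholder host name 'server':

```
#!/bin/bash
# Sketch: estimate a safe delay between new ssh connections to one host.
# Assumes sshd's default MaxStartups (10 unauthenticated sessions);
# 'server' is a placeholder host name.

server=server

# Time a single no-op login; this captures network latency plus the
# speed of the remote host in one number.
start=$(date +%s.%N)
ssh "$server" true
end=$(date +%s.%N)
login_time=$(echo "$end - $start" | bc -l)

# With 10 unauthenticated slots, full capacity is one new ssh every
# login_time/10 seconds. Use only 50% of that to leave room for other
# users: delay = login_time/10 * 2 = login_time/5.
delay=$(echo "$login_time / 5" | bc -l)
echo "login took ${login_time}s; suggested delay between sshs: ${delay}s"
```

The resulting value could then be fed to a rate limiter, e.g. GNU Parallel's --delay option (`parallel --delay $delay -j48 -S server ...`); note that --delay throttles all job starts, not only sshs to the same host, so it only approximates the per-host delay described above.

/Ole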
