Based on http://stackoverflow.com/questions/11576025/how-can-i-limit-the-rate-of-new-outgoing-ssh-connections-when-using-gnu-parallel and on my own needs, I am trying to figure out the best way to start more than 10 simultaneous jobs on remote servers. At my work we have servers with 48 cores, and we would really like to use all of them.
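For concreteness, this is the kind of invocation I have in mind (the host name 'server' and the command 'do_work' are placeholders):

```
# Run up to 48 simultaneous jobs on the 48-core host 'server'.
# GNU Parallel opens one ssh per job slot, so starting up fires
# dozens of ssh logins in quick succession.
parallel -j48 -S server do_work {} ::: task_*.dat
```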
The problem is MaxStartups in OpenSSH, which defaults to 10:30:60 and which we cannot assume the users of GNU Parallel are able to change. With that setting, sshd starts randomly dropping new connections (with 30% probability) once 10 sessions are unauthenticated, and drops all of them at 60. Authentication is part of the login, so you can easily have more than 10 sessions running; but if more than 10 sessions are starting at the same time, some of them will be dropped.

The basic idea for solving this is to insert a delay between each ssh to the same host. Computing the size of that delay is where the black magic comes in. The time to log in depends on network latency and the speed of the remote host, and we can measure the combination of the two by timing a single 'ssh server true' to the remote host. A normal 'ssh server true' takes 0.1 sec for me. Using `tc qdisc add dev eth0 root netem delay 500ms` I can emulate a connection with 500 ms latency; the login then takes 8.2 sec.

Since sshd allows 10 unauthenticated sessions at a time, we can in theory start a new ssh every login_time/10 seconds: every 0.01 sec for the normal server and every 0.82 sec for the 500 ms delayed one, and still stay within the limit. But if anyone else logged in at the same time, one of the connections could fail. So to play it safe(r) I would prefer we only use 50% of the estimated capacity: at most one new ssh every 0.02 sec (normal) and every 1.64 sec (500 ms delayed). Jobs that run for only a few seconds will feel this, but it will make no difference for longer running jobs. A sketch of the measurement and the delay computation follows below.
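Here is a minimal sketch of that measurement and computation, assuming the default MaxStartups of 10 unauthenticated sessions and a placeholder host name 'server':

```
#!/bin/bash
# Sketch: estimate a safe delay between new ssh connections to one host.
# Assumes sshd's default MaxStartups (10 unauthenticated sessions);
# 'server' is a placeholder host name.

server=server

# Time a single no-op login; this captures network latency plus the
# speed of the remote host in one number.
start=$(date +%s.%N)
ssh "$server" true
end=$(date +%s.%N)
login_time=$(echo "$end - $start" | bc -l)

# With 10 unauthenticated slots, full capacity is one new ssh every
# login_time/10 seconds. Use only 50% of that to leave room for other
# users: delay = login_time/10 * 2 = login_time/5.
delay=$(echo "$login_time / 5" | bc -l)
echo "login took ${login_time}s; suggested delay between sshs: ${delay}s"
```

The resulting value could then be fed to a rate limiter, e.g. GNU Parallel's --delay option (`parallel --delay $delay -j48 -S server ...`); note that --delay throttles all job starts, not only sshs to the same host, so it only approximates the per-host delay described above.

/Ole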
