After some further fiddling, I should redefine the problem. The trouble isn't connecting to a large number of hosts simultaneously; it's doing so over a forwarded ssh connection.
So I'm doing this:

    ssh-agent (localmachine) -> jump server -> 1300 hosts

The 1300 hosts have to contact my ssh-agent by going back through the jump server, then back to my local machine.

It just so happens that this site has 2 jump servers. So I ran this test:

    ssh-agent (jump1) -> 1300 hosts

This works perfectly. I can contact all 1300 hosts with no problem when running the agent on this machine. I had theorized that this was because the latency was so low that connections got in and out before any limits were reached. However, I then tested this setup:

    ssh-agent (jump1) -> jump2 (same DC/same rack) -> 1300 hosts

And this fails in the same manner as it does when running the agent on my local machine.

I also noticed something interesting in netstat. Typical connections look like this (on jump1):

    unix  3  [ ]  STREAM  CONNECTED   8379279  18988/ssh
    unix  3  [ ]  STREAM  CONNECTED   8379536  18961/ssh-agent  /tmp/ssh-Ahvjg18960/agent.18960
    unix  3  [ ]  STREAM  CONNECTED   8379277  18988/ssh
    unix  3  [ ]  STREAM  CONNECTED   8379534  18961/ssh-agent  /tmp/ssh-Ahvjg18960/agent.18960

So there is a pair of connections (I'm assuming one connection to the agent, and one through ssh through the forwarded connection). Indeed, 18988 is the pid of my ssh connection to jump2, and 18961 is the pid of the agent. But I see lots of this sort of thing as well:

    unix  3  [ ]  STREAM  CONNECTED   8379337  18988/ssh
    unix  4  [ ]  STREAM  CONNECTING  0        -                /tmp/ssh-Ahvjg18960/agent.18960
    unix  3  [ ]  STREAM  CONNECTED   8379335  18988/ssh
    unix  4  [ ]  STREAM  CONNECTING  0        -                /tmp/ssh-Ahvjg18960/agent.18960

> Is the ssh-agent running as a user, or as root? Can you verify that the
> user's limits aren't getting in the way (ulimit -a). You've confirmed
> with /proc/sys/fs/file-nr that you're not running into limits there?

I'm running as a user, but I have tried it as root. I've also set as many ulimit parameters to unlimited as possible, and raised the open file limits to very large numbers. Nothing in /proc/sys/fs/file-nr suggests I'm running into limits there.
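To put numbers on the CONNECTED versus CONNECTING entries in the netstat output above, a rough sketch like the following could tally them per state for one agent socket. The function name `tally_agent_sockets` is made up for illustration, and it assumes the column layout shown above (state as a whitespace-separated field, socket path somewhere on the line); other netstat versions may print things differently.

```python
from collections import Counter


def tally_agent_sockets(netstat_text, agent_path):
    """Count unix-socket lines that mention agent_path, grouped by state.

    Hypothetical helper: assumes `netstat -xp` style lines as shown in
    this thread, not any particular netstat version's exact format.
    """
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Only look at unix-domain lines that reference the agent socket path.
        if not fields or fields[0] != "unix" or agent_path not in line:
            continue
        for state in ("CONNECTED", "CONNECTING", "LISTENING"):
            if state in fields:
                states[state] += 1
                break
    return states


if __name__ == "__main__":
    # Sample lines copied from the netstat output on jump1.
    sample = """\
unix 3 [ ] STREAM CONNECTED 8379279 18988/ssh
unix 3 [ ] STREAM CONNECTED 8379536 18961/ssh-agent /tmp/ssh-Ahvjg18960/agent.18960
unix 4 [ ] STREAM CONNECTING 0 - /tmp/ssh-Ahvjg18960/agent.18960
"""
    print(tally_agent_sockets(sample, "/tmp/ssh-Ahvjg18960/agent.18960"))
```

Feeding it the output of `netstat -xp` once a second during a run would show whether the CONNECTING count keeps growing while the CONNECTED count stalls.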
> > Another clue to the puzzle. I have 1300 or so machines in a DC in Hong
> > Kong, only available through a jump server in the same DC. If I'm running
> > my agent on my local machine, through the jump server, and connect to all
> > the machines, connections time out, agent locks up, etc. However, if I copy
> > my keys to the jump box, and run the agent from there, no connections fail,
> > and all connections complete very quickly. I assume that this is because
> > connections open and close quickly enough that whatever limit I'm hitting
> > isn't reached (netstat snapshots every second show around 200 max concurrent
> > connections).
>
> Aha. That does sound like it may be helpful information.
>
> When connecting through the jump server, does it create these hundreds
> of simultaneous connections from your host, or a single one to the jump
> server which then fans out the connections?

Maybe I can answer that with some data. Below are the number of ssh connections and the number of ssh-agent connections in netstat on both machines. The ssh connections to the 1300 hosts are fanned out from the jump server (jump2 in this case), but all the authentication requests get forwarded back to the machine with the agent.

    [jump1 ~]# netstat -xp | grep -c ssh\ 
    590
    [jump1 ~]# netstat -xp | grep -c ssh.*ag
    542

    [jump2 ~]# netstat -xp | grep -c ssh\ 
    676
    [jump2 ~]# netstat -xp | grep -c ssh.*ag
    825

> I would also verify that entropy is still available on the jump server
> and make sure that the jump server has appropriate settings in
> /etc/ssh/sshd_config for AllowAgentForwarding, MaxSessions, and
> MaxStartups (see the manpage for sshd_config).

AllowAgentForwarding should be fine (I'm using ssh -A in any case), but I'll fiddle with some of the other sshd settings; there may be something there. With default limits of 10, though, it seems strange that I'd be getting over 600 successful connections in a run. I'll see what happens.

Thanks for the ideas!
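For checking the sshd settings mentioned above on each jump server, something along these lines might do as a first pass. It is only a sketch: `read_sshd_settings` is a made-up name, it handles only the plain `Keyword value` form, and it stops at the first Match block rather than trying to evaluate it. Keywords are compared case-insensitively and the first value seen is kept, which is how sshd_config is documented to behave.

```python
def read_sshd_settings(config_text,
                       wanted=("AllowAgentForwarding", "MaxSessions", "MaxStartups")):
    """Return explicitly set values for the wanted sshd_config keywords.

    Hypothetical helper: keywords missing from the result are simply not
    set in the global section, so sshd falls back to its defaults.
    """
    names = {name.lower(): name for name in wanted}
    found = {}
    for raw in config_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "match":
            # Conditional blocks are out of scope for this sketch.
            break
        if keyword in names and names[keyword] not in found:
            found[names[keyword]] = value
    return found


if __name__ == "__main__":
    # Made-up config text, purely for illustration.
    sample = """\
# example only
maxsessions 50
#MaxStartups 10:30:100
AllowAgentForwarding yes
MaxSessions 10
"""
    print(read_sshd_settings(sample))
```

On a real box the text would come from /etc/ssh/sshd_config (which may need root to read); running `sshd -T` as root is another way to see the effective values.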
--Bob

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/