On a whim, I tried setting pipelining to true, and found that it 
dramatically reduced the incidence of these issues, but still did not 
resolve them entirely.  

I still see examples of simple commands that take 60+1 seconds to complete. 
 

If anyone is seeing the same issues, definitely speak up.  Thanks!



On Monday, August 8, 2016 at 7:14:01 PM UTC-7, Brandon Heller wrote:
>
> tl;dr; Seeing an Ansible bug where commands take lots of time to start, 
> and then complete quickly; seems to affect Ubuntu VMs.
>
> Full description: I’m using Ansible to orchestrate code deployments. 
>  These deployments frequently take _much_ longer than expected, because 
> individual steps that should take only a few seconds to run (like copying a 
>  small file or getting facts) take up to 4 minutes, where the time is in 
> SSH waits.
>
> The Ansible profiler output shows that commands almost always take some 
> exact multiple of 60 seconds, plus a small, reasonable, expected delta. 
>  Here’s an example:
>
> Monday 08 August 2016  11:26:18 -0700 (0:00:00.650)       0:13:51.907 
> ********* 
>
> ===============================================================================
>  
>
>
> TASK: appserver : Copy web jar file from local ------------------------ 
> 242.42s
>
> TASK: setup  ---------------------------------------------------------- 
> 242.21s
>
> TASK: localhost_keystore : Create the forward directory --------------- 
> 181.18s
>
> TASK: appserver : Copy web dependencies jar file from local ------------ 
> 66.84s
>
> TASK: localhost_keystore : Copy the keystore file ---------------------- 
> 60.84s
>
> TASK: Wait for app server to accept incoming HTTPS connections ---------- 
> 9.74s
>
> This problem occurs across playbooks, servers, and tasks - pretty much 
> everywhere.  All are Ubuntu 14.04 server instances.  
>
> [I’ve seen a possibly related problem locally on my Mac laptop, when using 
> a script to create a Docker image.  When Ansible would connect right after 
> the docker image had started up, there was a race condition where the 
> connection could occur as the SSH server was starting up, and I would often 
> see a similar added 60 second delay.  This problem went away by delaying 
> the initial connection attempt by 5 seconds to give enough time for the SSH 
> server to fully start up. ]
>
> There’s one 60 second default delay in the source, which is related to 
> these delays:
>
> ansible/constants.py:ANSIBLE_SSH_ARGS = get_config(p, 'ssh_connection', 
> 'ssh_args', 'ANSIBLE_SSH_ARGS', '-o ControlMaster=auto -o 
> ControlPersist=60s’)
>
> The man page for this param:
>
>     ControlPersist
>
>             When used in conjunction with ControlMaster, specifies that 
> the master connection should remain open in the background (waiting for 
> future client connections) after the initial client connection has been 
> closed.  If set to ``no'', then the master connection will not be placed 
> into the background, and will close as soon as the initial client 
> connection is closed.  If set to ``yes'', then the master connection will 
> remain in the background indefinitely (until killed or closed via a 
> mechanism such as the ssh(1) ``-O exit'' option).  If set to a time in 
> seconds, or a time in any of the formats documented in sshd_config(5), then 
> the backgrounded master connection will automatically terminate after it 
> has remained idle (with no client connections) for the specified time.
>
> Looking in the SSH logs for a machine where this was occurring (with 60s 
> value), I see examples like the below where the session is immediately 
> closed, but then later opened, potentially multiple times, even for simple 
> things like setup (getting facts for a playbook).
>
> Aug  8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session 
> opened for user root by forward(uid=0)
>
> Aug  8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session 
> closed for user root
>
> Aug  8 13:36:20 lohika-linux-2 sshd[2265]: Received disconnect from 
> 10.128.0.116: 11: disconnected by user
>
> Aug  8 13:36:20 lohika-linux-2 sshd[2216]: pam_unix(sshd:session): session 
> closed for user forward
>
> Aug  8 13:36:21 lohika-linux-2 sshd[2361]: Accepted publickey for forward 
> from 10.128.0.116 port 40208 ssh2: RS
>
> Other times, you just see a ‘session opened’ message but no immediate 
> 'session closed’.
>
> Here’s an example of Ansible’s output with full verbosity when the delay 
> occurred (set to 25s); 3 out of 4 targets saw the delay but I’m showing one 
> fast (lohika-linux-4) example and one slow example (lohika-linux-3) in the 
> hopes that something pops out.
>
> TASK [setup] 
> *******************************************************************
>
> Monday 08 August 2016  15:05:59 -0700 (0:00:00.190)       0:01:46.080 
> ********* 
>
> <lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward
>
> <lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt 
> lohika-linux-4 ( umask 22 && mkdir -p "$( echo 
> $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" && echo "$( 
> echo $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" )
>
> <lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward
>
> <lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt 
> lohika-linux-3 ( umask 22 && mkdir -p "$( echo 
> $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" && echo "$( 
> echo $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" )
>
> <lohika-linux-4> PUT /tmp/tmpCVuqVg TO 
> /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup
>
> <lohika-linux-4> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpCVuqVg 
> [lohika-linux-4]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup
>
> <lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward
>
> <lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt 
> lohika-linux-4 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 
> /usr/bin/python 
> /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup; 
> rm -rf 
> "/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/" > 
> /dev/null 2>&1
>
> ok: [lohika-linux-4]
>
> <lohika-linux-3> PUT /tmp/tmpEbpIWo TO 
> /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup
>
> <lohika-linux-3> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpEbpIWo 
> [lohika-linux-3]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup
>
> <lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward
>
> <lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o 
> ControlPersist=25s -o StrictHostKeyChecking=no -o 
> 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o 
> KbdInteractiveAuthentication=no -o 
> PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
> -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o 
> ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt 
> lohika-linux-3 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 
> /usr/bin/python 
> /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup; 
> rm -rf 
> "/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/" > 
> /dev/null 2>&1
>
> ok: [lohika-linux-3]
>
> When changing this parameter via ansible.cfg, I definitely see an effect, 
> in that the delay lines up with the ControlMaster change.  For example, 
> dropping to 25s, I see 26s and 51s delays, and the rest effectively zero.
>
> However when I drop this param to 10s, I find that lots of commands that 
> previously had sub-second delays then take at least the ControlPersist time:
>
> Monday 08 August 2016  14:32:02 -0700 (0:00:00.053)       0:03:50.756 
> ********* 
>
> ===============================================================================
>  
>
>
> TASK: Wait for VM to boot ---------------------------------------------- 
> 42.17s
>
> TASK: setup  ----------------------------------------------------------- 
> 31.22s
>
> TASK: stat path=/etc/ec2_version --------------------------------------- 
> 30.92s
>
> TASK: Verify that parser_dev_web can auth to the Forward server -------- 
> 21.50s
>
> TASK: Ensure server is accepting incoming web connections -------------- 
> 21.01s
>
> TASK: Ensure that the server port is accessible ------------------------ 
> 20.94s
>
> TASK: setup  ----------------------------------------------------------- 
> 10.98s
>
> TASK: setup  ----------------------------------------------------------- 
> 10.97s
>
> TASK: Wait for app server to accept incoming HTTPS connections --------- 
> 10.89s
>
> TASK: Check the status of the libvirt-bin service ---------------------- 
> 10.69s
>
> TASK: Wait for 5 seconds before checking libvirt-bin status ------------- 
> 5.03s
>
> TASK: Delay before checking for the boot status of the VM ——————— 5.02s
>
> Or, for 20s:
>
> TASK: setup  ----------------------------------------------------------- 
> 61.18s
>
> TASK: stat path=/etc/ec2_version --------------------------------------- 
> 60.94s
>
> TASK: Check the status of the libvirt-bin service ---------------------- 
> 60.94s
>
> TASK: setup  ----------------------------------------------------------- 
> 41.22s
>
> TASK: Wait for VM to boot ---------------------------------------------- 
> 27.75s
>
> TASK: Wait for app server to accept incoming HTTPS connections --------- 
> 20.94s
>
> TASK: Ensure server is accepting incoming web connections -------------- 
> 20.92s
>
> TASK: Ensure that the server port is accessible ------------------------ 
> 20.71s
>
> So at this point I have a semi-workaround (use 25-30s), which helps the 
> worst case quite a bit, while only occasionally hurting the typical case.
>
> This certainly feels like either a bug in Ansible’s SSH handling for some 
> condition that I’m seeing when running on Ubuntu, or some issue in my SSH 
> client config that triggers errors.  It does seem like the problem occurs 
> more frequently when the underlying VM host for the target VMs is more 
> loaded.  I can’t seem to replicate the problem locally on a Mac.
>
> I’ve confirmed that the issue is not Reverse DNS lookups; the target 
> servers all have UseDNS no set in /etc/ssh/sshd_config, plus -u0 set to 
> further prevent Reverse DNS.
>
> The client version used on Ubuntu is the latest one for 14.04 LTS, which 
> is somewhat old: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7, OpenSSL 1.0.1f 6 Jan 
> 2014
>
> For most of these tests, I was using Ansible v2.0.0-0.7.rc2; switching to 
> v2.1.1.0-1 produced identical effects.
>
> Anyone have ideas, or seen anything similar?  
>
> Much appreciated!
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to ansible-project+unsubscr...@googlegroups.com.
To post to this group, send email to ansible-project@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ansible-project/f3e79601-4b6d-45a5-b698-f50298a3213c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to