Hey everyone,

I have multiple playbooks that run on a schedule against lots of hosts, some 
of which are sometimes turned off to save costs.

Almost all jobs on AWX are marked as failed because there is at least one 
host that is powered off. This is not very aesthetically pleasing, and it 
also makes it hard to tell when a job has actually failed on an important 
task on a host.

Another inconvenience is that the jobs take a lot of time to execute when 
there are lots of unreachable hosts, because Ansible hangs on them 
and waits for the connection. I tried decreasing the timeout setting in 
our ansible.cfg to 20 seconds, which did help a bit, but the tasks still 
wait a long time on the (turned off) hosts before carrying on with the 
other hosts.
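
For illustration, the same connection timeout can also be scoped to a single play instead of being set globally in ansible.cfg. This is only a sketch: it assumes ansible-core 2.11 or later, where the ssh connection plugin documents the `ansible_ssh_timeout` variable, and the value 20 simply mirrors the setting mentioned above. The ping task is a placeholder.

```yaml
- hosts: all
  gather_facts: no
  vars:
    # Assumption: ansible-core >= 2.11 and the default ssh connection plugin.
    # Mirrors the 20-second timeout from ansible.cfg, but only for this play.
    ansible_ssh_timeout: 20
  tasks:
    # Placeholder task, only here to make the play complete.
    - name: Check that the host answers
      ping:
```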
 
The solution for me was:
- Add a pre_task to each playbook that runs a wait_for_connection task
- Check whether it fails, and if so end the play without proceeding, like so:

```
- hosts: all
  gather_facts: no
  pre_tasks:
    - name: Check host reachability
      wait_for_connection:
        timeout: "{{ ssh_timeout_wait_for | default(5) }}"
        sleep: 1
      ignore_errors: true
      ignore_unreachable: true
      register: host_is_reachable

    - name: End play if host is unreachable
      meta: end_play
      when: host_is_reachable.failed
  roles:
    - role: roles/somerole
```
This seems to fix my first problem of jobs being marked as failed if one 
host is unreachable.

But it does not fix my second problem which is ansible hanging on the 
unreachable hosts for so long.

In the wait_for_connection task I have set the timeout to 5 seconds, 
expecting that Ansible would try to reach the host and, if it fails to 
do so within 5 seconds, end the play. But it does not do that.

Instead Ansible hangs on the unreachable host for more than 2 minutes, 
then throws a warning like this:
[WARNING]: Unhandled error in Python interpreter discovery for host
172.12.23.34: Failed to connect to the host via ssh: ssh: connect to host

And then waits some extra time and then the output of the 
wait_for_connection task gets printed like so:
TASK [Check host reachability] *************************************************
fatal: [172.12.23.34]: FAILED! => {"changed": false,
"elapsed": 169, "msg": "timed out waiting for ping module test:
Data could not be sent to remote host \"172.12.23.34\".
Make sure this host can be reached over ssh: ssh:
connect to host 172.12.23.34 port 22: Connection timed out\r\n"}
...ignoring

As you can see in the task output, wait_for_connection alone waited for 
169 seconds even though I specified a much lower value.
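
To make that gap visible on every run, a small debug task could print the requested timeout next to the `elapsed` value that wait_for_connection returns. This is a sketch, not part of the original playbook. It reuses the `host_is_reachable` and `ssh_timeout_wait_for` names from the play above and is meant to sit in pre_tasks right after the "Check host reachability" task.

```yaml
# Fragment for pre_tasks, placed directly after "Check host reachability".
- name: Report requested timeout versus actual elapsed time
  debug:
    msg: >-
      requested {{ ssh_timeout_wait_for | default(5) }}s,
      waited {{ host_is_reachable.elapsed | default('unknown') }}s
```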

Am I doing something wrong? Is this the default behavior?

Extra questions:
- Is this because Ansible tries to gather facts before even starting the 
wait_for_connection task? That was the reason I put the wait_for_connection 
in a pre_task.
- Is the 169 seconds not random, and does it have to do with the default SSH 
timeout settings? I get different values every time I run the playbook, so I 
don't think so.
- Please share any alternative approaches to fix the two problems above.
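
One alternative that might be worth trying, as a rough sketch only: probe the SSH port from the control node with the `wait_for` module instead of opening an SSH session, then drop only the hosts that did not answer with `meta: end_host` (available since Ansible 2.8). The `host` expression and `connection: local` follow the example in the `wait_for` module documentation. The name `ssh_port_probe` is made up for this example, and the sketch assumes that an open port 22 (or `ansible_port`) reachable from the control node, or from the AWX execution environment, is a good enough sign that the host is up.

```yaml
- hosts: all
  gather_facts: no
  pre_tasks:
    # Runs on the control node, so no SSH login is attempted against the target.
    - name: Probe the SSH port from the control node
      wait_for:
        host: "{{ (ansible_ssh_host | default(ansible_host)) | default(inventory_hostname) }}"
        port: "{{ ansible_port | default(22) }}"
        timeout: "{{ ssh_timeout_wait_for | default(5) }}"
      connection: local
      ignore_errors: true
      register: ssh_port_probe  # hypothetical variable name

    # end_host stops the play for the current host only; the others carry on.
    - name: Skip hosts whose SSH port did not answer
      meta: end_host
      when: ssh_port_probe is failed
  roles:
    - role: roles/somerole
```

Because the probe is a plain TCP check made locally, it should not depend on SSH timeouts or on Python interpreter discovery on the target, though I have not verified how it behaves under AWX.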

Any help would be appreciated :)
