[JIRA] [ec2-plugin] (JENKINS-23705) EC2 slave disconnects on longer running builds

bert...@jpoint.nl (JIRA) Tue, 03 Feb 2015 05:50:14 -0800

I'm experiencing the same problems with EC2 slaves.
We're using a custom AWS Linux AMI and slaves that terminate after 30 minutes of inactivity, instance type c3.large.

At seemingly random moments, slaves lose connectivity.
Sometimes the slaves run fine for a while, sometimes a few lose connectivity in a row.
Symptoms:

- no more build output is added to the build console log
- the slave goes offline
- the slave is accessible through SSH, but the slave.jar Java process isn't running anymore

We experimented with ClientAliveInterval 15 in the sshd config on the slave; it didn't help.

I added process list logging to see what happens.
The slave process disappears without anything noticeably strange happening (except for a disconnect on the master).
This can mean that either the slave Java process terminates unexpectedly, or the SSH connection is terminated by a timeout.
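
For anyone who wants to reproduce the process list logging, here is a minimal sketch of the idea. It is not the exact script we used; the function names, the `ps` column selection and the one-second interval are assumptions chosen for illustration.

```python
import subprocess
import time
from datetime import datetime


def agent_processes(ps_output, marker="slave.jar"):
    """Return the lines of a `ps` listing whose command line mentions marker."""
    return [line.strip() for line in ps_output.splitlines() if marker in line]


def log_process_list(logfile, interval=1.0, marker="slave.jar"):
    """Append a timestamped record of matching processes once per interval.

    Runs until interrupted. The interval and log format are illustrative only.
    """
    while True:
        ps_output = subprocess.run(
            ["ps", "-eo", "pid,etime,args"],
            capture_output=True, text=True, check=True,
        ).stdout
        found = agent_processes(ps_output, marker)
        with open(logfile, "a") as fh:
            stamp = datetime.now().isoformat(timespec="seconds")
            fh.write(f"{stamp} {len(found)} match(es): {found}\n")
        time.sleep(interval)
```

Running `log_process_list("/tmp/slave-ps.log")` in the background on the slave gives a per-second record that can be lined up against the timestamps in the system logs.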

Looking at the logging, the latter seems to be happening. At about the same second that the slave process disappears from the process list, the following entries appear in /var/log/secure:
Feb 3 11:24:43 ip-10-4-33-150 sshd[2243]: Timeout, client not responding.
Feb 3 11:24:43 ip-10-4-33-150 sshd[2241]: pam_unix(sshd:session): session closed for user ec2-user
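
To find these entries without reading the whole file, something like the following sketch could help. The regular expression is an assumption based only on the two log lines above, and `sshd_timeouts` is a hypothetical helper name; other syslog formats may need adjustments.

```python
import re

# Matches syslog lines such as:
# "Feb 3 11:24:43 ip-10-4-33-150 sshd[2243]: Timeout, client not responding."
SSHD_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"sshd\[(?P<pid>\d+)\]: (?P<message>.*)$"
)


def sshd_timeouts(lines):
    """Yield (timestamp, pid) for each sshd 'client not responding' entry."""
    for line in lines:
        match = SSHD_LINE.match(line.strip())
        if match and match["message"].startswith("Timeout, client not responding"):
            yield match["stamp"], int(match["pid"])
```

Passing an open file handle for /var/log/secure (which usually requires root) yields the timestamps to compare with the moment the slave process vanished from the process list.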

That means that sshd is terminating the connection.
On another build environment with practically the same setup (Ubuntu AMI), we don't see the disconnects.
I compared the two sshd config files on the slaves.
Notable differences:

- the Ubuntu slave (no disconnects) has "TCPKeepAlive yes" in its sshd_config, and neither ClientAliveInterval nor ClientAliveCountMax set
- the AWS Linux slave (disconnect issues) has TCPKeepAlive not set (commented out), ClientAliveInterval 15, and ClientAliveCountMax not set
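
A rough way to compare what the two configurations amount to is sketched below. The defaults in the table are taken from the OpenSSH sshd_config manual (TCPKeepAlive yes, ClientAliveInterval 0, ClientAliveCountMax 3) and should be checked against the installed version. The function names are hypothetical, and the parser is deliberately simplified: it ignores Match blocks and assumes plain integer seconds.

```python
# Defaults per the OpenSSH sshd_config manual; verify against your version.
DEFAULTS = {
    "tcpkeepalive": "yes",
    "clientaliveinterval": "0",
    "clientalivecountmax": "3",
}


def effective_settings(config_text):
    """Return the keepalive settings implied by the text of an sshd_config."""
    settings = dict(DEFAULTS)
    seen = set()
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # commented-out keywords fall back to the default
        if line.lower().startswith("match "):
            break  # simplification: conditional overrides are not evaluated
        parts = line.split(None, 1)
        key = parts[0].lower()  # keywords are case-insensitive
        if key in DEFAULTS and key not in seen and len(parts) == 2:
            settings[key] = parts[1].strip()
            seen.add(key)  # sshd uses the first value it finds for a keyword
    return settings


def client_alive_timeout(settings):
    """Seconds an unresponsive client is tolerated, or None if probing is off."""
    interval = int(settings["clientaliveinterval"])
    count = int(settings["clientalivecountmax"])
    return interval * count if interval > 0 else None
```

If the stock defaults apply, the AWS Linux settings come out as TCPKeepAlive yes (the commented-out line falls back to the default) with a client-alive timeout of 15 * 3 = 45 seconds, while the Ubuntu settings have client-alive probing switched off. That would be consistent with the "Timeout, client not responding" entries, but we have not confirmed it yet.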

The next thing we're going to try is to remove ClientAliveInterval and enable "TCPKeepAlive yes" on the AWS Linux slave.
