I'm experiencing the same problems with EC2 slaves.
We're using a custom Amazon Linux AMI with slaves that terminate after 30 minutes of inactivity (instance type c3.large).
Symptoms: at seemingly random moments, slaves lose connectivity. Sometimes the slaves run fine for a while; sometimes several lose connectivity in a row.
We experimented with ClientAliveInterval 15 in the sshd config on the slave; didn't help.
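For reference, the setting we tried looks like this in /etc/ssh/sshd_config (ClientAliveCountMax shown at its OpenSSH default of 3; with these values sshd drops a client that fails to answer three keepalive probes sent 15 seconds apart, i.e. after roughly 45 seconds):

```
# /etc/ssh/sshd_config on the slave
ClientAliveInterval 15   # send a keepalive probe after 15 s of inactivity
ClientAliveCountMax 3    # OpenSSH default; disconnect after 3 unanswered probes
```

Note that "Timeout, client not responding" is exactly what sshd logs when ClientAliveCountMax is exceeded, so this setting may itself be the mechanism killing the session.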
I added process list logging to see what happens.
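The logging was a simple snapshot of the process list; something along these lines (the path and interval here are illustrative, not our exact setup):

```
# crontab entry on the slave: snapshot Java processes once a minute
* * * * * (date; ps -ef | grep '[j]ava') >> /var/log/slave-ps.log
```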
The slave process disappears without anything strange being noticeable (apart from a disconnect on the master).
This means that either the slave Java process terminates unexpectedly, or the SSH connection timed out.
Judging from the logging, the latter is happening. Around the second that the slave process disappears from the process list, the following appears in /var/log/secure:
Feb 3 11:24:43 ip-10-4-33-150 sshd[2243]: Timeout, client not responding.
Feb 3 11:24:43 ip-10-4-33-150 sshd[2241]: pam_unix(sshd:session): session closed for user ec2-user
That means that sshd is terminating the connection.
On another build environment with practically the same setup (Ubuntu AMI), we don't see the disconnects.
I compared the two sshd config files on the slaves.
Noticeable difference:
The next thing we're going to try is to remove ClientAliveInterval and enable "TCPKeepAlive yes" on the AWS Linux slave.
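That change would look something like this (a sketch; the idea is to rely on TCP-level keepalives to keep connection state alive instead of having sshd enforce an application-level timeout):

```
# /etc/ssh/sshd_config on the AWS Linux slave
ClientAliveInterval 0   # 0 disables sshd's own keepalive/timeout probing
TCPKeepAlive yes        # rely on TCP keepalives instead
```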