In the last 12 months, we have encountered a very rare, but also very critical issue in Jenkins core or the SSH-Slaves plug-in.
The issue is that Jenkins spontaneously enters a state in which it reproducibly selects the wrong channel for some of its connected build hosts. All build hosts are connected via the SSH-Slaves plug-in.
This immediately leads to failing builds: they no longer respect the workspace locks, because they acquire the lock on the correct host but talk to a different host to execute the build.
A typical log output looks like this:
-------------------
14:45:35 Started by command line by <user>
14:45:35 Building remotely on musxbird039 in workspace /local/jenkins_workspace/workspace/<PROJECT>
[...]
14:45:35 Checkout:<GIT-REPO> / /local/jenkins_workspace/workspace/<PROJECT> - hudson.remoting.Channel@37ecb28e:musxbird029
-------------------
As you can see, it selects musxbird039 for building, but uses the channel to musxbird029. Since the workspace usually also exists physically on the other machine, the build starts. Unfortunately, because the workspace is locked only on musxbird039 and not on musxbird029, concurrent builds can collide freely.
This leads, of course, to a vast variety of build failures.
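To spot affected builds after the fact, the mismatch visible in the log excerpt above can be checked mechanically. The following is only a minimal sketch, not part of Jenkins: the function name `find_channel_mismatches` and both regular expressions are our own assumptions, derived solely from the two log lines shown above, and they assume node names without whitespace.

```python
import re

# Patterns derived from the log excerpt above (assumption: node names contain no whitespace).
NODE_RE = re.compile(r"Building remotely on (\S+)")
CHANNEL_RE = re.compile(r"hudson\.remoting\.Channel@[0-9a-f]+:(\S+)")


def find_channel_mismatches(log_text):
    """Return (assigned_node, channel_node, line) for every channel
    reference that names a different host than the assigned node."""
    assigned = None
    mismatches = []
    for line in log_text.splitlines():
        node_match = NODE_RE.search(line)
        if node_match:
            # Remember which node Jenkins claims to build on.
            assigned = node_match.group(1)
            continue
        channel_match = CHANNEL_RE.search(line)
        if channel_match and assigned and channel_match.group(1) != assigned:
            mismatches.append((assigned, channel_match.group(1), line))
    return mismatches


# The log excerpt from this report, with its placeholders left as-is.
SAMPLE = """\
14:45:35 Started by command line by <user>
14:45:35 Building remotely on musxbird039 in workspace /local/jenkins_workspace/workspace/<PROJECT>
14:45:35 Checkout:<GIT-REPO> / /local/jenkins_workspace/workspace/<PROJECT> - hudson.remoting.Channel@37ecb28e:musxbird029
"""

if __name__ == "__main__":
    for assigned, channel, line in find_channel_mismatches(SAMPLE):
        print(f"assigned to {assigned}, but channel points to {channel}: {line}")
```

Run against saved console logs, this would flag the excerpt above as a build assigned to musxbird039 whose channel points to musxbird029. It only detects the symptom; it does not explain or prevent it.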
We do not know of a way to reliably reproduce this issue, as it appears randomly after some time. Sometimes it takes months to appear, sometimes only days.
The only known way of repairing the bug is to disconnect both machines and let them restart their slaves and all their associated threads on the Jenkins master. Rebooting the server itself obviously also works.
We are quite frankly stumped by this bug. Even examining the slave->channel allocation code of Jenkins ourselves did not lead to any clue.
If you need more information, we will be happy to provide it.
Best regards,
Martin Schröder
Intel Mobile Communications GmbH.