Occasionally, we observe an issue with resuming jobs after a Jenkins master restart. Our configuration: Jenkins 2.138.1 (some older versions had the same problem), configured via init.groovy.d, with persistent jobs, running on top of a Kubernetes cluster. Slaves/executors are set up with the Kubernetes plugin 1.12.6 on Kubernetes 1.11. After a Jenkins master restart, a simple job (a git clone plus a shell script) sometimes resumes properly:
15:19:18 Commit message: "Quick commit"
15:19:18 > git rev-list --no-walk aaaaaa9bd4a093e0364df0f52f5447c15f3785f0 # timeout=10
[Pipeline] }
[Pipeline] [Pipeline] }
[Pipeline] [Pipeline] stage
[Pipeline] { (Execute my fancy shell script)
[Pipeline] sh
Resuming build at Wed Oct 01 13:19:59 GMT 2018 after Jenkins restart
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
15:19:19 [xxxxx-xxxxx] Running shell script
15:19:22 + vault login '-method=aws' 'role=jenkins'
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #40
Ready to run at Wed Oct 01 13:20:13 GMT 2018
15:20:13 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
15:20:13 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx):
15:20:14 (script execution performed properly)
(...)
Finished: SUCCESS
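For context, the affected job is roughly equivalent to the following simplified pipeline. This is a sketch, not the real Jenkinsfile: the repository URL, image, label, and trailing script are placeholders; only the stage name, container name, and the vault login step correspond to what appears in the logs above.

```groovy
// Hypothetical minimal reproduction of the affected job.
// Real repo URL, image, and script are replaced with placeholders.
podTemplate(label: 'jenkins-slave', containers: [
    containerTemplate(name: 'myContainer', image: 'ubuntu:18.04',
                      ttyEnabled: true, command: 'cat')
]) {
    node('jenkins-slave') {
        stage('Checkout') {
            // the "git clone" part of the job
            git url: 'https://example.com/xxxxx-xxxxx.git'
        }
        stage('Execute my fancy shell script') {
            container('myContainer') {
                // the shell step that hangs after a master restart
                sh 'vault login -method=aws role=jenkins && ./run.sh'
            }
        }
    }
}
```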
More often, however, it hangs at:
15:31:50 Running in /home/jenkins/workspace/xxxxx-xxxxx/playbooks
[Pipeline] {
[Pipeline] git
[Pipeline] }
[Pipeline] Resuming build at Wed Oct 01 13:32:39 GMT 2018 after Jenkins restart
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Ready to run at Wed Oct 01 13:33:07 GMT 2018
15:33:07 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
15:33:07 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx):
15:33:07
and has to be aborted. The Jenkins logs for the failed resume show:
Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Agent has already been launched, activating: {}
Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Agent has already been launched, activating: {}
Oct 01, 2018 1:33:03 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxx-xxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskAccepted
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerStepExecution
onResume
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
Building connection to Kubernetes kubernetes URL namespace default
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
Autoconfiguring Kubernetes client
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress=, namespace=default, caCertData=null, credentials=null, skipTlsVerify=true, connectTimeout=0, readTimeout=0]
Oct 01, 2018 1:33:08 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
Connected to Kubernetes kubernetes URL
In a successful resume, the next lines (which are not present on failure) are:
Oct 01, 2018 1:20:15 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launch proc with environment: [AGENT_WORKDIR=/home/jenkins/agent, ...]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing shell script inside container [myContainer] of pod [jenkins-slave-ansible-update-adfs-mgmt-lfm45-02g27]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onOpen : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:17 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launching with env vars: [...]
Oct 01, 2018 1:20:19 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing command: "... my shell command ..."
Oct 01, 2018 1:20:19 PM INFO org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1 doLaunch
Created process inside pod: [jenkins-slave-ansible-update-xxxxx-xxxxx], container: [myContainer] with pid:[-1]
Oct 01, 2018 1:20:44 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onClose : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:45 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxxx-xxxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskCompleted
...
Looking at the plugin source code didn't help. We don't understand why the job hangs and why Jenkins cannot carry on with the executor. Do you have any suggestions for how this could be fixed, or at least investigated more deeply?