Hi all,
I am trying to debug an intermittent failure with Remoting. I am familiar
with the overall design of Remoting but not its implementation. I'm not
sure if I should file a bug or not, so I am asking for help debugging here
first. Please let me know if I should file a bug instead. Below I will
present the data I have, and the questions I am trying to answer. I would
appreciate any help debugging from someone who is familiar with the
Remoting code.
My Jenkins master is running Jenkins 2.89.3 LTS (Remoting 3.14). My Jenkins
agent is connecting to the master via Swarm Plugin 3.8 (Remoting 3.16).
Also note that the master is not in the same local network as the swarm
client (the swarm client is running in AWS, the master is in a colo).
The swarm client connects to the master successfully but then very quickly
the job fails while trying to do a Git checkout:
15:31:47 Running on swarm-client-agent.example.com in
/var/tmp/jenkins/workspace/path/to/workspace
[Pipeline] {
[Pipeline] ws
15:31:47 Running in /export/home/path/to/ws
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Check out repository)
[Pipeline] checkout
15:32:42 > git rev-parse --is-inside-work-tree # timeout=10
[Pipeline] }
15:36:35 ERROR: Execution failed
15:36:35 java.nio.channels.ClosedChannelException
15:36:35 Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to
JNLP4-connect connection from 10.110.xyz.xyz/10.110.xyz.xyz:44291
15:36:35 at
hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693)
15:36:35 at hudson.remoting.Request.call(Request.java:192)
15:36:35 at hudson.remoting.Channel.call(Channel.java:907)
15:36:35 at
hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
15:36:35 at com.sun.proxy.$Proxy92.hasGitRepo(Unknown Source)
15:36:35 at
org.jenkinsci.plugins.gitclient.RemoteGitImpl.hasGitRepo(RemoteGitImpl.java:280)
15:36:35 at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1068)
15:36:35 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
15:36:35 at
org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113)
15:36:35 at
org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:85)
15:36:35 at
org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:75)
15:36:35 at
org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
15:36:35 at hudson.security.ACL.impersonate(ACL.java:260)
15:36:35 at
org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
15:36:35 at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
15:36:35 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
15:36:35 Caused: hudson.remoting.RequestAbortedException
15:36:35 at hudson.remoting.Request.abort(Request.java:329)
15:36:35 at hudson.remoting.Channel.terminate(Channel.java:992)
15:36:35 at
org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
15:36:35 at
org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
15:36:35 at
org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
15:36:35 at
org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
15:36:35 at
org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
15:36:35 at
org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
15:36:35 at
org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
15:36:35 at
org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
15:36:35 at
org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
15:36:35 at
jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
15:36:35 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
15:36:35 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
15:36:35 at java.lang.Thread.run(Thread.java:748)
Since this was a swarm client and thus ephemeral, it was removed from the
nodes view on the master and I have no access to any other logs on the
master side.
On the swarm client side, I have much more detailed logging. It looks like
it definitely connected properly to the master (which makes sense, because
the master entered the Git checkout stage):
2018-02-01 23:30:51 INFO hudson.plugins.swarm.Client main Client.main
invoked with: [-name swarm-client-agent.example.com -fsroot
/var/tmp/jenkins -master http://jenkins-master.example.com -username
<redacted> -password <redacted> -executors 1 -description Dynamically
configured Jenkins swarm plugin slave swarm-client-agent.example.com -mode
exclusive -deleteExistingClients -disableClientsUniqueId
-disableSslVerification]
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient <init>
SwarmClient constructed with hudson.plugins.swarm.Options@6f94fa3e
2018-02-01 23:30:52 INFO hudson.plugins.swarm.Client run Discovering
Jenkins master
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient
discoverFromMasterUrl discoverFromMasterUrl() invoked
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient
discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to
configure swarm client.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
details.
2018-02-01 23:30:55 INFO hudson.plugins.swarm.Client run Attempting to
connect to http://jenkins-master.example.com/
17e24806-9248-4b9b-86aa-ce97caa34001 with ID
2018-02-01 23:30:55 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb
Could not obtain CSRF crumb. Response code: 404
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main createEngine Setting up
agent: swarm-client-agent.example.com
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener <init>
Jenkins agent is running in headless mode.
2018-02-01 23:30:56 WARNING hudson.remoting.Engine startEngine No Working
Directory. Using the legacy JAR Cache location:
/var/tmp/jenkins/.jenkins/cache/jars
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status
Locating server among [http://jenkins-master.example.com/]
2018-02-01 23:30:56 INFO
org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting
server accepts the following protocols: [JNLP4-connect, Ping]
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Agent
discovery successful
Agent address: jenkins-master.example.com
Agent port: 53000
Identity: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status
Handshaking
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status
Connecting to jenkins-master.example.com:53000
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status
Trying protocol: JNLP4-connect
2018-02-01 23:30:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:30:58 INFO hudson.remoting.jnlp.Main$CuiListener status
Connected
2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Main.development => null
2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Main.timeout => null
2018-02-01 23:31:56 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.ProxyConfiguration.DEFAULT_CONNECT_TIMEOUT_MILLIS =>
null
2018-02-01 23:32:33 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.util.Secret.provider => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Util.noSymLink => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Util.symlinkEscapeHatch => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Util.maxFileDeletionRetries => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Util.deletionRetryWait => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Util.performGCOnFailedDelete => null
One minute later, disaster strikes:
2018-02-01 23:34:38 FINER
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
[JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Unexpected I/O exception
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:245)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.NetworkLayer
onRecvClosed [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer
onRecvClosed [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer
onRecvClosed [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 INFO hudson.remoting.jnlp.Main$CuiListener status
Terminated
2018-02-01 23:34:38 FINEST
org.jenkinsci.remoting.protocol.IOHub$IOHubSelectorWatcher run Windows
IOHub Watcher for IOHub#1: Selector[keys:0, gen:0] / pool-1-thread-2:
Finished
2018-02-01 23:34:38 FINE hudson.remoting.jnlp.Main main Engine has died
2018-02-01 23:34:38 INFO hudson.plugins.swarm.Client run Retrying in 10
seconds
2018-02-01 23:34:38 WARNING hudson.remoting.UserRequest perform
LinkageError while performing UserRequest:UserRPCRequest(27,hasGitRepo)
java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor
at hudson.util.StreamTaskListener._error(StreamTaskListener.java:166)
at hudson.util.StreamTaskListener.error(StreamTaskListener.java:175)
at
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.hasGitRepo(CliGitAPIImpl.java:259)
at hudson.plugins.git.GitAPI.hasGitRepo(GitAPI.java:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:922)
at
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:896)
at
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:210)
at hudson.remoting.UserRequest.perform(UserRequest.java:53)
at hudson.remoting.Request$2.run(Request.java:358)
at
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException:
hudson.console.ConsoleAnnotationDescriptor
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more
2018-02-01 23:34:38 FINE hudson.slaves.ChannelPinger$2 onClosed Terminating
ping thread for JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000
2018-02-01 23:34:38 FINE hudson.remoting.PingThread run Ping thread for
channel hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000 is interrupted. Terminating
2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer
doCloseWrite [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST
org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite
[JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] APP ENCODE: 0 bytes
2018-02-01 23:34:38 WARNING hudson.remoting.Request$2 run Failed to send
back a reply to the request hudson.remoting.Request$2@2d486301
hudson.remoting.ChannelClosedException: Channel
"hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000": channel is already closed
at hudson.remoting.Channel.send(Channel.java:672)
at hudson.remoting.Request$2.run(Request.java:371)
at
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at
org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1800(BIONetworkLayer.java:48)
at
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:264)
... 4 more
2018-02-01 23:34:38 FINEST
org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite
[JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Handshake status:
NOT_HANDSHAKING engine result: Status = CLOSED HandshakeStatus =
NOT_HANDSHAKING
bytesConsumed = 0 bytesProduced = 85
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer
onRecvClosed [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer
doCloseSend [JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer doCloseSend
[JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST
org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
[JNLP4-connect connection to
jenkins-master.example.com/172.16.106.38:53000] Reader thread stopped
2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient
discoverFromMasterUrl discoverFromMasterUrl() invoked
2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient
discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to
configure swarm client.
Two minutes later, the swarm client manages to reconnect back to the
master, but it is too late, because the job has already failed:
2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient
createHttpClient() invoked
2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient
Setting HttpClient credentials based on options passed
2018-02-01 23:36:47 FINER hudson.plugins.swarm.SwarmClient
getChildElementString getChildElementString() invoked
2018-02-01 23:36:47 FINE hudson.plugins.swarm.Candidate <init> Candidate
constructed with url: http://jenkins-master.example.com/, secret:
17e24806-9248-4b9b-86aa-ce97caa34001
2018-02-01 23:36:47 INFO hudson.plugins.swarm.Client run Attempting to
connect to http://jenkins-master.example.com/
17e24806-9248-4b9b-86aa-ce97caa34001 with ID
2018-02-01 23:36:51 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb
Could not obtain CSRF crumb. Response code: 404
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main createEngine Setting up
agent: swarm-client-agent.example.com
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener <init>
Jenkins agent is running in headless mode.
2018-02-01 23:36:57 WARNING hudson.remoting.Engine startEngine No Working
Directory. Using the legacy JAR Cache location:
/var/tmp/jenkins/.jenkins/cache/jars
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Locating server among [http://jenkins-master.example.com/]
2018-02-01 23:36:57 INFO
org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting
server accepts the following protocols: [JNLP4-connect, Ping]
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Agent
discovery successful
Agent address: jenkins-master.example.com
Agent port: 53000
Identity: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Handshaking
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Connecting to jenkins-master.example.com:53000
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Trying protocol: JNLP4-connect
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status
Connected
2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Main.development => null
2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property
(not found): hudson.Main.timeout => null
I also have FINE and FINEST level logs for the rest of the time on the
swarm client side, but I didn’t attach them to this email because they
don’t seem relevant.
My initial questions:
1. From a timing perspective, the first error is at 2018-02-01 23:34:38
when the swarm client gets an "Unexpected I/O exception
java.io.IOException: Connection timed out" when reading from the master. I
then see a bunch of "onRecvClosed" log statements on the swarm client
followed by "Terminated" and then attempts to reconnect to the master. The
swarm client is in AWS and the master is in a local colo, so it doesn't
surprise me that network connections occasionally time out. But why does
the master kill the job with java.nio.channels.ClosedChannelException
rather than waiting for the client to reconnect? The swarm client did in
fact reconnect successfully (its logs show this), but by then the job had
died on the master.
2. Is the "hudson.remoting.UserRequest perform LinkageError while
performing UserRequest:UserRPCRequest(27,hasGitRepo)
java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor"
a red herring or a relevant error? Should I be exploring this further? As
far as I know, I have all relevant plugins installed and up-to-date.
3. Who closed the connection first here, the client or the Jenkins
master? If the problem is indeed being exposed by my network, where should
I begin looking for dropped packets or congestion?
Thanks in advance for any help,
Basil
--
You received this message because you are subscribed to the Google Groups
"Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/jenkinsci-dev/8fb4be31-95a7-441d-ad98-8cc8318324ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.