Hi all, I am trying to debug an intermittent failure with Remoting. I am familiar with the overall design of Remoting but not its implementation. I'm not sure if I should file a bug or not, so I am asking for help debugging here first. Please let me know if I should file a bug instead. Below I will present the data I have, and the questions I am trying to answer. I would appreciate any help debugging from someone who is familiar with the Remoting code.
My Jenkins master is running Jenkins 2.89.3 LTS (Remoting 3.14). My Jenkins agent is connecting to the master via Swarm Plugin 3.8 (Remoting 3.16). Also note that the master is not in the same local network as the swarm client (the swarm client is running in AWS, the master is in a colo). The swarm client connects to the master successfully but then very quickly the job fails while trying to do a Git checkout: 15:31:47 Running on swarm-client-agent.example.com in /var/tmp/jenkins/workspace/path/to/workspace [Pipeline] { [Pipeline] ws 15:31:47 Running in /export/home/path/to/ws [Pipeline] { [Pipeline] stage [Pipeline] { (Check out repository) [Pipeline] checkout 15:32:42 > git rev-parse --is-inside-work-tree # timeout=10 [Pipeline] } 15:36:35 ERROR: Execution failed 15:36:35 java.nio.channels.ClosedChannelException 15:36:35 Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.110.xyz.xyz/10.110.xyz.xyz:44291 15:36:35 at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693) 15:36:35 at hudson.remoting.Request.call(Request.java:192) 15:36:35 at hudson.remoting.Channel.call(Channel.java:907) 15:36:35 at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281) 15:36:35 at com.sun.proxy.$Proxy92.hasGitRepo(Unknown Source) 15:36:35 at org.jenkinsci.plugins.gitclient.RemoteGitImpl.hasGitRepo(RemoteGitImpl.java:280) 15:36:35 at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1068) 15:36:35 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123) 15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113) 15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:85) 15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:75) 15:36:35 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47) 15:36:35 at hudson.security.ACL.impersonate(ACL.java:260) 15:36:35 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44) 15:36:35 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 15:36:35 at java.util.concurrent.FutureTask.run(FutureTask.java:266) 15:36:35 Caused: hudson.remoting.RequestAbortedException 15:36:35 at hudson.remoting.Request.abort(Request.java:329) 15:36:35 at hudson.remoting.Channel.terminate(Channel.java:992) 15:36:35 at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208) 15:36:35 at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222) 15:36:35 at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832) 15:36:35 at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287) 15:36:35 at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172) 15:36:35 at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832) 15:36:35 at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154) 15:36:35 at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179) 15:36:35 at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721) 15:36:35 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) 15:36:35 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 15:36:35 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 15:36:35 at java.lang.Thread.run(Thread.java:748) Since this was a swarm client and thus ephemeral, it was removed from the nodes view on the master and I have no access to any other logs on the master side. On the swarm client side, I have much more detailed logging. It looks like it definitely connected properly to the master (which makes sense, because the master entered the Git checkout stage): 2018-02-01 23:30:51 INFO hudson.plugins.swarm.Client main Client.main invoked with: [-name swarm-client-agent.example.com -fsroot /var/tmp/jenkins -master http://jenkins-master.example.com -username <redacted> -password <redacted> -executors 1 -description Dynamically configured Jenkins swarm plugin slave swarm-client-agent.example.com -mode exclusive -deleteExistingClients -disableClientsUniqueId -disableSslVerification] 2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient <init> SwarmClient constructed with hudson.plugins.swarm.Options@6f94fa3e 2018-02-01 23:30:52 INFO hudson.plugins.swarm.Client run Discovering Jenkins master 2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl discoverFromMasterUrl() invoked 2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to configure swarm client. SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 2018-02-01 23:30:55 INFO hudson.plugins.swarm.Client run Attempting to connect to http://jenkins-master.example.com/ 17e24806-9248-4b9b-86aa-ce97caa34001 with ID 2018-02-01 23:30:55 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: swarm-client-agent.example.com 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode. 2018-02-01 23:30:56 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://jenkins-master.example.com/] 2018-02-01 23:30:56 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping] 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Agent discovery successful Agent address: jenkins-master.example.com Agent port: 53000 Identity: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Handshaking 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Connecting to jenkins-master.example.com:53000 2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Trying protocol: JNLP4-connect 2018-02-01 23:30:57 INFO hudson.remoting.jnlp.Main$CuiListener status Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5 2018-02-01 23:30:58 INFO hudson.remoting.jnlp.Main$CuiListener status Connected 2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.development => null 2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.timeout => null 2018-02-01 23:31:56 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.ProxyConfiguration.DEFAULT_CONNECT_TIMEOUT_MILLIS => null 2018-02-01 23:32:33 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.util.Secret.provider => null 2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.noSymLink => null 2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.symlinkEscapeHatch => null 2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.maxFileDeletionRetries => null 2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.deletionRetryWait => null 2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.performGCOnFailedDelete => null One minute later, disaster strikes: 2018-02-01 23:34:38 FINER org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Unexpected I/O exception java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:245) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at hudson.remoting.Engine$1$1.run(Engine.java:94) at java.lang.Thread.run(Thread.java:748) 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.NetworkLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed 2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed 2018-02-01 23:34:38 INFO hudson.remoting.jnlp.Main$CuiListener status Terminated 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.IOHub$IOHubSelectorWatcher run Windows IOHub Watcher for IOHub#1: Selector[keys:0, gen:0] / pool-1-thread-2: Finished 2018-02-01 23:34:38 FINE hudson.remoting.jnlp.Main main Engine has died 2018-02-01 23:34:38 INFO hudson.plugins.swarm.Client run Retrying in 10 seconds 2018-02-01 23:34:38 WARNING hudson.remoting.UserRequest perform LinkageError while performing UserRequest:UserRPCRequest(27,hasGitRepo) java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor at hudson.util.StreamTaskListener._error(StreamTaskListener.java:166) at hudson.util.StreamTaskListener.error(StreamTaskListener.java:175) at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.hasGitRepo(CliGitAPIImpl.java:259) at hudson.plugins.git.GitAPI.hasGitRepo(GitAPI.java:232) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:922) at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:896) at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853) at hudson.remoting.UserRequest.perform(UserRequest.java:210) at hudson.remoting.UserRequest.perform(UserRequest.java:53) at hudson.remoting.Request$2.run(Request.java:358) at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at hudson.remoting.Engine$1$1.run(Engine.java:94) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassNotFoundException: hudson.console.ConsoleAnnotationDescriptor at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 20 more 2018-02-01 23:34:38 FINE hudson.slaves.ChannelPinger$2 onClosed Terminating ping thread for JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000 2018-02-01 23:34:38 FINE hudson.remoting.PingThread run Ping thread for channel hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000 is interrupted. Terminating 2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer doCloseWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] APP ENCODE: 0 bytes 2018-02-01 23:34:38 WARNING hudson.remoting.Request$2 run Failed to send back a reply to the request hudson.remoting.Request$2@2d486301 hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000": channel is already closed at hudson.remoting.Channel.send(Channel.java:672) at hudson.remoting.Request$2.run(Request.java:371) at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at hudson.remoting.Engine$1$1.run(Engine.java:94) at java.lang.Thread.run(Thread.java:748) Caused by: java.nio.channels.ClosedChannelException at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154) at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1800(BIONetworkLayer.java:48) at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:264) ... 4 more 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Handshake status: NOT_HANDSHAKING engine result: Status = CLOSED HandshakeStatus = NOT_HANDSHAKING bytesConsumed = 0 bytesProduced = 85 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer doCloseSend [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.BIONetworkLayer doCloseSend [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND 2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Reader thread stopped 2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl discoverFromMasterUrl() invoked 2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to configure swarm client. Two minutes later, the swarm client manages to reconnect back to the master, but it is too late, because the job has already failed: 2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked 2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient Setting HttpClient credentials based on options passed 2018-02-01 23:36:47 FINER hudson.plugins.swarm.SwarmClient getChildElementString getChildElementString() invoked 2018-02-01 23:36:47 FINE hudson.plugins.swarm.Candidate <init> Candidate constructed with url: http://jenkins-master.example.com/, secret: 17e24806-9248-4b9b-86aa-ce97caa34001 2018-02-01 23:36:47 INFO hudson.plugins.swarm.Client run Attempting to connect to http://jenkins-master.example.com/ 17e24806-9248-4b9b-86aa-ce97caa34001 with ID 2018-02-01 23:36:51 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: swarm-client-agent.example.com 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode. 2018-02-01 23:36:57 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://jenkins-master.example.com/] 2018-02-01 23:36:57 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping] 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Agent discovery successful Agent address: jenkins-master.example.com Agent port: 53000 Identity: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Handshaking 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Connecting to jenkins-master.example.com:53000 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Trying protocol: JNLP4-connect 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5 2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Connected 2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.development => null 2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.timeout => null I also have FINE and FINEST level logs for the rest of the time on the swarm client side, but I didn’t attach them to this email because they don’t seem relevant. My initial questions: 1. From a timing perspective, the first error is at 2018-02-01 23:34:38 when the swarm client gets an "Unexpected I/O exception java.io.IOException: Connection timed out" when reading from the master. I then see a bunch of "onRecvClosed" log statements on the swarm client followed by "Terminated" and then attempts to reconnect to the master. The swarm client is in AWS and the master is in a local colo, so it doesn't surprise me that network connections occasionally time out. But why does the master kill the job with java.nio.channels.ClosedChannelException rather than waiting for the client to reconnect? The swarm client did in fact reconnect successfully (its logs show this), but by then the job had died on the master. 2. Is the "hudson.remoting.UserRequest perform LinkageError while performing UserRequest:UserRPCRequest(27,hasGitRepo) java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor" a red herring or a relevant error? Should I be exploring this further? As far as I know, I have all relevant plugins installed and up-to-date. 3. Who closed the connection first here, the client or the Jenkins master? If the problem is indeed being exposed by my network, where should I begin looking for dropped packets or congestion? Thanks in advance for any help, Basil -- You received this message because you are subscribed to the Google Groups "Jenkins Developers" group. To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/8fb4be31-95a7-441d-ad98-8cc8318324ad%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.