Thanks for the detailed log analysis. I agree with your ordering of the events, although some details are still suspicious. For example, if the client decided to shut down the connection and that properly arrived at the server, we should see something other than "java.io.IOException: Operation timed out" in the read. And we don't set any read timeout, so I'm puzzled as to why we see this exception during the read.
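To make the read-timeout point concrete, here is a minimal sketch in plain Java (this is not Jenkins code; the class and method names are made up for illustration). It shows that a socket with no configured timeout reports SO_TIMEOUT as 0, meaning reads block indefinitely, and that a timeout set explicitly via setSoTimeout surfaces as java.net.SocketTimeoutException. The log shows a plain java.io.IOException instead, which suggests, but does not prove, that the timeout was reported by the OS network stack rather than by a Java-level read timeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, for illustration only.
class ReadTimeoutDemo {
    /** Returns the SO_TIMEOUT of a freshly created, unconfigured socket. */
    static int defaultReadTimeout() throws IOException {
        try (Socket s = new Socket()) {
            return s.getSoTimeout(); // 0 means "block forever"
        }
    }

    /** Reads from a silent peer with an explicit timeout and reports the exception class seen. */
    static String exceptionOnExplicitTimeout(int millis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(millis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // the peer never writes, so only the timeout can end this call
                return "no exception";
            } catch (SocketTimeoutException e) {
                return e.getClass().getName();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default SO_TIMEOUT = " + defaultReadTimeout());
        System.out.println("explicit timeout raised " + exceptionOnExplicitTimeout(200));
    }
}
```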

In any case, according to that theory, the first event that caused this is the following:

Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.net.SocketException: Software caused connection abort: recv failed
at java.net.SocketInputStream.socketRead0(Native Method)

This appears to be an error unique to Windows, and I found this Stack Overflow conversation, as well as this KB article, helpful for understanding what it means.

If I understand this error correctly, it means that the Jenkins slave has given WinSock some data to send, and WinSock has sent some packets over the network, but it's not getting the data through to the other side, and so it has given up and declared the connection lost. That happens well after the write is called, so the next read that comes in gets hit by this problem.

The links I mentioned above do not really discuss what that "data transmission time-out" is about. It could be anything from WinSock not getting a TCP ACK from the other side, to TCP flow control, to a self-imposed timeout unique to WinSock.

I'd normally have suspected a network issue, but the successive reconnection attempts seem to exclude this possibility.

Another possibility is that this is related to issues like JENKINS-24050, where an unrelated problem kills the NIO selector thread. When that happens, all the sockets stop getting serviced, so data that the client sends will not be picked up by the application on the server side. TCP ACKs should still come back in this case, since the kernel is receiving the packets all right, but WinSock might still decide to time out in this situation, perhaps due to a self-imposed timeout. This would need some experiments.
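As a starting point for such an experiment, here is a rough sketch in plain Java (not Jenkins code; the class and method names are hypothetical). It imitates a server whose application has stopped servicing a socket: the connection is accepted but never read. A small write from the client is expected to still succeed, because the peer's kernel takes the data into its receive buffer regardless of what the application does. It does not reproduce the WinSock timeout itself; that would need a much longer run on Windows with enough data to fill the buffers.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, for illustration only.
class UnservicedSocketDemo {
    /**
     * Connects to a listener whose application accepts the connection but never reads,
     * then writes a small payload. Returns true if the write completed without an exception.
     */
    static boolean writeToUnservicedPeer(byte[] payload) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Deliberately no read on 'accepted': this stands in for a dead selector thread.
            OutputStream out = client.getOutputStream();
            out.write(payload);
            out.flush();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("write succeeded: " + writeToUnservicedPeer(new byte[1024]));
    }
}
```

With a payload much larger than the socket buffers, the write would eventually block instead of returning, which is the state where a sender-side timeout could plausibly kick in.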

I've just fixed JENKINS-24050, so if possible I'd love to have you try this fix, which should make it into 1.580. Another useful test is to disable the use of NIO on the server side for managing JNLP slaves by setting the system property '-Djenkins.slaves.NioChannelSelector.disabled=true' on the master. If the problem goes away with it, that'd be useful input.
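If you want to confirm that the property actually reached the master's JVM, a check along these lines should work. This is only a sketch: it assumes the flag is an ordinary JVM system property (as the -D syntax implies), and the class name is made up; it is not how Jenkins itself is guaranteed to read the flag.

```java
// Hypothetical helper, for illustration only.
class NioFlagCheck {
    // Property name taken from the -D option mentioned above.
    static final String PROPERTY = "jenkins.slaves.NioChannelSelector.disabled";

    /** Returns true only if the system property is set to "true" (case-insensitive). */
    static boolean nioSelectorDisabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + System.getProperty(PROPERTY));
        System.out.println("NIO selector disabled: " + nioSelectorDisabled());
    }
}
```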

For anyone else seeing problems, please check the relevant logs on both the server and the slave and share them with us, so that we can resolve this problem efficiently.
