On 23.09.2013 07:27, Mladen Turk wrote: > On 09/22/2013 03:39 PM, Rainer Jung wrote: >> On 22.09.2013 13:17, Rainer Jung wrote: >>> I debugged around my occasional failures for TestCoyoteAdapter when >>> using APR. >>> >>> Error is: >>> >>> SEVERE [http-apr-127.0.0.1-auto-13-Poller] >>> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with >>> error [81] : [File descriptor in bad state] >> >> ... >> >> >>> Not sure whether the problem is more in the concurrent poll plus remove, >>> or the following code in poll.c: >>> >>> TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset, >>> jlong socket) >>> { >>> apr_pollfd_t fd; >>> apr_status_t rv; >>> tcn_pollset_t *p = J2P(pollset, tcn_pollset_t *); >>> tcn_socket_t *s = J2P(socket, tcn_socket_t *); >>> >>> UNREFERENCED_STDARGS; >>> TCN_ASSERT(socket != 0); >>> >>> if (s->pe == NULL) { >>> /* Already removed */ >>> return APR_SUCCESS; >>> } >>> >>> Here we return APR_SUCCESS and the code calling Poll.remove in >>> AprEndpoint always does: >>> >>> rv = Poll.remove(pollers[i], socket); >>> if (rv != Status.APR_NOTFOUND) { >>> pollerSpace[i]++; >>> connectionCount--; >>> break; >>> } >>> >>> So the pollerSpace and connectionCount numbers are (in/de)cremented. >> >> The following patch seems to fix it for me, at least 150 test runs for >> TestCoyoteAdapter were successful: >> >> Index: ../native/branches/1.1.x/native/src/poll.c >> =================================================================== >> --- ../native/branches/1.1.x/native/src/poll.c (revision 1525348) >> +++ ../native/branches/1.1.x/native/src/poll.c (working copy) >> @@ -259,7 +259,7 @@ >> >> if (s->pe == NULL) { >> /* Already removed */ >> - return APR_SUCCESS; >> + return APR_NOTFOUND; >> } >> fd.desc_type = APR_POLL_SOCKET; >> fd.desc.s = s->sock; >> >> > > The patch seems fine. I mean any return value should do in theory. > The main question is why is particular socket removed twice > from the Poller. This is called directly from java code > so wrapper seems to call it twice (or more). > I suspect that the socket is first closed and then Poller loop removes it. > Or it can be removed by poll with doRemove == true or during pollset > maintain. > In any case after removed either by poll or maintain returned set of > removed > sockets must be invalidated from pollset so it doesn't use it again in > explicit remove.
I agree that there's probably another problem further up the stack. Since the apr endpoint explicitely uses the return value of remove to decide whether something was removed, APT_NOTFOUND seems better. But as you said: why is it calling remove for a socket not in the poller? When the endpoint called remove for "2856392", the poller has size one and consisted only of socket 2218784. Strange. Here's the more complete log snippet: SOCKET-1 exists, but will be destroyed before the problem happen, SOCKET-2 s going to be accepted and later produces the problem, SOCKET-3 comes into the picture only shortly before the problem happens. 12:19:12.844 FINE [13-exec-4] AprEndpoint$Poller.add socket [SOCKET-1], timeout [3,000], flags [1] Here 2 is acceppted 12:19:12.850 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions socket [SOCKET-2] and added to the poller 12:19:12.853 FINE [13-exec-5] AprEndpoint$Poller.add socket [SOCKET-2], timeout [3,000], flags [1] 12:19:12.858 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-1] 12:19:12.860 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket SOCKET-1 12:19:12.861 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Here adding to the poller actually happens 12:19:12.862 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2] 12:19:12.863 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 1 socket SOCKET-2 12:19:12.864 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Now the poller contains Sockets 1 and 2 and gets polled 12:19:12.865 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 of size 2 with timeout 2000 12:19:12.866 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 returned with 2 It returns both sockets: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8) = 2 [0] 12:19:12.867 FINE [13-Poller] AprEndpoint$Poller.run Processing socket [SOCKET-1] for event(s) [1] 12:19:12.869 FINE [13-Poller] AprEndpoint$Poller.run Processing socket [SOCKET-2] for event(s) [1] Socket 1 gets destroyed. 12:19:12.871 FINE [13-exec-6] AprEndpoint.destroySocket socket [SOCKET-1], doIt [true] Socket 2 is handled by an exec thread which adds it back to the poller: port_associate(6, 4, 0x0000000D, 0x00000001, 0x0089AB88) = 0 12:19:12.878 FINE [13-exec-7] AprEndpoint$Poller.add socket [SOCKET-2], timeout [3,000], flags [1] 12:19:12.880 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2] 12:19:12.881 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket SOCKET-2 12:19:12.882 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Now again polling the poller that now only contains socket 2 12:19:12.883 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 of size 1 with timeout 2000 12:19:12.884 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 returned with 1 Socket 2 is returned from the polling: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8) = 1 [0] 12:19:12.885 FINE [13-Poller] AprEndpoint$Poller.run Processing socket [SOCKET-2] for event(s) [1] exec thread 8 handles it and adds it back to the poller 12:19:12.890 FINE [13-exec-8] AprEndpoint$Poller.add socket [SOCKET-2], timeout [3,000], flags [1] 12:19:12.892 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2] 12:19:12.893 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket SOCKET-2 Before Socket 2 has been completely added to the poller another socket (3) gets accepted: accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT) = 16 12:19:12.898 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions socket [SOCKET-3] exec thread 9 wants to add socket 3 to the poller: 12:19:12.901 FINE [13-exec-9] AprEndpoint$Poller.add socket [SOCKET-3], timeout [3,000], flags [1] 12:19:12.894 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Now poller has size 1 and contains socket 2: port_associate(6, 4, 0x0000000D, 0x00000001, 0x0089ABE8) = 0 The poller gets polled 12:19:12.903 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 of size 1 with timeout 2000 12:19:12.905 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 returned with 1 12:19:12.906 FINE [13-Poller] AprEndpoint$Poller.run Processing socket [SOCKET-2] for event(s) [1] polling returns socket 2, size of poller is again 0, socket 3 waits to be added: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8) = 1 [0] 12:19:12.909 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-3] 12:19:12.910 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket SOCKET-3 12:19:12.911 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Socket 3 is added to the poller, has now size 1: port_associate(6, 4, 0x00000010, 0x00000001, 0x0089AB88) = 0 Poller gets polled 12:19:12.912 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller number 0 of size 1 with timeout 2000 exec thread 10 starts to work on socket 2 returned by the earlier poll, but then exec thread 10 immediately calls remove for socket 2. That's the problem. The socket was returned by a poll, why calling remove? 12:19:12.914 WARNING [13-exec-10] AprEndpoint$Poller.removeFromPoller Removing from poller number 0 of size 1 socket SOCKET-2 Here's what happens natively in the exec thread: 19372/141: pollsys(0xB317EFC0, 1, 0xB317EF58, 0x00000000) = 1 19372/141: fd=13 ev=POLLIN rev=POLLIN 19372/141: timeout: 3.000000000 sec 19372/141: read(13, 0x00219468, 8192) = 0 It polls the socket with a timeout of 3 seconds, then reads from it and gets 0 bytes. To me this looks like a SocketProcessor is called with OPEN_READ, which calls handler.process with OPEN_READ. This finally gets handled in the processor. I don't know what the expected behavior for process() with starting state OPEN_READ is, if the read returns 0 bytes. Somehow there seems to be an error condition triggered, even by the 0 bytes or a timeout. But a timeout is unlikely, because everything takes only a few milliseconds. The error condition leads to a call of destroySocket() which in turn calls poller.removeFromPoller(). So: - since removeFromPoller() is part of error condition handling, and its return code is used to determine adjusting the externally tracked size of the poller and the connection counts, the return code must be "correct". It is still open to me, whether returning APR_NOTFOUND is a problem for the conection count tracking. - we should check, whethera 0 byte read after OPEN_READ really is a problem or must be expected even for a non-broken socket. Regards, Rainer --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org For additional commands, e-mail: dev-h...@tomcat.apache.org