On 23.09.2013 07:27, Mladen Turk wrote:
> On 09/22/2013 03:39 PM, Rainer Jung wrote:
>> On 22.09.2013 13:17, Rainer Jung wrote:
>>> I debugged around my occasional failures for TestCoyoteAdapter when
>>> using APR.
>>>
>>> Error is:
>>>
>>> SEVERE [http-apr-127.0.0.1-auto-13-Poller]
>>> org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
>>> error [81] : [File descriptor in bad state]
>>
>> ...
>>
>>
>>> Not sure whether the problem is more in the concurrent poll plus remove,
>>> or the following code in poll.c:
>>>
>>> TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
>>>                                         jlong socket)
>>> {
>>>      apr_pollfd_t fd;
>>>      apr_status_t rv;
>>>      tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
>>>      tcn_socket_t  *s = J2P(socket, tcn_socket_t *);
>>>
>>>      UNREFERENCED_STDARGS;
>>>      TCN_ASSERT(socket != 0);
>>>
>>>      if (s->pe == NULL) {
>>>          /* Already removed */
>>>          return APR_SUCCESS;
>>>      }
>>>
>>> Here we return APR_SUCCESS and the code calling Poll.remove in
>>> AprEndpoint always does:
>>>
>>>                      rv = Poll.remove(pollers[i], socket);
>>>                      if (rv != Status.APR_NOTFOUND) {
>>>                          pollerSpace[i]++;
>>>                          connectionCount--;
>>>                          break;
>>>                      }
>>>
>>> So the pollerSpace and connectionCount numbers are (in/de)cremented.
>>
>> The following patch seems to fix it for me, at least 150 test runs for
>> TestCoyoteAdapter were successful:
>>
>> Index: ../native/branches/1.1.x/native/src/poll.c
>> ===================================================================
>> --- ../native/branches/1.1.x/native/src/poll.c  (revision 1525348)
>> +++ ../native/branches/1.1.x/native/src/poll.c  (working copy)
>> @@ -259,7 +259,7 @@
>>
>>       if (s->pe == NULL) {
>>           /* Already removed */
>> -        return APR_SUCCESS;
>> +        return APR_NOTFOUND;
>>       }
>>       fd.desc_type   = APR_POLL_SOCKET;
>>       fd.desc.s      = s->sock;
>>
>>
> 
> The patch seems fine. I mean any return value should do in theory.
> The main question is why is particular socket removed twice
> from the Poller. This is called directly from java code
> so wrapper seems to call it twice (or more).
> I suspect that the socket is first closed and then Poller loop removes it.
> Or it can be removed by poll with doRemove == true or during pollset
> maintain.
> In any case after removed either by poll or maintain returned set of
> removed
> sockets must be invalidated from pollset so it doesn't use it again in
> explicit remove.

I agree that there's probably another problem further up the stack.
Since the apr endpoint explicitely uses the return value of remove to
decide whether something was removed, APT_NOTFOUND seems better. But as
you said: why is it calling remove for a socket not in the poller? When
the endpoint called remove for "2856392", the poller has size one and
consisted only of socket 2218784. Strange.

Here's the more complete log snippet:

SOCKET-1 exists, but will be destroyed before the problem happen,
SOCKET-2 s going to be accepted and later produces the problem,
SOCKET-3 comes into the picture only shortly before the problem happens.


12:19:12.844 FINE [13-exec-4] AprEndpoint$Poller.add socket [SOCKET-1],
timeout [3,000], flags [1]

Here 2 is acceppted

12:19:12.850 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions
socket [SOCKET-2]

and added to the poller

12:19:12.853 FINE [13-exec-5] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.858 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-1]
12:19:12.860 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-1
12:19:12.861 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Here adding to the poller actually happens

12:19:12.862 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.863 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 1 socket SOCKET-2
12:19:12.864 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now the poller contains Sockets 1 and 2 and gets polled

12:19:12.865 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 2 with timeout 2000
12:19:12.866 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 2

It returns both sockets: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8)
 = 2 [0]

12:19:12.867 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-1] for event(s) [1]
12:19:12.869 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

Socket 1 gets destroyed.

12:19:12.871 FINE [13-exec-6] AprEndpoint.destroySocket socket
[SOCKET-1], doIt [true]

Socket 2 is handled by an exec thread which adds it back to the poller:
port_associate(6, 4, 0x0000000D, 0x00000001, 0x0089AB88) = 0

12:19:12.878 FINE [13-exec-7] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.880 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.881 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-2
12:19:12.882 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now again polling the poller that now only contains socket 2

12:19:12.883 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000
12:19:12.884 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 1

Socket 2 is returned from the polling: port_getn(6, 0x005D0CF8, 8192, 1,
0xB417F6F8)   = 1 [0]

12:19:12.885 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

exec thread 8 handles it and adds it back to the poller

12:19:12.890 FINE [13-exec-8] AprEndpoint$Poller.add socket [SOCKET-2],
timeout [3,000], flags [1]
12:19:12.892 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-2]
12:19:12.893 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-2

Before Socket 2 has been completely added to the poller another socket
(3) gets accepted: accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT)  = 16

12:19:12.898 FINE [13-Acceptor-0] AprEndpoint.processSocketWithOptions
socket [SOCKET-3]

exec thread 9 wants to add socket 3 to the poller:

12:19:12.901 FINE [13-exec-9] AprEndpoint$Poller.add socket [SOCKET-3],
timeout [3,000], flags [1]
12:19:12.894 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Now poller has size 1 and contains socket 2: port_associate(6, 4,
0x0000000D, 0x00000001, 0x0089ABE8) = 0


The poller gets polled

12:19:12.903 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000
12:19:12.905 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 returned with 1
12:19:12.906 FINE [13-Poller] AprEndpoint$Poller.run Processing socket
[SOCKET-2] for event(s) [1]

polling returns socket 2, size of poller is again 0, socket 3 waits to
be added: port_getn(6, 0x005D0CF8, 8192, 1, 0xB417F6F8)   = 1 [0]

12:19:12.909 FINE [13-Poller] AprEndpoint$Poller.run socket [SOCKET-3]
12:19:12.910 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 of size 0 socket SOCKET-3
12:19:12.911 WARNING [13-Poller] AprEndpoint$Poller.addToPoller Adding
to poller number 0 returned with 0

Socket 3 is added to the poller, has now size 1: port_associate(6, 4,
0x00000010, 0x00000001, 0x0089AB88) = 0

Poller gets polled

12:19:12.912 WARNING [13-Poller] AprEndpoint$Poller.run Polling poller
number 0 of size 1 with timeout 2000

exec thread 10 starts to work on socket 2 returned by the earlier poll,
but then exec thread 10 immediately calls remove for socket 2. That's
the problem. The socket was returned by a poll, why calling remove?

12:19:12.914 WARNING [13-exec-10] AprEndpoint$Poller.removeFromPoller
Removing from poller number 0 of size 1 socket SOCKET-2

Here's what happens natively in the exec thread:

19372/141:      pollsys(0xB317EFC0, 1, 0xB317EF58, 0x00000000)  = 1
19372/141:              fd=13 ev=POLLIN rev=POLLIN
19372/141:              timeout: 3.000000000 sec
19372/141:      read(13, 0x00219468, 8192)                      = 0

It polls the socket with a timeout of 3 seconds, then reads from it and
gets 0 bytes. To me this looks like a SocketProcessor is called with
OPEN_READ, which calls handler.process with OPEN_READ. This finally gets
handled in the processor. I don't know what the expected behavior for
process() with starting state OPEN_READ is, if the read returns 0 bytes.
Somehow there seems to be an error condition triggered, even by the 0
bytes or a timeout. But a timeout is unlikely, because everything takes
only a few milliseconds.

The error condition leads to a call of destroySocket() which in turn
calls poller.removeFromPoller().

So:

- since removeFromPoller() is part of error condition handling, and its
return code is used to determine adjusting the externally tracked size
of the poller and the connection counts, the return code must be
"correct". It is still open to me, whether returning APR_NOTFOUND is a
problem for the conection count tracking.

- we should check, whethera 0 byte read after OPEN_READ really is a
problem or must be expected even for a non-broken socket.

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

Reply via email to