I debugged around my occasional failures for TestCoyoteAdapter when
using APR.

Error is:

SEVERE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
error [81] : [File descriptor in bad state]


In the native world it is:

19372/128:      port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD

The bad FD here is 0xFFFFFFFF. Normally the FDs put into this slot of
port_associate during the test are small numbers between 6 and 20
(decimal). Here's a -1 and we must have an error before. Wat we would
have expected is FD 11. In the truss snippet, Thread 2 is the client
opening the connection, thread 130 accepts it and thread 133 starts to
work on it:

19372/2:        so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "",
SOV_DEFAULT) = 8


Connect

19372/2:        connect(8, 0xFDF7DAE0, 16, SOV_DEFAULT)         = 0
19372/2:                AF_INET  name = 127.0.0.1  port = 42932

Accept

19372/130:      accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT)  = 11
19372/130:              AF_INET  name = 127.0.0.1  port = 42936
19372/2:        getsockname(8, 0xFDF7DAE0, 0xFDF7DA2C, SOV_DEFAULT) = 0
19372/2:                AF_INET  name = 127.0.0.1  port = 42936
19372/130:      fcntl(11, F_GETFD, 0x00000003)                  = 0
19372/130:      fcntl(11, F_SETFD, 0x00000001)                  = 0
19372/2:        setsockopt(8, tcp, TCP_NODELAY, 0xFDF7DE10, 4,
SOV_DEFAULT) = 0
19372/130:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/130:              condvar type: USYNC_THREAD

Send request

19372/2:        send(8, " G E T   / f o o ; j s e".., 227, 0)   = 227
19372/13:       lwp_cond_wait(0x0031E3B0, 0x0031E398, 0xB457F278, 0) = 0
19372/13:               condvar type: USYNC_THREAD
19372/13:               mutex type: USYNC_THREAD
19372/13:               timeout: 0.999998773 sec
19372/130:      lwp_cond_signal(0x004215F0)                     = 0
19372/130:              condvar type: USYNC_THREAD
19372/133:      lwp_cond_wait(0x004215F0, 0x004215D8, 0x00000000, 0) = 0
19372/133:              condvar type: USYNC_THREAD
19372/133:              mutex type: USYNC_THREAD
19372/133:      setsockopt(11, tcp, TCP_NODELAY, 0xB3C7F40C, 4,
SOV_DEFAULT) = 0
19372/133:      fcntl(11, F_GETFL)                              = 2
                FWRITE
19372/133:      fcntl(11, F_SETFL, FWRITE|FNONBLOCK)            = 0

19372/133:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/133:              condvar type: USYNC_THREAD
19372/133:      lwp_cond_signal(0x00274448)                     = 0
19372/133:              condvar type: USYNC_THREAD
19372/128:      lwp_cond_wait(0x00274448, 0x00274430, 0xB417F580, 0) = 0
19372/128:              condvar type: USYNC_THREAD
19372/128:              mutex type: USYNC_THREAD
19372/128:              timeout: 9.999997356 sec
19372/128:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/128:              condvar type: USYNC_THREAD
19372/128:      lwp_cond_signal(0x0031E3B0)                     = 0
19372/128:              condvar type: USYNC_THREAD

Adding to the pollset with wrong FD -1 instead of 11.

19372/128:      port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD

I added some Java log statements to track calls to the Poll class
(please ignore that hey are log level warn):

22-Sep-2013 12:19:12.909 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]
22-Sep-2013 12:19:12.910 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2218784
22-Sep-2013 12:19:12.911 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0

Here begins the root cause:

- start polling poller of size 1, containing only socket 2218784:

22-Sep-2013 12:19:12.912 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000

- start removing from the same poller concurrently in the exec thread:

22-Sep-2013 12:19:12.914 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 of size 1 socket 2856392

- polling returns with one fd and decreases size to 0:

22-Sep-2013 12:19:12.913 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with 1

- remove returns with success and decreases size to -1! :

22-Sep-2013 12:19:12.915 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 returned with 0


22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket
[2,218,784] for event(s) [1]
22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,856,392],
doIt [true]
22-Sep-2013 12:19:12.929 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,218,784],
timeout [3,000], flags [1]
22-Sep-2013 12:19:12.930 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]


Here the poller data structure is corrupt: it says size is -1!
The value printed is "actualPollerSize-pollerSpace[i]", so
pollerSpace[i] was incremented once to many.


22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size -1 socket 2218784
22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.935 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [2,218,784
22-Sep-2013 12:19:22.937 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,218,784],
doIt [true]
22-Sep-2013 12:19:22.940 FINE [http-apr-127.0.0.1-auto-13-Acceptor-0]
org.apache.tomcat.util.net.AprEndpoint.processSocketWithOptions socket
[2,848,192]
22-Sep-2013 12:19:22.947 FINE [http-apr-127.0.0.1-auto-13-exec-2]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,848,192],
timeout [3,000], flags [1]
22-Sep-2013 12:19:22.948 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,848,192]
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2848192
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000

Here we get EBADFD (-81 on Solaris):

22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with -81


Not sure whether the problem is more in the concurrent poll plus remove,
or the following code in poll.c:

TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
                                       jlong socket)
{
    apr_pollfd_t fd;
    apr_status_t rv;
    tcn_pollset_t *p = J2P(pollset,  tcn_pollset_t *);
    tcn_socket_t  *s = J2P(socket, tcn_socket_t *);

    UNREFERENCED_STDARGS;
    TCN_ASSERT(socket != 0);

    if (s->pe == NULL) {
        /* Already removed */
        return APR_SUCCESS;
    }

Here we return APR_SUCCESS and the code calling Poll.remove in
AprEndpoint always does:

                    rv = Poll.remove(pollers[i], socket);
                    if (rv != Status.APR_NOTFOUND) {
                        pollerSpace[i]++;
                        connectionCount--;
                        break;
                    }

So the pollerSpace and connectionCount numbers are (in/de)cremented.

I don't have a real solution yet, because I'm not sure which of the two
observations is the real culprit, concurrency or "wrong" return code.

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

Reply via email to