I debugged around my occasional failures for TestCoyoteAdapter when using APR.
Error is: SEVERE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with error [81] : [File descriptor in bad state] In the native world it is: 19372/128: port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8) Err#81 EBADFD The bad FD here is 0xFFFFFFFF. Normally the FDs put into this slot of port_associate during the test are small numbers between 6 and 20 (decimal). Here's a -1 and we must have an error before. Wat we would have expected is FD 11. In the truss snippet, Thread 2 is the client opening the connection, thread 130 accepts it and thread 133 starts to work on it: 19372/2: so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "", SOV_DEFAULT) = 8 Connect 19372/2: connect(8, 0xFDF7DAE0, 16, SOV_DEFAULT) = 0 19372/2: AF_INET name = 127.0.0.1 port = 42932 Accept 19372/130: accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT) = 11 19372/130: AF_INET name = 127.0.0.1 port = 42936 19372/2: getsockname(8, 0xFDF7DAE0, 0xFDF7DA2C, SOV_DEFAULT) = 0 19372/2: AF_INET name = 127.0.0.1 port = 42936 19372/130: fcntl(11, F_GETFD, 0x00000003) = 0 19372/130: fcntl(11, F_SETFD, 0x00000001) = 0 19372/2: setsockopt(8, tcp, TCP_NODELAY, 0xFDF7DE10, 4, SOV_DEFAULT) = 0 19372/130: lwp_cond_signal(0x0031E3B0) = 0 19372/130: condvar type: USYNC_THREAD Send request 19372/2: send(8, " G E T / f o o ; j s e".., 227, 0) = 227 19372/13: lwp_cond_wait(0x0031E3B0, 0x0031E398, 0xB457F278, 0) = 0 19372/13: condvar type: USYNC_THREAD 19372/13: mutex type: USYNC_THREAD 19372/13: timeout: 0.999998773 sec 19372/130: lwp_cond_signal(0x004215F0) = 0 19372/130: condvar type: USYNC_THREAD 19372/133: lwp_cond_wait(0x004215F0, 0x004215D8, 0x00000000, 0) = 0 19372/133: condvar type: USYNC_THREAD 19372/133: mutex type: USYNC_THREAD 19372/133: setsockopt(11, tcp, TCP_NODELAY, 0xB3C7F40C, 4, SOV_DEFAULT) = 0 19372/133: fcntl(11, F_GETFL) = 2 FWRITE 19372/133: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0 19372/133: lwp_cond_signal(0x0031E3B0) = 0 19372/133: condvar type: USYNC_THREAD 19372/133: lwp_cond_signal(0x00274448) = 0 19372/133: condvar type: USYNC_THREAD 19372/128: lwp_cond_wait(0x00274448, 0x00274430, 0xB417F580, 0) = 0 19372/128: condvar type: USYNC_THREAD 19372/128: mutex type: USYNC_THREAD 19372/128: timeout: 9.999997356 sec 19372/128: lwp_cond_signal(0x0031E3B0) = 0 19372/128: condvar type: USYNC_THREAD 19372/128: lwp_cond_signal(0x0031E3B0) = 0 19372/128: condvar type: USYNC_THREAD Adding to the pollset with wrong FD -1 instead of 11. 19372/128: port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8) Err#81 EBADFD I added some Java log statements to track calls to the Poll class (please ignore that hey are log level warn): 22-Sep-2013 12:19:12.909 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784] 22-Sep-2013 12:19:12.910 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket 2218784 22-Sep-2013 12:19:12.911 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 Here begins the root cause: - start polling poller of size 1, containing only socket 2218784: 22-Sep-2013 12:19:12.912 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number 0 of size 1 with timeout 2000 - start removing from the same poller concurrently in the exec thread: 22-Sep-2013 12:19:12.914 WARNING [http-apr-127.0.0.1-auto-13-exec-10] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing from poller number 0 of size 1 socket 2856392 - polling returns with one fd and decreases size to 0: 22-Sep-2013 12:19:12.913 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number 0 returned with 1 - remove returns with success and decreases size to -1! : 22-Sep-2013 12:19:12.915 WARNING [http-apr-127.0.0.1-auto-13-exec-10] org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing from poller number 0 returned with 0 22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket [2,218,784] for event(s) [1] 22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-exec-10] org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,856,392], doIt [true] 22-Sep-2013 12:19:12.929 FINE [http-apr-127.0.0.1-auto-13-exec-10] org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,218,784], timeout [3,000], flags [1] 22-Sep-2013 12:19:12.930 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784] Here the poller data structure is corrupt: it says size is -1! The value printed is "actualPollerSize-pollerSpace[i]", so pollerSpace[i] was incremented once to many. 22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 of size -1 socket 2218784 22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 22-Sep-2013 12:19:22.935 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [2,218,784 22-Sep-2013 12:19:22.937 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,218,784], doIt [true] 22-Sep-2013 12:19:22.940 FINE [http-apr-127.0.0.1-auto-13-Acceptor-0] org.apache.tomcat.util.net.AprEndpoint.processSocketWithOptions socket [2,848,192] 22-Sep-2013 12:19:22.947 FINE [http-apr-127.0.0.1-auto-13-exec-2] org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,848,192], timeout [3,000], flags [1] 22-Sep-2013 12:19:22.948 FINE [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,848,192] 22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 of size 0 socket 2848192 22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to poller number 0 returned with 0 22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number 0 of size 1 with timeout 2000 Here we get EBADFD (-81 on Solaris): 22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller] org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number 0 returned with -81 Not sure whether the problem is more in the concurrent poll plus remove, or the following code in poll.c: TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset, jlong socket) { apr_pollfd_t fd; apr_status_t rv; tcn_pollset_t *p = J2P(pollset, tcn_pollset_t *); tcn_socket_t *s = J2P(socket, tcn_socket_t *); UNREFERENCED_STDARGS; TCN_ASSERT(socket != 0); if (s->pe == NULL) { /* Already removed */ return APR_SUCCESS; } Here we return APR_SUCCESS and the code calling Poll.remove in AprEndpoint always does: rv = Poll.remove(pollers[i], socket); if (rv != Status.APR_NOTFOUND) { pollerSpace[i]++; connectionCount--; break; } So the pollerSpace and connectionCount numbers are (in/de)cremented. I don't have a real solution yet, because I'm not sure which of the two observations is the real culprit, concurrency or "wrong" return code. Regards, Rainer --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org For additional commands, e-mail: dev-h...@tomcat.apache.org