Re: PR 15282 AcceptEx problem
William A. Rowe, Jr. wrote:
>>Just to summarize, there are three conditions we need to consider:
>>1) we hit the TransmitFile recycle bug many times in a row
>>2) we have encountered an incompatible firewall or VPN
>>3) the IP address has changed
>
>You seem to have the failcases easily reproduced. Would you tack in some
>quick code that simply uses getsockopt(foo) (any option you like) to see
>if simply getting socket options for a now-broken listen socket will fail?

Actually I have not been able to reproduce the AcceptEx error for 3),
however I think the following will address all three cases and
introduces the WindowsSocketsWorkaround directive:

Index: mpm/winnt/child.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm/winnt/child.c,v
retrieving revision 1.13
diff -u -d -b -r1.13 child.c
--- mpm/winnt/child.c 28 Feb 2003 14:02:42 - 1.13
+++ mpm/winnt/child.c 3 Mar 2003 22:31:15 -
@@ -498,7 +498,7 @@
     PCOMP_CONTEXT context = NULL;
     DWORD BytesRead;
     SOCKET nlsd;
-    int rv;
+    int rv, err_count = 0;
 
     apr_os_sock_get(&nlsd, lr->sd);
 
@@ -538,15 +538,38 @@
             rv = apr_get_netos_error();
             if ((rv == APR_FROM_OS_ERROR(WSAEINVAL)) ||
                 (rv == APR_FROM_OS_ERROR(WSAENOTSOCK))) {
-                /* Hack alert. Occasionally, TransmitFile will not recycle the
-                 * accept socket (usually when the client disconnects early).
-                 * Get a new socket and try the call again.
+                /* Hack alert, we can get here because:
+                 * 1) Occasionally, TransmitFile will not recycle the accept socket
+                 *    (usually when the client disconnects early).
+                 * 2) There is VPN or Firewall software installed with buggy AcceptEx implementation
+                 * 3) The webserver is using a dynamic address and it has changed
                  */
+                Sleep(0);
+                if (++err_count > 1000) {
+                    apr_int32_t disconnected;
+
+                    /* abitrary socket call to test if the Listening socket is still valid */
+                    apr_status_t listen_rv = apr_socket_opt_get(lr->sd, APR_SO_DISCONNECTED, &disconnected);
+
+                    if (listen_rv == APR_SUCCESS) {
+                        ap_log_error(APLOG_MARK,APLOG_ERR, listen_rv, ap_server_conf,
+                                     "AcceptEx error: If this occurs constantly and NO requests are being served "
+                                     "try using the WindowsSocketsWorkaround directive set to 'on'.");
+                        err_count = 0;
+                    }
+                    else {
+                        ap_log_error(APLOG_MARK,APLOG_ERR, listen_rv, ap_server_conf,
+                                     "The Listening socket is no longer valid. Dynamic address changed?");
+                        break;
+                    }
+                }
+
                 closesocket(context->accept_socket);
                 context->accept_socket = INVALID_SOCKET;
                 ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
-                             "winnt_accept: AcceptEx failed due to early client "
-                             "disconnect. Reallocate the accept socket and try again.");
+                             "winnt_accept: AcceptEx failed, either early client disconnect, "
+                             "dynamic address renewal, or incompatible VPN or Firewall software.");
+
                 continue;
             }
             else if ((rv != APR_FROM_OS_ERROR(ERROR_IO_PENDING)) &&
@@ -558,6 +581,7 @@
                 Sleep(100);
                 continue;
             }
+            err_count = 0;
 
             /* Wait for pending i/o.
              * Wake up once per second to check for shutdown .
@@ -701,7 +725,7 @@
         ap_update_child_status_from_indexes(0, thread_num, SERVER_READY, NULL);
 
         /* Grab a connection off the network */
-        if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS) {
+        if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS || windows_sockets_workaround == 1) {
             context = win9x_get_connection(context);
         }
         else {
@@ -769,7 +793,7 @@
 static void create_listener_thread()
 {
     int tid;
-    if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS) {
+    if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS || windows_sockets_workaround == 1) {
         _beginthreadex(NULL, 0, (LPTHREAD_START_ROUTINE) win9x_accept, NULL, 0, &tid);
     }
     else {
@@ -840,7 +864,7 @@
      * Create the worker thread dispatch IOCompletionPort
      * on Windows NT/2000
      */
-    if (osver.dwPlatformId != VER_PLATFORM_WIN32_WINDOWS) {
+    if (osver.dwPlatformId != VER_PLATFORM_WIN32_WINDOWS && windows_sockets_workaround != 1) {
         /* Create the worker thread dispatch IOCP */
         ThreadDispatchIOCP = CreateIoC
Re: PR 15282 AcceptEx problem
At 05:02 PM 2/28/2003, Allan Edwards wrote:
>Based on the IP address renewal scenario you mention below, testing the
>Listen socket (somehow, tbd) sounds like a good idea.
>
>Just to summarize, there are three conditions we need to consider:
>1) we hit the TransmitFile recycle bug many times in a row
>2) we have encountered an incompatible firewall or VPN
>3) the IP address has changed

You seem to have the failcases easily reproduced. Would you tack in some
quick code that simply uses getsockopt(foo) (any option you like) to see
if simply getting socket options for a now-broken listen socket will fail?

A simple log message "getsockopt fails as expected" would be perfect.
Just see if you can tickle the bug and test both the listen and accept
socket. If the listen socket demonstrates the brokenness, we are good
to go. If not, well, then the code gets ugly :-)

>>Does accept() also fail? Can we use the 9x code to work around these
>>sorts of problems?
>No, accept() is fine. Using the 9x path *may* work but I haven't
>tested it. The other option Bill S. suggested was to add a directive
>that forces the 9x path. I tend to think that is preferable than a
>run time decision because I'm not sure we can reliably determine
>which path to take at runtime.
>Note: taking the 9x path is only relevant to case 2) above.

++1 for some WindowsSocketsWorkaround on|off flag would be terrific!!!
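A minimal sketch of the kind of probe requested above, written against POSIX sockets rather than Winsock so that it can be compiled and run anywhere; that substitution is an assumption, not what the thread tested. On Windows the descriptor would be a SOCKET and the error would come from WSAGetLastError() (WSAENOTSOCK and friends) instead of errno. The names probe_socket, probe_report and run_probe_demo are hypothetical, and the choice of SO_TYPE is arbitrary ("any option you like").

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Issue an arbitrary getsockopt() call on fd.
 * Returns 0 if the descriptor still behaves like a socket, else errno. */
static int probe_socket(int fd)
{
    int type = 0;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0)
        return 0;
    return errno;
}

/* Probe results for three kinds of descriptor (hypothetical demo type). */
struct probe_report {
    int live_socket;    /* probe of an open socket */
    int closed_socket;  /* probe of the same number after close() */
    int not_a_socket;   /* probe of a pipe end */
};

/* Fill *out with the three probe results.
 * Returns 0 on success, -1 if the demo descriptors could not be created. */
static int run_probe_demo(struct probe_report *out)
{
    int pipefd[2];
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (pipe(pipefd) != 0) {
        close(fd);
        return -1;
    }
    out->live_socket = probe_socket(fd);
    close(fd);
    out->closed_socket = probe_socket(fd);       /* descriptor is now invalid */
    out->not_a_socket = probe_socket(pipefd[0]); /* valid fd, but not a socket */
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```

Whether a listen socket invalidated by an address change actually fails such a probe on Windows is exactly the open question in this message; the sketch only shows the mechanics of asking.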
Re: PR 15282 AcceptEx problem
William A. Rowe, Jr. wrote:
>This patch can't be applied... it actually introduces a denial of service
>problem if folks can simply early-disconnect a server some half dozen

actually 100 :)

>times in a row... It isn't hard to work up such a tool.

If it is possible for someone to externally tickle the TransmitFile
socket recycle bug then I agree.

>Better; what if we test *which* socket failed. We are sort of helpless
>when the errors could be either the Listen or Accept socket. If the error
>is on the Listen socket, we should exit signaling the parent to do a
>restart with new listeners, if the error is on the accept socket we can
>just keep going.

Based on the IP address renewal scenario you mention below, testing the
Listen socket (somehow, tbd) sounds like a good idea.

Just to summarize, there are three conditions we need to consider:
1) we hit the TransmitFile recycle bug many times in a row
2) we have encountered an incompatible firewall or VPN
3) the IP address has changed

>Instead, can we find some patch that will test AcceptEx? Perhaps we create
>a single local listen and attempt to connect and write to it, test that the
>AcceptEx succeeds, and otherwise emit some nasty warnings and throw a flag
>that puts us into the Win9x listener code?

Testing AcceptEx is not easy, the failure only occurs when duplicating
the socket between processes. But maybe testing the Listen socket
provides us with enough information to indicate what the problem might
be and suggest or perform corrective action.

>Does accept() also fail? Can we use the 9x code to work around these
>sorts of problems?

No, accept() is fine. Using the 9x path *may* work but I haven't
tested it. The other option Bill S. suggested was to add a directive
that forces the 9x path. I tend to think that is preferable than a
run time decision because I'm not sure we can reliably determine
which path to take at runtime.
Note: taking the 9x path is only relevant to case 2) above.

>I don't as much mind the Sleep(100) or even Sleep(0) so that we relinquish
>clock cycles. It's the arbitrary "foil the server 100 times and it will
>exit" problem.

OK, so we can log a msg & continue instead of exiting. Since we may
not be able to guarantee a false positive maybe we should modify the
error message and say that "if NO requests are being served it is
probably a firewall or VPN problem", but continue the accept loop.
However, prior to logging this message we would need to test the
Listen socket and, if it is bad, log a message saying that the IP
address has probably become invalid, then exit the child and let
the parent renew the Listeners.

>Because those only occur once the listen socket becomes invalidated, due
>to DHCP or some other change. You can trigger by reconfiguring TCP/IP to
>switch between two IP addresses.
>
>Again, we can recover gracefully if we ask the parent to do a respawn
>upon recreating all of *its* listeners.

i.e. whenever we hit some threshold of consecutive AcceptEx errors
test the Listening socket (tbd somehow), and exit the child if it
is bad.

Allan
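The policy described in this message (count consecutive AcceptEx failures, probe the listening socket only once a threshold is crossed, then either warn and keep going or exit the child) can be sketched as a small piece of pure decision logic. This is an illustration under stated assumptions, not the committed code: the enum, the function names and the threshold value are all hypothetical, and the listener probe itself is left to the caller.

```c
#include <stdbool.h>

/* What the accept loop should do after a failed AcceptEx (hypothetical). */
enum accept_action {
    ACCEPT_WARN_AND_RETRY,  /* listener is fine: log a hint, keep looping */
    ACCEPT_EXIT_CHILD       /* listener is bad: exit, parent renews listeners */
};

/* Illustrative threshold of consecutive failures before probing. */
#define ERR_THRESHOLD 1000

/* Count one more consecutive failure.
 * Returns true when the caller should now probe the listening socket. */
static bool threshold_reached(int *err_count)
{
    return ++*err_count > ERR_THRESHOLD;
}

/* Decide what to do once the listener has been probed.
 * A healthy listener resets the counter so the warning is not logged
 * on every subsequent failure. */
static enum accept_action classify_after_probe(int *err_count, bool listener_ok)
{
    if (!listener_ok)
        return ACCEPT_EXIT_CHILD;
    *err_count = 0;
    return ACCEPT_WARN_AND_RETRY;
}

/* Any successful accept ends the run of consecutive failures. */
static void on_accept_success(int *err_count)
{
    *err_count = 0;
}
```

Keeping the probe out of the counting function means the listener is only tested when the threshold is crossed, which matches the "whenever we hit some threshold" wording above; below the threshold the loop simply retries.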
Re: PR 15282 AcceptEx problem
This patch can't be applied... it actually introduces a denial of service
problem if folks can simply early-disconnect a server some half dozen
times in a row... It isn't hard to work up such a tool.

Better; what if we test *which* socket failed. We are sort of helpless
when the errors could be either the Listen or Accept socket. If the error
is on the Listen socket, we should exit signaling the parent to do a
restart with new listeners, if the error is on the accept socket we can
just keep going.

More thoughts inline...

At 01:34 PM 2/27/2003, Allan Edwards wrote:
>As far as I can tell this is a bug in the Sprint
>PCS Connect support for AcceptEx, (they install a
>Winsock transport provider called BMI). However, it slips
>through our checks and causes the accept thread to
>hard loop and consume most of the cpu.

Instead, can we find some patch that will test AcceptEx? Perhaps we create
a single local listen and attempt to connect and write to it, test that the
AcceptEx succeeds, and otherwise emit some nasty warnings and throw a flag
that puts us into the Win9x listener code?

>What happens is that in get_listeners_from_parent()
>WSASocket *succeeds* using the WSAProtocolInfo from
>the parent however, AcceptEx in winnt_accept() fails
>with WSAENOTSOCK.

Does accept() also fail? Can we use the 9x code to work around these
sorts of problems?

>I don't see what we can do to fix this but we should
>at least avoid hogging the cpu and log an informative
>message. Unless there is a better idea I'll commit to 2.1

I don't as much mind the Sleep(100) or even Sleep(0) so that we relinquish
clock cycles. It's the arbitrary "foil the server 100 times and it will
exit" problem.

>16327 may be related but I haven't been able to recreate
>the problem with BlackIce or Norton Personal Firewall.

Because those only occur once the listen socket becomes invalidated, due
to DHCP or some other change. You can trigger by reconfiguring TCP/IP to
switch between two IP addresses.

Again, we can recover gracefully if we ask the parent to do a respawn
upon recreating all of *its* listeners. If the parent can test that the
listeners are healthy (with some simple setsockopt call) then we can just
leave it to the child to exit. As long as the parent performs a listener
health check before each child process spawn, we should be much better
off than we are today.

Bill
Re: PR 15282 AcceptEx problem
>Perhaps we need a winnt mpm directive to force the server to use the
>Win9* accept code path. Would be a terrible thing to do on a production
>level server (for performance reasons) but quite okay for most of the
>folks that are seeing personal firewalls collide with our use of AcceptEx.

mmm... that might work. PCS Connect has no problem with the accept() call.

Allan
Re: PR 15282 AcceptEx problem
Allan Edwards wrote:
>Bill Stoddard wrote:
>>Humm... how do our friends at MS solve this in IIS?
>
>It only happens because of our parent-child process model.
>If you run -X the problem goes away. It's the socket duplication
>that seems to bite us.
>
>Allan

Perhaps we need a winnt mpm directive to force the server to use the
Win9* accept code path. Would be a terrible thing to do on a production
level server (for performance reasons) but quite okay for most of the
folks that are seeing personal firewalls collide with our use of AcceptEx.

Bill
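A rough sketch of how such a directive could feed the choice of accept path, assuming an on|off argument as floated elsewhere in this thread. Everything here is hypothetical (the enum, the function names, the parsing rules); it is plain C with no httpd or Windows headers, so it only models the decision, not the directive registration or the actual accept code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* The two accept implementations discussed in the thread (hypothetical names). */
enum accept_path {
    ACCEPT_PATH_ACCEPTEX,  /* NT path: AcceptEx plus completion port */
    ACCEPT_PATH_WIN9X      /* 9x path: plain accept() */
};

/* Case-insensitive comparison of two short ASCII strings. */
static bool equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/* Parse the directive argument.
 * Returns false (leaving *enabled untouched) for anything but on/off. */
static bool parse_workaround_flag(const char *arg, bool *enabled)
{
    if (arg == NULL || enabled == NULL)
        return false;
    if (equals_ignore_case(arg, "on")) {
        *enabled = true;
        return true;
    }
    if (equals_ignore_case(arg, "off")) {
        *enabled = false;
        return true;
    }
    return false;
}

/* Take the 9x path on a 9x platform or whenever the admin forces it. */
static enum accept_path choose_accept_path(bool is_win9x, bool workaround_enabled)
{
    return (is_win9x || workaround_enabled) ? ACCEPT_PATH_WIN9X
                                            : ACCEPT_PATH_ACCEPTEX;
}
```

Making this a configuration choice rather than a runtime guess reflects the concern raised in the thread that the server may not be able to detect reliably which path is safe.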
Re: PR 15282 AcceptEx problem
Bill Stoddard wrote:
>Humm... how do our friends at MS solve this in IIS?

It only happens because of our parent-child process model.
If you run -X the problem goes away. It's the socket duplication
that seems to bite us.

Allan
Re: PR 15282 AcceptEx problem
Humm... how do our friends at MS solve this in IIS?

Bill

Allan Edwards wrote:
>As far as I can tell this is a bug in the Sprint
>PCS Connect support for AcceptEx, (they install a
>Winsock transport provider called BMI). However, it slips
>through our checks and causes the accept thread to
>hard loop and consume most of the cpu.
>
>What happens is that in get_listeners_from_parent()
>WSASocket *succeeds* using the WSAProtocolInfo from
>the parent however, AcceptEx in winnt_accept() fails
>with WSAENOTSOCK.
>
>I don't see what we can do to fix this but we should
>at least avoid hogging the cpu and log an informative
>message. Unless there is a better idea I'll commit to 2.1
>
>16327 may be related but I haven't been able to recreate
>the problem with BlackIce or Norton Personal Firewall.
>
>Allan
>
>Index: child.c
>===================================================================
>RCS file: /home/cvs/httpd-2.0/server/mpm/winnt/child.c,v
>retrieving revision 1.12
>diff -u -d -b -r1.12 child.c
>--- child.c 26 Feb 2003 21:55:54 - 1.12
>+++ child.c 27 Feb 2003 16:38:59 -
>@@ -498,7 +498,7 @@
>     PCOMP_CONTEXT context = NULL;
>     DWORD BytesRead;
>     SOCKET nlsd;
>-    int rv;
>+    int rv, err_count = 0;
> 
>     apr_os_sock_get(&nlsd, lr->sd);
> 
>@@ -547,6 +547,14 @@
>                 ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
>                              "winnt_accept: AcceptEx failed due to early client "
>                              "disconnect. Reallocate the accept socket and try again.");
>+
>+                Sleep(100);
>+                if (++err_count > 100) {
>+                    ap_log_error(APLOG_MARK,APLOG_ERR, rv, ap_server_conf,
>+                                 "AcceptEx unrecoverable error, "
>+                                 "possibly incompatible firewall or VPN software is installed.");
>+                    break;
>+                }
>                 continue;
>             }
>             else if ((rv != APR_FROM_OS_ERROR(ERROR_IO_PENDING)) &&
>@@ -558,6 +566,7 @@
>                 Sleep(100);
>                 continue;
>             }
>+            err_count = 0;
> 
>             /* Wait for pending i/o.
>              * Wake up once per second to check for shutdown .
PR 15282 AcceptEx problem
As far as I can tell this is a bug in the Sprint
PCS Connect support for AcceptEx, (they install a
Winsock transport provider called BMI). However, it slips
through our checks and causes the accept thread to
hard loop and consume most of the cpu.

What happens is that in get_listeners_from_parent()
WSASocket *succeeds* using the WSAProtocolInfo from
the parent however, AcceptEx in winnt_accept() fails
with WSAENOTSOCK.

I don't see what we can do to fix this but we should
at least avoid hogging the cpu and log an informative
message. Unless there is a better idea I'll commit to 2.1

16327 may be related but I haven't been able to recreate
the problem with BlackIce or Norton Personal Firewall.

Allan

Index: child.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm/winnt/child.c,v
retrieving revision 1.12
diff -u -d -b -r1.12 child.c
--- child.c 26 Feb 2003 21:55:54 - 1.12
+++ child.c 27 Feb 2003 16:38:59 -
@@ -498,7 +498,7 @@
     PCOMP_CONTEXT context = NULL;
     DWORD BytesRead;
     SOCKET nlsd;
-    int rv;
+    int rv, err_count = 0;
 
     apr_os_sock_get(&nlsd, lr->sd);
 
@@ -547,6 +547,14 @@
                 ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
                              "winnt_accept: AcceptEx failed due to early client "
                              "disconnect. Reallocate the accept socket and try again.");
+
+                Sleep(100);
+                if (++err_count > 100) {
+                    ap_log_error(APLOG_MARK,APLOG_ERR, rv, ap_server_conf,
+                                 "AcceptEx unrecoverable error, "
+                                 "possibly incompatible firewall or VPN software is installed.");
+                    break;
+                }
                 continue;
             }
             else if ((rv != APR_FROM_OS_ERROR(ERROR_IO_PENDING)) &&
@@ -558,6 +566,7 @@
                 Sleep(100);
                 continue;
             }
+            err_count = 0;
 
             /* Wait for pending i/o.
              * Wake up once per second to check for shutdown .
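The shape of this first patch (pause after each failure instead of spinning, give up after too many consecutive failures) can be shown in isolation with a stand-in for the failing call. This is a sketch under assumptions, not the httpd code: accept_fn, fail_n_times, pause_ms and accept_with_limit are hypothetical names, the pause uses POSIX nanosleep() where the patch uses the Windows Sleep() call, and the stand-in fails a fixed number of times rather than calling AcceptEx.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#endif
#include <stddef.h>
#include <time.h>

/* One accept attempt: returns 0 on success, non-zero on failure. */
typedef int (*accept_fn)(void *ctx);

/* Stand-in for a flaky accept call: fails while *ctx (an int) is positive. */
static int fail_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        --*remaining;
        return -1;
    }
    return 0;
}

/* Sleep for ms milliseconds; does nothing when ms is not positive. */
static void pause_ms(long ms)
{
    struct timespec ts;

    if (ms <= 0)
        return;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Retry try_accept until it succeeds, pausing between attempts.
 * Returns the number of failures seen before the success, or -1 once
 * more than max_failures consecutive failures have occurred. */
static int accept_with_limit(accept_fn try_accept, void *ctx,
                             int max_failures, long delay_ms)
{
    int err_count = 0;

    for (;;) {
        if (try_accept(ctx) == 0)
            return err_count;
        if (++err_count > max_failures)
            return -1;          /* caller logs the error and leaves the loop */
        pause_ms(delay_ms);     /* avoid hard-looping on the CPU */
    }
}
```

For example, with a stand-in that fails three times, accept_with_limit(fail_n_times, &remaining, 100, 100) pauses three times and then reports three failures; with a stand-in that never recovers it returns -1 after the limit is passed, which is the behaviour later criticised in the thread as a possible denial of service.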