Re: anybody seeing socket-related segfaults today?
Jeff Trawick <[EMAIL PROTECTED]> writes:

> Jeff Trawick <[EMAIL PROTECTED]> writes:
>
> > Throughout today I've been seeing very intermittent regression
> > failures on AIX. The segfault happens when trying to get the IP
> > address string from a socket addr.
> >
> > core_create_conn() calls apr_socket_addr_get(), which returns
> > APR_SUCCESS. But somehow we have NULL for the returned socket address
> > so apr_sockaddr_ip_get() bombs.
>
> The immediate cause of the problem is that ap_queue_pop() is returning
> EINVAL and worker_thread() didn't react to that and instead tried to
> process the would-be socket.
>
> I suspect that the EINVAL from ap_queue_pop() is from trying to use an
> invalid (cleaned up?) pthread mutex. AIX tends to notice errors on
> mutexes and fail the call rather than venturing into unpredictable
> behavior.

Yep, the mutex has already been cleaned up. It is the mutex unlock operation that fails.

This is termination (ungraceful). We don't wait for worker threads to terminate; sometimes the main thread has cleaned up pchild and bailed by the time the worker threads get dispatched from the interrupt-all and then release the mutex.

--
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...
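A minimal sketch of the shutdown ordering that the last paragraph says was missing, assuming plain POSIX threads rather than APR: the main thread flags termination, wakes every worker (the "interrupt-all"), joins them, and only then destroys the mutex. The names work_queue, worker_main, run_and_shutdown and MAX_WORKERS are hypothetical; this is not the worker MPM's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WORKERS 8  /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for the queue state whose mutex lives in pchild. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    bool            terminating;
} work_queue;

/* Worker: sleep until told to terminate, then release the mutex.
 * The unlock status is returned so the main thread can check it. */
static void *worker_main(void *arg)
{
    work_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    while (!q->terminating)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int rc = pthread_mutex_unlock(&q->lock);  /* the call that failed on AIX */
    return (void *)(intptr_t)rc;
}

/* Wake every worker, wait for all of them to exit, and only then destroy
 * the mutex and condition variable. Returns how many final unlocks failed. */
static int run_and_shutdown(int nworkers)
{
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);

    work_queue q = { .terminating = false };
    pthread_t tids[MAX_WORKERS];
    int started = 0;
    int failures = 0;

    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    for (; started < nworkers; started++)
        if (pthread_create(&tids[started], NULL, worker_main, &q) != 0)
            break;

    /* The "interrupt-all": flag termination and wake every waiter. */
    pthread_mutex_lock(&q.lock);
    q.terminating = true;
    pthread_cond_broadcast(&q.not_empty);
    pthread_mutex_unlock(&q.lock);

    /* The missing step: wait for the workers before tearing anything down. */
    for (int i = 0; i < started; i++) {
        void *ret = NULL;
        pthread_join(tids[i], &ret);
        if ((intptr_t)ret != 0)
            failures++;
    }

    /* Safe now: no thread can still be using these objects. */
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return failures;
}
```

Destroying the mutex before the join loop would reproduce the race described above, where a worker's final unlock can hit an already-destroyed mutex.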
Re: anybody seeing socket-related segfaults today?
Jeff Trawick <[EMAIL PROTECTED]> writes:

> Throughout today I've been seeing very intermittent regression
> failures on AIX. The segfault happens when trying to get the IP
> address string from a socket addr.
>
> core_create_conn() calls apr_socket_addr_get(), which returns
> APR_SUCCESS. But somehow we have NULL for the returned socket address
> so apr_sockaddr_ip_get() bombs.

The immediate cause of the problem is that ap_queue_pop() is returning EINVAL and worker_thread() didn't react to that and instead tried to process the would-be socket.

I suspect that the EINVAL from ap_queue_pop() is from trying to use an invalid (cleaned up?) pthread mutex. AIX tends to notice errors on mutexes and fail the call rather than venturing into unpredictable behavior.

I just committed a change to worker to not process the socket if rv != APR_SUCCESS. Previously we avoided processing the socket only if rv == APR_EINTR or csd is NULL. (But there is no logic in ap_queue_pop() or its caller to set csd to NULL on the EINVAL error!)

I recall the fix to check for csd == NULL being very helpful a couple of months back. I hope rv was non-zero in that case (i.e., I hope that problem is still fixed)!

--
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...
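A sketch of why the old test let the bad socket through, using hypothetical stand-ins (fake_socket, pop_with_invalid_mutex, EINTR in place of APR_EINTR, and status 0 in place of APR_SUCCESS) rather than the real ap_queue_pop() signature. When the pop fails without clearing csd, a stale non-NULL pointer passes the old check but not the new one. Keeping the csd == NULL test next to the rv test is an extra guard in this sketch; the message does not say whether the commit kept it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an accepted socket; status 0 plays APR_SUCCESS. */
typedef struct { const char *ip; } fake_socket;

/* Hypothetical pop that fails the way ap_queue_pop() did on AIX:
 * it reports EINVAL and leaves *csd exactly as the caller had it. */
static int pop_with_invalid_mutex(fake_socket **csd)
{
    (void)csd;
    return EINVAL;
}

/* Old test: skip only on EINTR or a NULL socket. */
static int old_should_process(int rv, const fake_socket *csd)
{
    return !(rv == EINTR || csd == NULL);
}

/* New test: skip on any non-success status; the NULL check is kept
 * here only as an extra guard. */
static int new_should_process(int rv, const fake_socket *csd)
{
    return rv == 0 && csd != NULL;
}

/* Returns 1 if the given test would hand a socket left over from the
 * previous loop iteration to connection processing. */
static int processes_stale_socket(int (*should_process)(int, const fake_socket *))
{
    fake_socket previous = { "192.0.2.1" };
    fake_socket *csd = &previous;          /* stale value from the last pop */
    int rv = pop_with_invalid_mutex(&csd);
    assert(csd == &previous);              /* the failed pop did not reset it */
    return should_process(rv, csd);
}
```

With the old test the stale socket is processed (returns 1); with the new test it is skipped (returns 0).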
Re: anybody seeing socket-related segfaults today?
Aaron Bannert <[EMAIL PROTECTED]> writes:

> Could this change have interfered with Unix?
>
>   Modified:    server   listen.c
>   Log:
>   Here's the patch that really sucks. old_listeners points to an array
>   of apr_socket objects already destroyed by their cleanups, and in any
>   case they now live in invalid memory. Extend their lifetimes.

I've thought about it a few times :) I don't see any connection at the moment though.

I see from your post there's probably some other bad stuff happening with pools today, or we're both getting bit by the same problem :)

--
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...
Re: anybody seeing socket-related segfaults today?
Could this change have interfered with Unix?

  Modified:    server   listen.c
  Log:
  Here's the patch that really sucks. old_listeners points to an array
  of apr_socket objects already destroyed by their cleanups, and in any
  case they now live in invalid memory. Extend their lifetimes.

  This implies that the process pool grows on every restart for no good
  reason. One possible solution is to let the old pconf survive until the
  new pconf is alive. Another is to create the listeners in a subpool of
  process->pool, destroyed after the old_listeners are closed. Either
  which way, a better solution exists, but this closes the immediate bug.

  [How haven't we been segfaulting in unix on restarts before this patch,
  gurus?]

  Revision  Changes    Path
  1.77      +4 -5      httpd-2.0/server/listen.c

-aaron

On Wed, Mar 20, 2002 at 04:07:23PM -0500, Jeff Trawick wrote:
> Throughout today I've been seeing very intermittent regression
> failures on AIX. The segfault happens when trying to get the IP
> address string from a socket addr.
>
> core_create_conn() calls apr_socket_addr_get(), which returns
> APR_SUCCESS. But somehow we have NULL for the returned socket address
> so apr_sockaddr_ip_get() bombs.
>
> It is intermittent, doesn't seem to matter what kind of request, and
> I've only seen it on a couple of AIX boxes. Probably a pool misuse of
> some sort :)
>
> The earliest I saw it happen was 8:00 EST today, but prior to that the
> server wouldn't build on AIX for some hours, so I don't know when the
> problem was introduced/exposed.
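A toy model of the second alternative in the commit log: each generation of listeners lives in its own subpool, which is destroyed after the old listeners are closed. The listener_pool type and its functions are hypothetical and only count allocations; they are not the APR pool API. What the sketch shows is the ordering: the new generation is built first, the old listeners are closed while their memory is still valid, and only then is the old subpool freed, so the number of live listener objects stays constant across restarts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy model of one listener; not an APR type. */
typedef struct {
    int open;                 /* 1 while the listening socket is open */
} toy_listener;

/* Hypothetical per-generation subpool holding that generation's listeners. */
typedef struct {
    toy_listener *listeners;
    int count;
} listener_pool;

static int live_listeners = 0;  /* listener objects currently allocated */

static listener_pool *listener_pool_create(int count)
{
    assert(count > 0);
    listener_pool *p = malloc(sizeof *p);
    if (p == NULL)
        abort();
    p->listeners = calloc((size_t)count, sizeof *p->listeners);
    if (p->listeners == NULL)
        abort();
    p->count = count;
    for (int i = 0; i < count; i++)
        p->listeners[i].open = 1;
    live_listeners += count;
    return p;
}

static void close_listeners(listener_pool *p)
{
    for (int i = 0; i < p->count; i++)
        p->listeners[i].open = 0;
}

static void listener_pool_destroy(listener_pool *p)
{
    for (int i = 0; i < p->count; i++)
        assert(!p->listeners[i].open);  /* never free a listener still open */
    live_listeners -= p->count;
    free(p->listeners);
    free(p);
}

/* One restart: build the new generation first, then close the old
 * listeners while their memory is still valid, then free the old subpool. */
static listener_pool *restart(listener_pool *old, int count)
{
    listener_pool *fresh = listener_pool_create(count);
    if (old != NULL) {
        close_listeners(old);
        listener_pool_destroy(old);
    }
    return fresh;
}
```

In this model, ten restarts with three listeners each leave live_listeners at 3, whereas allocating every generation from a pool that is never destroyed would leave 30 allocated.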