Re: anybody seeing socket-related segfaults today?

2002-03-20 Thread Jeff Trawick

Jeff Trawick <[EMAIL PROTECTED]> writes:

> Jeff Trawick <[EMAIL PROTECTED]> writes:
> 
> > Throughout today I've been seeing very intermittent regression
> > failures on AIX. The segfault happens when trying to get the IP
> > address string from a socket addr.
> > 
> > core_create_conn() calls apr_socket_addr_get(), which returns
> > APR_SUCCESS.  But somehow we have NULL for the returned socket address
> > so apr_sockaddr_ip_get() bombs.
> 
> The immediate cause of the problem is that ap_queue_pop() is returning
> EINVAL and worker_thread() didn't react to that and instead tried to
> process the would-be socket.
> 
> I suspect that the EINVAL from ap_queue_pop() is from trying to use an
> invalid (cleaned up?) pthread mutex.  AIX tends to notice errors on
> mutexes and fail the call rather than venturing into unpredictable
> behavior.

Yep, the mutex has already been cleaned up, and it is the mutex unlock
operation that fails.  This happens during ungraceful termination: we
don't wait for worker threads to terminate, so sometimes the main
thread has cleaned up pchild and bailed by the time the worker threads
get dispatched from the interrupt-all and then release the mutex.
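
The ordering problem above can be sketched with plain pthreads.  This
is only an illustration, not the worker MPM code: shared_state,
worker_main() and run_workers_then_cleanup() are hypothetical names.
The one point it makes is that the main thread joins every worker
before it destroys the mutex the workers still have to unlock.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_WORKERS 4

/* Hypothetical stand-in for the state guarded by the queue mutex. */
struct shared_state {
    pthread_mutex_t lock;
    int finished;               /* workers that got through the lock */
};

/* Each worker takes the lock, records itself, and releases the lock.
 * The thread's return value is the status of the last mutex call. */
static void *worker_main(void *arg)
{
    struct shared_state *st = arg;
    int rv = pthread_mutex_lock(&st->lock);

    if (rv == 0) {
        st->finished++;
        rv = pthread_mutex_unlock(&st->lock);
    }
    return (void *)(intptr_t)rv;
}

/* Returns the number of workers whose mutex calls all succeeded,
 * or -1 if the mutex or any thread could not be created. */
static int run_workers_then_cleanup(void)
{
    struct shared_state st;
    pthread_t tids[NUM_WORKERS];
    int started = 0, clean = 0, i;

    if (pthread_mutex_init(&st.lock, NULL) != 0)
        return -1;
    st.finished = 0;

    for (i = 0; i < NUM_WORKERS; i++) {
        if (pthread_create(&tids[i], NULL, worker_main, &st) != 0)
            break;
        started++;
    }

    /* Wait for every worker that was started.  Destroying the mutex
     * before this loop would let a late worker lock or unlock a
     * destroyed mutex, which POSIX leaves undefined. */
    for (i = 0; i < started; i++) {
        void *result = NULL;
        if (pthread_join(tids[i], &result) == 0 && (intptr_t)result == 0)
            clean++;
    }

    /* Only now is it safe to tear down the shared mutex. */
    pthread_mutex_destroy(&st.lock);
    return (started == NUM_WORKERS) ? clean : -1;
}
```

Calling run_workers_then_cleanup() should return NUM_WORKERS when every
lock and unlock succeeded.  Whether joining is acceptable during an
ungraceful shutdown is a separate design question that this sketch
does not answer.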

-- 
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...



Re: anybody seeing socket-related segfaults today?

2002-03-20 Thread Jeff Trawick

Jeff Trawick <[EMAIL PROTECTED]> writes:

> Throughout today I've been seeing very intermittent regression
> failures on AIX. The segfault happens when trying to get the IP
> address string from a socket addr.
> 
> core_create_conn() calls apr_socket_addr_get(), which returns
> APR_SUCCESS.  But somehow we have NULL for the returned socket address
> so apr_sockaddr_ip_get() bombs.

The immediate cause of the problem is that ap_queue_pop() is returning
EINVAL and worker_thread() didn't react to that and instead tried to
process the would-be socket.

I suspect that the EINVAL from ap_queue_pop() is from trying to use an
invalid (cleaned up?) pthread mutex.  AIX tends to notice errors on
mutexes and fail the call rather than venturing into unpredictable
behavior.

I just committed a change to worker to not process the socket if rv !=
APR_SUCCESS.  Previously we avoided processing the socket only if rv ==
APR_EINTR or csd was NULL.  (But there is no logic in ap_queue_pop() or
its caller to set csd to NULL on the EINVAL error!)

I recall the fix to check for csd == NULL being very helpful a couple
of months back.  I hope rv was non-zero in that case (i.e., I hope
that problem is still fixed)!
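
The difference between the old and new checks can be shown with a
small self-contained sketch.  It does not use the real APR headers or
the committed code: status_t, ST_SUCCESS, fake_socket,
fake_queue_pop(), should_process_old() and should_process() are
hypothetical stand-ins for apr_status_t, APR_SUCCESS, the socket type,
ap_queue_pop() and the test in worker_thread().

```c
#include <errno.h>
#include <stddef.h>

typedef int status_t;            /* stand-in for apr_status_t */
#define ST_SUCCESS 0             /* stand-in for APR_SUCCESS */

struct fake_socket { int fd; };  /* stand-in for the popped socket */

/* Stand-in for ap_queue_pop().  On failure it returns the error and
 * leaves *sd untouched, mirroring the observation that nothing sets
 * csd to NULL on the EINVAL path. */
static status_t fake_queue_pop(status_t simulated_rv,
                               struct fake_socket *queued,
                               struct fake_socket **sd)
{
    if (simulated_rv != ST_SUCCESS)
        return simulated_rv;     /* *sd keeps its stale value */
    *sd = queued;
    return ST_SUCCESS;
}

/* Old test: skip the socket only on EINTR or a NULL socket. */
static int should_process_old(status_t rv, const struct fake_socket *csd)
{
    return !(rv == EINTR || csd == NULL);
}

/* New test: any non-success status means there is no socket to use. */
static int should_process(status_t rv, const struct fake_socket *csd)
{
    return rv == ST_SUCCESS && csd != NULL;
}
```

With a stale non-NULL csd and rv == EINVAL, should_process_old() says
to process the socket while should_process() says to skip it, which is
the case that was segfaulting.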

-- 
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...



Re: anybody seeing socket-related segfaults today?

2002-03-20 Thread Jeff Trawick

Aaron Bannert <[EMAIL PROTECTED]> writes:

> Could this change have interfered with Unix?
> 
> 
>   Modified:    server   listen.c
>   Log:
> Here's the patch that really sucks.  old_listeners points to an array
> of apr_socket objects already destroyed by their cleanups, and in any
> case they now live in invalid memory.  Extend their lifetimes.

I've thought about it a few times :)  I don't see any connection at
the moment though.

I see from your post there's probably some other bad stuff happening
with pools today, or we're both getting bitten by the same problem :)

-- 
Jeff Trawick | [EMAIL PROTECTED]
Born in Roswell... married an alien...



Re: anybody seeing socket-related segfaults today?

2002-03-20 Thread Aaron Bannert

Could this change have interfered with Unix?


  Modified:    server   listen.c
  Log:
Here's the patch that really sucks.  old_listeners points to an array
of apr_socket objects already destroyed by their cleanups, and in any
case they now live in invalid memory.  Extend their lifetimes.

This implies that the process pool grows on every restart for no good
reason.  One possible solution is to let the old pconf survive until
the new pconf is alive.  Another is to create the listeners in a subpool
of process->pool, destroyed after the old_listeners are closed.

Either which way, a better solution exists, but this closes the immediate
bug.  [How haven't we been segfaulting in unix on restarts before this
patch, gurus?]

  Revision  Changes    Path
  1.77      +4 -5      httpd-2.0/server/listen.c
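
The lifetime issue in the log message above can be illustrated with a
toy pool model.  This is not APR and not the listen.c patch: toy_pool,
toy_listener and every function below are hypothetical, and the model
leaves out the parent/child pool hierarchy entirely.  It only shows
why a listener whose cleanup is tied to the old pconf is already
closed when the restart code looks at old_listeners, and how giving
the listeners their own pool avoids that.

```c
#include <stdlib.h>

typedef void (*cleanup_fn)(void *data);

struct toy_cleanup {
    cleanup_fn fn;
    void *data;
    struct toy_cleanup *next;
};

/* A pool here is just a list of cleanups run when it is destroyed. */
struct toy_pool {
    struct toy_cleanup *cleanups;
};

struct toy_listener {
    int open;                    /* 1 while the socket is usable */
};

static void close_listener(void *data)
{
    ((struct toy_listener *)data)->open = 0;
}

static struct toy_pool *toy_pool_create(void)
{
    return calloc(1, sizeof(struct toy_pool));
}

static int toy_cleanup_register(struct toy_pool *p, cleanup_fn fn, void *data)
{
    struct toy_cleanup *c = malloc(sizeof(*c));
    if (c == NULL)
        return -1;
    c->fn = fn;
    c->data = data;
    c->next = p->cleanups;
    p->cleanups = c;
    return 0;
}

/* Run every registered cleanup, then free the pool itself. */
static void toy_pool_destroy(struct toy_pool *p)
{
    struct toy_cleanup *c = p->cleanups;
    while (c != NULL) {
        struct toy_cleanup *next = c->next;
        c->fn(c->data);
        free(c);
        c = next;
    }
    free(p);
}

/* Simulates one restart.  Returns 1 if the old listener is still open
 * after the old pconf is destroyed, 0 if not, -1 on allocation failure. */
static int listener_survives_restart(int use_listener_pool)
{
    struct toy_listener old = { 1 };
    struct toy_pool *pconf = toy_pool_create();
    struct toy_pool *listener_pool = toy_pool_create();
    int still_open = -1;

    if (pconf != NULL && listener_pool != NULL &&
        toy_cleanup_register(use_listener_pool ? listener_pool : pconf,
                             close_listener, &old) == 0) {
        toy_pool_destroy(pconf);     /* restart: old config pool goes away */
        pconf = NULL;
        still_open = old.open;       /* what the restart code would see */
    }
    if (pconf != NULL)
        toy_pool_destroy(pconf);
    if (listener_pool != NULL)
        toy_pool_destroy(listener_pool);  /* after old listeners are done */
    return still_open;
}
```

listener_survives_restart(0) models the bug (the listener dies with
pconf) and listener_survives_restart(1) models the separate-pool idea.
In real APR a subpool is destroyed along with its parent, so the parent
would have to be process->pool as the log message suggests.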

-aaron


On Wed, Mar 20, 2002 at 04:07:23PM -0500, Jeff Trawick wrote:
> Throughout today I've been seeing very intermittent regression
> failures on AIX. The segfault happens when trying to get the IP
> address string from a socket addr.
> 
> core_create_conn() calls apr_socket_addr_get(), which returns
> APR_SUCCESS.  But somehow we have NULL for the returned socket address
> so apr_sockaddr_ip_get() bombs.
> 
> It is intermittent, doesn't seem to matter what kind of request, and
> I've only seen it on a couple of AIX boxes.  Probably a pool misuse of
> some sort :)
> 
> The earliest I saw it happen was 8:00 EST today, but prior to that the
> server wouldn't build on AIX for some hours, so I don't know when the
> problem was introduced/exposed.