Here's what I see concerning the graceful restart problem on Solaris.
The setup uses the prefork MPM with two HTTP listeners. The accept
mutex is pthread.

Short version: child processes that do not manage to acquire the accept
mutex during the graceful restart, and before the next generation of
child processes is started, stay blocked trying to acquire the accept
mutex.

Long version of what happens when a graceful restart is issued:

1) parent calls ap_mpm_pod_killpg for all (here: 6) children
   This quickly produces 6 "OPTIONS *" requests.
2) First child accepts and processes one "OPTIONS *" request
   and then exits
3) Second child gets the accept mutex and calls accept
4) Parent calls ap_mpm_safe_kill with AP_SIG_GRACEFUL for all
   child pids. All children execute the signal handler,
   close the listening sockets and set die_now=1
5) Second child accepts and processes one
   "OPTIONS *" and exits
6) Third child gets the accept mutex lock, sees die_now=1,
   unlocks the lock and exits
7) Three more children still wait for the accept mutex
8) parent starts next generation child processes
9) These new children wait for the accept mutex.
   The mutex is now always acquired by one of the new children.
   First thing they do is work on the remaining 4 "OPTIONS *"
   requests. The remaining old children never get the accept mutex
   and keep hanging.
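
To make steps 3) to 7) concrete, here is a simplified model of one pass
through a child's accept loop. It is only a sketch and not the actual
prefork code: the names accept_mutex_model, die_now_model and
child_step_model are made up, and a plain in-process pthread mutex stands
in for the cross-process accept mutex. What it shows is that the flag is
only examined before and after the lock call, never while waiting in it.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/* Hypothetical stand-ins for the accept mutex and the die_now flag. */
static pthread_mutex_t accept_mutex_model = PTHREAD_MUTEX_INITIALIZER;
static volatile sig_atomic_t die_now_model;

enum child_step { CHILD_EXITS = 0, CHILD_SERVES = 1 };

/* One pass of a simplified child loop. */
enum child_step child_step_model(void)
{
    int rv;

    if (die_now_model)
        return CHILD_EXITS;          /* flag seen before waiting */

    /* The children of step 7) are blocked in this call. */
    rv = pthread_mutex_lock(&accept_mutex_model);
    assert(rv == 0);

    if (die_now_model) {             /* step 6): flag seen once the lock is held */
        pthread_mutex_unlock(&accept_mutex_model);
        return CHILD_EXITS;
    }

    /* accept() on a listener would happen here, under the lock. */
    pthread_mutex_unlock(&accept_mutex_model);
    return CHILD_SERVES;             /* request processing follows */
}
```

In this model a child that is still inside the lock call when the flag is
set can only react after it has been granted the mutex.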

What is strange to me: why doesn't the GRACEFUL signal interrupt the
wait for the accept mutex? Is that expected?

The hanging children sit inside accept_mutex_on(), in its call to
apr_proc_mutex_lock(). This call does not return. Its implementation
looks like it should return when a signal arrives, since we are using a
pthread mutex here.

Truss shows:

23759:  lwp_mutex_timedlock(0xFF0F0000, 0x00000000) (sleeping...)
23759:          mutex type: USYNC_PROCESS|LOCK_PRIO_INHERIT|LOCK_ROBUST
23759:      Received signal #16, SIGUSR1, in lwp_mutex_timedlock() [caught]
23759:  lwp_mutex_timedlock(0xFF0F0000, 0x00000000)     Err#4 EINTR
23759:          mutex type: USYNC_THREAD
23759:  lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
23759:  close(5)                                        = 0
23759:  close(3)                                        = 0
23759:  setcontext(0xFFBFEF40)
23759:  lwp_mutex_timedlock(0xFF0F0000, 0x00000000) (sleeping...)
23759:          mutex type: USYNC_THREAD

So the syscall returns with EINTR, but after the signal handler has
closed the listeners the process calls lwp_mutex_timedlock() again. The
upper-level apr_proc_mutex_lock() call does not return.
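
For comparison, POSIX specifies that pthread_mutex_lock() never fails
with EINTR: a thread that receives a signal while waiting for a mutex
resumes waiting once the handler returns. The following is a minimal
sketch of that behaviour, assuming two threads in one process and a
default mutex rather than the process-shared robust mutex shown in the
truss output. All names in it are made up for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int handler_ran;    /* set by the signal handler */
static atomic_int lock_returned;  /* set once pthread_mutex_lock() returns */

static void on_usr1(int sig)
{
    (void)sig;
    atomic_store(&handler_ran, 1);
}

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void *waiter(void *arg)
{
    int rv;

    (void)arg;
    rv = pthread_mutex_lock(&demo_mutex);  /* blocks: the main thread holds it */
    atomic_store(&lock_returned, 1);
    if (rv == 0)
        pthread_mutex_unlock(&demo_mutex);
    return NULL;
}

/* Returns 1 if the lock call returned while the mutex was still held
 * by the main thread (that is, the signal interrupted it), else 0. */
int lock_returned_during_signal(void)
{
    struct sigaction sa;
    pthread_t tid;
    int rv, seen;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;               /* installed without SA_RESTART */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    atomic_store(&handler_ran, 0);
    atomic_store(&lock_returned, 0);

    pthread_mutex_lock(&demo_mutex);
    rv = pthread_create(&tid, NULL, waiter, NULL);
    assert(rv == 0);

    sleep_ms(100);                         /* let the waiter block in the lock */
    pthread_kill(tid, SIGUSR1);            /* signal the waiting thread */
    while (!atomic_load(&handler_ran))
        sleep_ms(1);
    sleep_ms(100);                         /* give the lock call time to return */

    seen = atomic_load(&lock_returned);    /* sampled while we still hold the mutex */
    pthread_mutex_unlock(&demo_mutex);
    pthread_join(tid, NULL);
    return seen;
}
```

The handler runs, but the waiter only gets out of pthread_mutex_lock()
after the holder unlocks, which matches the retry seen after setcontext()
in the truss output.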

Any information on which of the steps 1)-9) above looks broken would be
appreciated.

If I add a short delay between the "OPTIONS *" requests and the
ap_mpm_safe_kill call, all old children process one of those requests
and then set die_now to 1 because they see that there's a new
generation. Then they actually exit.

Rainer
