Hello,

I'm reporting this apparent bug on this mailing list rather than in the bug 
tracker because it does not concern the newest Apache 2.4.x (which the 
developers require for bug reports) but the last version of the 2.2.x branch, 
and maybe others have met the same problem as we did. However, since 2.2 and 
2.4 implement thread communication similarly, using the POD (pipe of death) 
and signal handling, we cannot rule out that it also occurs with the newest 
2.4.x versions on our platform.

Our server, which handles approx. 3.5 million requests/day, suffers from 
occasional OOM killer events caused by Apache processes that did not exit 
properly after reaching the MaxRequestsPerChild limit and thus keep eating tons 
of RAM. After a short investigation we found out that the cause is "half-dead" 
lingering Apache processes consisting of a single thread (the "child main 
thread" in the mpm_worker implementation) waiting indefinitely in the read() 
syscall.

Here is the stack obtained by gstack <pid>:

#0  0x00007fbbaaa0f3fd in read () from /lib64/libpthread.so.0
#1  0x0000000000454e30 in ap_mpm_pod_check ()
#2  0x0000000000429e60 in child_main ()
#3  0x0000000000453902 in make_child ()
#4  0x000000000045398b in startup_children ()
#5  0x0000000000454201 in ap_mpm_run ()
#6  0x000000000042a9c0 in main ()

All other threads (workers, listener, etc.) are gone, but NOT this one. Thus 
the process's resources are still held in memory.
If I understand the communication between the child's threads correctly, the 
listener thread wakes up the main thread (which is blocked in 
ap_mpm_pod_check()/read(), waiting for messages from the parent process) when 
the MaxRequestsPerChild limit is reached.
As the worker.c source code shows, the listener thread tries to notify the 
child main thread by sending SIGTERM:

(excerpt)
    ...
    ap_close_listeners();
    ap_queue_term(worker_queue);
    dying = 1;
    ap_scoreboard_image->parent[process_slot].quiescing = 1;

    /* wake up the main thread */
    kill(ap_my_pid, SIGTERM);   /* <-- in our case this does not have the
                                 * intended effect: the main thread stays
                                 * stuck in read() inside ap_mpm_pod_check() */

    apr_thread_exit(thd, APR_SUCCESS);
    return NULL;
}

So kill(ap_my_pid, SIGTERM) fails to interrupt the read() syscall, which should 
return with EINTR, so that the main thread leaves ap_mpm_pod_check(), breaks 
out of the loop in child_main() and finally exits.
But this does not happen. It should, since the child main thread is the only 
one that has SIGTERM UNBLOCKED and should therefore receive it. I do not know 
why it behaves this way.
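
To illustrate the mechanism I expect, here is a minimal standalone sketch. It 
is not Apache code: read_until_sigterm(), notifier() and dummy_handler() are 
hypothetical names, and a plain pipe stands in for the POD. It assumes a 
handler installed without SA_RESTART and SIGTERM blocked in every thread except 
the main one.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static atomic_int read_returned;

/* Empty handler: its only job is to make a blocking read() fail with EINTR. */
static void dummy_handler(int sig) { (void)sig; }

/* Stand-in for the listener thread. SIGTERM is blocked here (the mask is
 * inherited from the creator), so the process-directed signal has to be
 * delivered to the main thread. It is re-sent until read() has returned,
 * which avoids a race if the first signal arrives before read() is entered. */
static void *notifier(void *arg)
{
    struct timespec pause = { 0, 50 * 1000 * 1000 };   /* 50 ms */
    (void)arg;
    while (!atomic_load(&read_returned)) {
        nanosleep(&pause, NULL);
        kill(getpid(), SIGTERM);
    }
    return NULL;
}

/* Returns the errno of the interrupted read(), 0 if read() did not fail,
 * or -1 if the setup failed. */
int read_until_sigterm(void)
{
    int fds[2], err = 0;
    struct sigaction sa;
    sigset_t term;
    pthread_t tid;
    char c;

    if (pipe(fds) == -1)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = dummy_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* deliberately no SA_RESTART */
    if (sigaction(SIGTERM, &sa, NULL) == -1)
        return -1;

    /* Block SIGTERM, create the thread (it inherits the blocked mask),
     * then unblock SIGTERM in the main thread only. */
    sigemptyset(&term);
    sigaddset(&term, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &term, NULL);
    atomic_store(&read_returned, 0);
    if (pthread_create(&tid, NULL, notifier, NULL) != 0)
        return -1;
    pthread_sigmask(SIG_UNBLOCK, &term, NULL);

    /* Nobody ever writes to the pipe, so only a signal can end this read(). */
    if (read(fds[0], &c, 1) == -1)
        err = errno;

    atomic_store(&read_returned, 1);
    pthread_join(tid, NULL);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Built with -pthread, I would expect read_until_sigterm() to return EINTR. If 
the signal were delivered to a thread other than the one sitting in read(), 
or the handler were installed with SA_RESTART, read() would keep blocking.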

Maybe this is a bug specific to a particular glibc/Linux kernel version?

We had to apply a dirty patch: make the read end of the POD pipe non-blocking 
and insert a short sleep into the loop inside child_main() so that it does not 
hog the CPU:

child_main()
...
        while (1) {
            rv = ap_mpm_pod_check(pod);
            if (rv == AP_NORESTART) {
                /* see if termination was triggered while we slept */
                switch(terminate_mode) {
                case ST_GRACEFUL:
                    rv = AP_GRACEFUL;
                    break;
                case ST_UNGRACEFUL:
                    rv = AP_RESTART;
                    break;
                }
            }
            if (rv == AP_GRACEFUL || rv == AP_RESTART) {
                /* make sure the start thread has finished;
                 * signal_threads() and join_workers depend on that
                 */
                join_start_thread(start_thread_id);
                signal_threads(rv == AP_GRACEFUL ? ST_GRACEFUL : ST_UNGRACEFUL);
                break;
            }
            sleep(1);   /* sleep for a while; any unblocked signal still
                         * wakes us up early */
        }


Yes, it is ugly and dirty, but we needed to restore stable server behavior. 
When the master process writes to the POD, the child main thread may take up 
to one second to react because of the sleep.
With this patch Apache works as expected and recycles the child processes 
after MaxRequestsPerChild requests.
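
The non-blocking part of the patch is not shown above. The following is only a 
rough sketch of the idea in plain POSIX calls, not the actual change we made 
to Apache: set_nonblocking(), check_pipe_once() and wait_for_pipe_byte() are 
hypothetical helpers, and the stop flag stands in for terminate_mode.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* One non-blocking check of the pipe.
 * Returns 1 if a byte was read into *c, 0 if nothing is pending,
 * -1 on EOF or on any other error. */
int check_pipe_once(int fd, char *c)
{
    ssize_t n = read(fd, c, 1);
    if (n == 1)
        return 1;
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 0;
    return -1;
}

/* Poll the pipe until a byte arrives, an error occurs, or *stop becomes
 * non-zero (for example, set by a signal handler). sleep() returns early
 * when a handled signal arrives, so a signal still shortens the wait;
 * a byte written to the pipe is noticed within about one second. */
int wait_for_pipe_byte(int fd, volatile sig_atomic_t *stop, char *c)
{
    for (;;) {
        int rc = check_pipe_once(fd, c);
        if (rc != 0)
            return rc;
        if (*stop)
            return 0;
        sleep(1);
    }
}
```

The one-second latency comes purely from the sleep(1); waiting on the 
descriptor with poll() and a timeout instead would probably remove it, but we 
have not tried that.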

Our MPM configuration:

<IfModule worker.c>
ServerLimit          4
ThreadLimit          256
StartServers         2
MinSpareThreads      128
MaxSpareThreads      384
MaxClients           1024
ThreadsPerChild      256
MaxRequestsPerChild  10000
MaxMemFree           2048
</IfModule>

We are not using any Apache modules other than the bundled ones and PHP 
(libphp5.so).

Kernel version: 3.10.63-1
Glibc: glibc-2.17-4.4.1.x86_64
Apache: 2.2.34, with bundled APR/APR-util.


Has anybody had the same experience, or does anyone have a suggestion as to 
what could be wrong?

Jiri Fartak


