Re: pytest results for 2.4.59
Hi Jean-Frederic and all,

you didn't write at what point in time you took the thread dump. I always see the SIGTERM messages logged during the last test in each group (http2, md, ...), simply because that is when teardown checks the logs for error messages. By the time the test complains, the server has already started to kill the children, and at least during my test runs it succeeds in killing them (I think). So finding a good point in time to attach the debugger and see the right situation might not be easy?

When you say Yann's patch helps, does that mean in particular that there are no more SIGTERM messages in the logs, i.e. no more failing teardown checks?

Best regards,

Rainer

On 06.04.24 at 17:32, jean-frederic clere wrote:
> On 4/6/24 13:10, Yann Ylavic wrote:
> > On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere wrote:
> > > On 4/5/24 07:55, Ruediger Pluem wrote:
> > > > Are you able to provide a stacktrace of the hanging process (thread
> > > > apply all bt full)?
> > >
> > > It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
> > > older versions of fedora will return 3 (I have tried fc28)
> >
> > If pthread_kill() does not work we probably should use the global
> > "dying" variable like in mpm_event.
> > But it's not clear from your earlier "bt full" whether there are other
> > threads, could you try "thread apply all bt full" instead to show all
> > the threads?
>
> (gdb) thread apply all bt full
>
> Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
> #0  0x7ffbf429b087 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at ../sysdeps/unix/sysv/linux/select.c:69
>         sc_ret = -4
>         sc_cancel_oldtype = 0
>         sc_ret = <optimized out>
>         s = <optimized out>
>         us = <optimized out>
>         ns = <optimized out>
>         ts64 = {tv_sec = 0, tv_nsec = 155950744}
>         pts64 = 0x7fff56cbb050
>         r = <optimized out>
> #1  0x7ffbf43d9d92 in apr_sleep (t=t@entry=50) at time/unix/time.c:249
>         tv = {tv_sec = 0, tv_usec = 50}
> #2  0x00440733 in join_workers (listener=0x87c170, threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069
>         iter = 7
>         i = <optimized out>
>         rv = <optimized out>
>         thread_rv = 0
> #3  0x004412d9 in child_main (child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0) at worker.c:1310
>         threads = 0x91e150
>         rv = 1
>         ts = 0x815a78
>         thread_attr = 0x815a98
>         start_thread_id = 0x815b08
>         i = <optimized out>
> #4  0x0044161a in make_child (s=0x818d00, slot=slot@entry=0, bucket=0) at worker.c:1376
>         pid = 0
> #5  0x004416be in startup_children (number_to_start=3) at worker.c:1403
>         i = 0
> #6  0x004428f9 in worker_run (_pconf=<optimized out>, plog=0x81b998, s=0x818d00) at worker.c:1928
>         listen_buckets = 0x875480
>         num_buckets = 1
>         remaining_children_to_start = <optimized out>
>         rv = <optimized out>
>         id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
>         i = <optimized out>
> #7  0x00456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8, plog=0x81b998, s=0x818d00) at mpm_common.c:102
>         pHook = <optimized out>
>         n = 0
>         rv = -1
> #8  0x0043350e in main (argc=<optimized out>, argv=<optimized out>) at main.c:882
>         c = 102 'f'
>         showcompile = <optimized out>
>         showdirectives = <optimized out>
>         confname = <optimized out>
>         def_server_root = <optimized out>
>         temp_error_log = <optimized out>
>         error = <optimized out>
>         process = 0x7ea4c8
>         pconf = 0x7ec3e8
>         plog = 0x81b998
>         ptemp = 0x815678
>         pcommands = <optimized out>
>         opt = 0x810ef0
>         rv = <optimized out>
>         mod = <optimized out>
>         opt_arg = 0x7fff56cbcb64 "/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
>         signal_server = <optimized out>
>         rc = <optimized out>
> (gdb)
>
> I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the
> ap_log_error(), as it is not easy to get a core otherwise.
>
> > It's clear from the main thread's backtrace that it's waiting for the
> > listener in the "iter" loop, but nothing tells if the listener already
> > exited or not.
> > The listener for instance could be waiting indefinitely in
> > apr_pollset_poll() at this point, and since there is no pollset wakeup
> > in mpm_worker I don't think that wakeup_listener() can help here.
>
> According to my tests using assert(0) in join_workers() in different
> locations, the listener thread is stopped by wakeup_listener(), but
> pthread_kill() doesn't report that.
>
> > So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
> > too, like in mpm_event.
> > Overall something like the attached patch?
>
> Yes, the attached patch helps.
>
> > Regards;
> > Yann.
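For context on the pthread_kill(t, 0) point above: a minimal standalone C sketch (not from the thread; assumes Linux/glibc, build with cc -pthread) that reproduces the behavior difference being discussed. On glibc 2.35 and later, pthread_kill() on an exited-but-not-yet-joined thread returns 0; older systems typically return ESRCH (3), matching the fc28 observation.

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *exit_immediately(void *arg)
    {
        (void)arg;
        return NULL;            /* the thread terminates right away */
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, exit_immediately, NULL);
        sleep(1);               /* give the thread ample time to exit */

        /* sig 0: nothing is delivered, only "is this thread alive?" is asked.
         * glibc >= 2.35 answers 0 (success) for an exited-but-unjoined thread;
         * older systems typically answer 3 (ESRCH). */
        printf("pthread_kill(t, 0) = %d\n", pthread_kill(t, 0));

        pthread_join(t, NULL);
        return 0;
    }

This is why a liveness check based on pthread_kill(t, 0) can spin forever waiting for a thread that is already gone.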
Re: pytest results for 2.4.59
On 4/6/24 13:10, Yann Ylavic wrote:
> On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere wrote:
> > On 4/5/24 07:55, Ruediger Pluem wrote:
> > > Are you able to provide a stacktrace of the hanging process (thread
> > > apply all bt full)?
> >
> > It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
> > older versions of fedora will return 3 (I have tried fc28)
>
> If pthread_kill() does not work we probably should use the global
> "dying" variable like in mpm_event.
> But it's not clear from your earlier "bt full" whether there are other
> threads, could you try "thread apply all bt full" instead to show all
> the threads?

(gdb) thread apply all bt full

Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
#0  0x7ffbf429b087 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at ../sysdeps/unix/sysv/linux/select.c:69
        sc_ret = -4
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        s = <optimized out>
        us = <optimized out>
        ns = <optimized out>
        ts64 = {tv_sec = 0, tv_nsec = 155950744}
        pts64 = 0x7fff56cbb050
        r = <optimized out>
#1  0x7ffbf43d9d92 in apr_sleep (t=t@entry=50) at time/unix/time.c:249
        tv = {tv_sec = 0, tv_usec = 50}
#2  0x00440733 in join_workers (listener=0x87c170, threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069
        iter = 7
        i = <optimized out>
        rv = <optimized out>
        thread_rv = 0
#3  0x004412d9 in child_main (child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0) at worker.c:1310
        threads = 0x91e150
        rv = 1
        ts = 0x815a78
        thread_attr = 0x815a98
        start_thread_id = 0x815b08
        i = <optimized out>
#4  0x0044161a in make_child (s=0x818d00, slot=slot@entry=0, bucket=0) at worker.c:1376
        pid = 0
#5  0x004416be in startup_children (number_to_start=3) at worker.c:1403
        i = 0
#6  0x004428f9 in worker_run (_pconf=<optimized out>, plog=0x81b998, s=0x818d00) at worker.c:1928
        listen_buckets = 0x875480
        num_buckets = 1
        remaining_children_to_start = <optimized out>
        rv = <optimized out>
        id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
        i = <optimized out>
#7  0x00456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8, plog=0x81b998, s=0x818d00) at mpm_common.c:102
        pHook = <optimized out>
        n = 0
        rv = -1
#8  0x0043350e in main (argc=<optimized out>, argv=<optimized out>) at main.c:882
        c = 102 'f'
        showcompile = <optimized out>
        showdirectives = <optimized out>
        confname = <optimized out>
        def_server_root = <optimized out>
        temp_error_log = <optimized out>
        error = <optimized out>
        process = 0x7ea4c8
        pconf = 0x7ec3e8
        plog = 0x81b998
        ptemp = 0x815678
        pcommands = <optimized out>
        opt = 0x810ef0
        rv = <optimized out>
        mod = <optimized out>
        opt_arg = 0x7fff56cbcb64 "/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"
        signal_server = <optimized out>
        rc = <optimized out>
(gdb)

I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the ap_log_error(), as it is not easy to get a core otherwise.

> It's clear from the main thread's backtrace that it's waiting for the
> listener in the "iter" loop, but nothing tells if the listener already
> exited or not.
> The listener for instance could be waiting indefinitely in
> apr_pollset_poll() at this point, and since there is no pollset wakeup
> in mpm_worker I don't think that wakeup_listener() can help here.

According to my tests using assert(0) in join_workers() in different locations, the listener thread is stopped by wakeup_listener(), but pthread_kill() doesn't report that.

> So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
> too, like in mpm_event.
> Overall something like the attached patch?

Yes, the attached patch helps.

> Regards;
> Yann.

--
Cheers

Jean-Frederic
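To make the "dying" idea concrete: a schematic of the shutdown handshake being discussed, where the listener sets a global flag right before exiting and the joining thread polls that flag instead of asking pthread_kill() whether the thread is still alive. This is a paraphrase of the approach (as mpm_event does it and as Yann's patch adds to mpm_worker), not the actual worker.c source; volatile mirrors the patch, though new code might prefer C11 atomics.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int listener_may_exit = 0;  /* set by wakeup_listener() */
    static volatile int dying = 0;              /* set by the listener on exit */

    static void *listener_thread(void *arg)
    {
        (void)arg;
        while (!listener_may_exit) {
            usleep(100);        /* stands in for the blocking poll loop */
        }
        dying = 1;              /* announce the exit, then terminate */
        return NULL;
    }

    int main(void)
    {
        pthread_t listener;
        int iter;

        pthread_create(&listener, NULL, listener_thread, NULL);
        listener_may_exit = 1;  /* what wakeup_listener() does */

        /* join_workers()-style bounded wait: poll the flag with short
         * sleeps; httpd logs AH00276 if the listener never announces
         * its exit within the grace period. */
        for (iter = 0; !dying && iter < 10; iter++) {
            usleep(1000);
        }
        if (!dying) {
            fprintf(stderr, "the listener thread didn't exit\n");
        }
        pthread_join(listener, NULL);
        return 0;
    }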
Re: pytest results for 2.4.59
On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere wrote:
>
> On 4/5/24 07:55, Ruediger Pluem wrote:
> >
> > Are you able to provide a stacktrace of the hanging process (thread apply
> > all bt full)?
>
> It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
> older versions of fedora will return 3 (I have tried fc28)

If pthread_kill() does not work we probably should use the global "dying" variable like in mpm_event.
But it's not clear from your earlier "bt full" whether there are other threads, could you try "thread apply all bt full" instead to show all the threads?

It's clear from the main thread's backtrace that it's waiting for the listener in the "iter" loop, but nothing tells if the listener already exited or not.
The listener for instance could be waiting indefinitely in apr_pollset_poll() at this point, and since there is no pollset wakeup in mpm_worker I don't think that wakeup_listener() can help here.

So maybe we need to add an apr_pollset_wakeup() in wakeup_listener() too, like in mpm_event. Overall something like the attached patch?

Regards;
Yann.

Index: server/mpm/worker/worker.c
===================================================================
--- server/mpm/worker/worker.c	(revision 1916768)
+++ server/mpm/worker/worker.c	(working copy)
@@ -125,10 +125,11 @@ static int max_workers = 0;
 static int server_limit = 0;
 static int thread_limit = 0;
 static int had_healthy_child = 0;
-static int dying = 0;
+static volatile int dying = 0;
 static int workers_may_exit = 0;
 static int start_thread_may_exit = 0;
 static int listener_may_exit = 0;
+static int listener_is_wakeable = 0;    /* Pollset supports APR_POLLSET_WAKEABLE */
 static int requests_this_child;
 static int num_listensocks = 0;
 static int resource_shortage = 0;
@@ -272,6 +273,15 @@ static void close_worker_sockets(void)
 static void wakeup_listener(void)
 {
     listener_may_exit = 1;
+
+    /* Unblock the listener if it's poll()ing */
+    if (worker_pollset && listener_is_wakeable) {
+        apr_pollset_wakeup(worker_pollset);
+    }
+
+    /* unblock the listener if it's waiting for a worker */
+    ap_queue_info_term(worker_queue_info);
+
     if (!listener_os_thread) {
         /* XXX there is an obscure path that this doesn't handle perfectly:
          * right after listener thread is created but before
@@ -280,10 +290,6 @@ static void wakeup_listener(void)
          */
         return;
     }
-
-    /* unblock the listener if it's waiting for a worker */
-    ap_queue_info_term(worker_queue_info);
-
     /*
      * we should just be able to "kill(ap_my_pid, LISTENER_SIGNAL)" on all
      * platforms and wake up the listener thread since it is the only thread
@@ -716,6 +722,7 @@ static void * APR_THREAD_FUNC listener_thread(apr_
     ap_close_listeners_ex(my_bucket->listeners);
     ap_queue_info_free_idle_pools(worker_queue_info);
     ap_queue_term(worker_queue);
+    dying = 1;
 
     ap_scoreboard_image->parent[process_slot].quiescing = 1;
@@ -861,6 +868,10 @@ static void create_listener_thread(thread_starter
 static void setup_threads_runtime(void)
 {
     ap_listen_rec *lr;
+    int pollset_flags, i;
+    const int good_methods[] = { APR_POLLSET_KQUEUE,
+                                 APR_POLLSET_PORT,
+                                 APR_POLLSET_EPOLL };
     apr_status_t rv;
 
     /* All threads (listener, workers) and synchronization objects (queues,
@@ -894,9 +905,31 @@ static void setup_threads_runtime(void)
     }
 
     /* Create the main pollset */
-    rv = apr_pollset_create(&worker_pollset, num_listensocks, pruntime,
-                            APR_POLLSET_NOCOPY);
+    pollset_flags = APR_POLLSET_NOCOPY | APR_POLLSET_NODEFAULT | APR_POLLSET_WAKEABLE;
+    for (i = 0; i < sizeof(good_methods) / sizeof(good_methods[0]); i++) {
+        rv = apr_pollset_create_ex(&worker_pollset, num_listensocks, pruntime,
+                                   pollset_flags, good_methods[i]);
+        if (rv == APR_SUCCESS) {
+            listener_is_wakeable = 1;
+            break;
+        }
+    }
     if (rv != APR_SUCCESS) {
+        pollset_flags &= ~APR_POLLSET_WAKEABLE;
+        for (i = 0; i < sizeof(good_methods) / sizeof(good_methods[0]); i++) {
+            rv = apr_pollset_create_ex(&worker_pollset, num_listensocks, pruntime,
+                                       pollset_flags, good_methods[i]);
+            if (rv == APR_SUCCESS) {
+                break;
+            }
+        }
+    }
+    if (rv != APR_SUCCESS) {
+        pollset_flags &= ~APR_POLLSET_NODEFAULT;
+        rv = apr_pollset_create(&worker_pollset, num_listensocks, pruntime,
+                                pollset_flags);
+    }
+    if (rv != APR_SUCCESS) {
         ap_log_error(APLOG_MARK, APLOG_EMERG, rv, ap_server_conf, APLOGNO(03285)
                      "Couldn't create pollset in thread;"
                      " check system or user limits");
@@ -1031,19 +1064,17 @@ static void join_workers(apr_threa
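As background on the APR mechanism the patch relies on: a standalone sketch (plain APR 1.x program, not httpd code; build with gcc demo.c $(apr-1-config --includes --link-ld)) of the wakeable-pollset pattern. A pollset created with APR_POLLSET_WAKEABLE owns an internal wakeup pipe, and apr_pollset_wakeup() from another thread makes a blocked apr_pollset_poll() return APR_EINTR.

    #include <apr_general.h>
    #include <apr_pools.h>
    #include <apr_poll.h>
    #include <apr_time.h>
    #include <apr_thread_proc.h>
    #include <stdio.h>

    static void *APR_THREAD_FUNC listener(apr_thread_t *t, void *data)
    {
        apr_pollset_t *ps = data;
        apr_int32_t num;
        const apr_pollfd_t *out;

        /* Blocks indefinitely (timeout -1): with no descriptors armed,
         * only a wakeup can unblock it here. */
        apr_status_t rv = apr_pollset_poll(ps, -1, &num, &out);
        printf("poll returned %d (APR_EINTR == %d)\n", rv, APR_EINTR);
        apr_thread_exit(t, APR_SUCCESS);
        return NULL;
    }

    int main(void)
    {
        apr_pool_t *pool;
        apr_pollset_t *ps;
        apr_thread_t *thr;
        apr_status_t rc;

        apr_initialize();
        apr_pool_create(&pool, NULL);

        /* APR_POLLSET_WAKEABLE adds an internal pipe so that
         * apr_pollset_wakeup() can interrupt a blocked poll. */
        apr_pollset_create(&ps, 8, pool, APR_POLLSET_WAKEABLE);

        apr_thread_create(&thr, NULL, listener, ps, pool);
        apr_sleep(apr_time_from_sec(1));  /* let the thread block in poll */

        apr_pollset_wakeup(ps);           /* the wakeup_listener() analogue */
        apr_thread_join(&rc, thr);

        apr_pool_destroy(pool);
        apr_terminate();
        return 0;
    }

Not every pollset method supports wakeability, which is why the patch tries apr_pollset_create_ex() with APR_POLLSET_WAKEABLE first, falls back to a non-wakeable pollset otherwise, and records the outcome in listener_is_wakeable.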
Re: pytest results for 2.4.59
On 4/5/24 07:55, Ruediger Pluem wrote:
> On 4/5/24 12:59 AM, Rainer Jung wrote:
> > I think I fixed all test failures, hopefully in the correct way. More
> > eyes welcome.
> >
> > I have a few additional sporadic ERRORS:
> >
> > A] ERROR during teardown check for log file errors or warnings (twice):
> >
> > 04.04.2024 21:14:42.205465 ___ ERROR at teardown of TestStatus.test_md_920_020
> > 04.04.2024 21:14:42.205465 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
> > 04.04.2024 21:14:42.205465 E AssertionError: apache logged 1 errors and 0 warnings:
> > 04.04.2024 21:14:42.205465 E [Thu Apr 04 21:12:29.381511 2024] [md:error] [pid 4169] (22)Invalid argument: no certificates in non-empty chain /path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem
> >
> > 04.04.2024 21:03:26.382051 ___ ERROR at teardown of TestStatus.test_md_920_020
> > 04.04.2024 21:03:26.382360 ERROR modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...
> > 04.04.2024 21:03:26.382051 E AssertionError: apache logged 1 errors and 1 warnings:
> > 04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924286 2024] [md:error] [pid 8717:tid 139629962274560] (20014)Internal error (specific information not available): test-md-702-041-1712256790.org: asked to retrieve chain, but no order in context
> > 04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924229 2024] [md:warn] [pid 8717:tid 139629962274560] error generate pkey RSA 3072
> >
> > B] Hanging httpd child processes
> >
> > This happens only on RHEL 9 with the worker MPM and can be noticed by a
> > dramatic slowdown of the tests. There are a lot of messages
> >
> > AH00045: child process 1067703 still did not exit, sending a SIGTERM
> >
> > and
> >
> > AH00276: the listener thread didn't exit
> >
> > It happened in
> >
> > modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]
> > modules/md/test_920_status.py::TestStatus::test_md_920_020
> > modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]
> >
> > but I don't know whether it might also happen elsewhere, because it is
> > sporadic.
> >
> > What I see in the error logs for one hanging child process:
> >
> > - most threads terminate with
> >
> > [Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 140619680433728] ssl_engine_kernel.c(2223): [client 127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
> > [Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686] bio_filter_out_write: flush
> > [Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] ssl_engine_io.c(1146): [client 127.0.0.1:40686] AH02001: Connection closed to child 0 with standard shutdown (server test1.tests.httpd.apache.org:443)
> >
> > - the watchdog thread terminates (?) with
> >
> > [Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] md_reg.c(1163): test-md-810-003a-1712260944.org: staging done
> > [Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] AH10059: The Managed Domain test-md-810-003a-1712260944.org has been setup and changes will be activated on next (graceful) server restart.
> > [Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(229): AH10107: next run in 11 hours 59 minutes 58 seconds
> > [Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] mod_md_drive.c(236): AH10058: md watchdog stopping
> > [Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 140619697219136] mod_watchdog.c(257): AH02973: Singleton Watchdog (_md_renew_) stopping
> >
> > - one worker thread seems not to stop:
> >
> > [Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 140619672041024] protocol.c(714): [client 127.0.0.1:48748] Request received from client: GET /.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
> > [Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] mod_md.c(1385): [client 127.0.0.1:48748] loading challenge for test-md-810-003a-1712260944.org (/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
> > [Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 140619672041024] http_filters.c(1141): [client 127.0.0.1:48748] Response sent with status 200, headers:
> > [Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1150): [client 127.0.0.1:48748] Date: Thu, 04 Apr 2024 20:42:59 GMT
> > [Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 140619672041024] http_filters.c(1153): [client 127.0.0.1:48748] Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
> > [Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748] Content-Length: 88
> > [Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 140619672041024] http_filters.c(971): [client 127.0.0.1:48748] Connection: clo