Re: pytest results for 2.4.59

2024-04-06 Thread Rainer Jung

Hi Jean-Frederic and all,

you didn't say at what point in time you take the thread dump. I see 
the SIGTERM messages logged during test execution always during the last 
test in each group (http2, md, ...), simply because that is when teardown 
checks the logs for error messages. By the time the test complains, it has 
already started to kill the children, and at least during my test runs it 
succeeds in killing them (I think). So finding a good point in time to 
attach the debugger and see the right situation might not be easy?


When you say Yann's patch helps, does that mean in particular that there 
are no more SIGTERM messages in the logs, i.e. no more teardown checks 
failing?


Best regards,

Rainer

Am 06.04.24 um 17:32 schrieb jean-frederic clere:

On 4/6/24 13:10, Yann Ylavic wrote:
On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere 
 wrote:


On 4/5/24 07:55, Ruediger Pluem wrote:


Are you able to provide a stacktrace of the hanging process (thread 
apply all bt full)?


It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
older versions of Fedora return 3, i.e. ESRCH (I have tried fc28)
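
A minimal standalone reproducer of this behaviour could look like the
following (illustration only, not httpd code; the exact return value
depends on the glibc version):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Thread body that returns immediately, so the thread has already
 * terminated (but is not yet joined) when we probe it below. */
static void *thread_fn(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    sleep(1);  /* give the thread time to exit */
    /* Recent glibc returns 0 here even for a terminated thread;
     * older libcs typically return ESRCH (3). */
    printf("pthread_kill(t, 0) = %d\n", pthread_kill(t, 0));
    pthread_join(t, NULL);
    return 0;
}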


If pthread_kill() does not work we probably should use the global
"dying" variable like in mpm_event.
But it's not clear from your earlier "bt full" whether there are other
threads, could you try "thread apply all bt full" instead to show all
the threads?


(gdb) thread apply all bt full

Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
#0  0x7ffbf429b087 in __GI___select (nfds=nfds@entry=0, 
readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, 
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at 
../sysdeps/unix/sysv/linux/select.c:69

     sc_ret = -4
     sc_cancel_oldtype = 0
     sc_ret = 
     s = 
     us = 
     ns = 
     ts64 = {tv_sec = 0, tv_nsec = 155950744}
     pts64 = 0x7fff56cbb050
     r = 
#1  0x7ffbf43d9d92 in apr_sleep (t=t@entry=50) at 
time/unix/time.c:249

     tv = {tv_sec = 0, tv_usec = 50}
#2  0x00440733 in join_workers (listener=0x87c170, 
threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069

     iter = 7
     i = 
     rv = 
     thread_rv = 0
#3  0x004412d9 in child_main 
(child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0) 
at worker.c:1310

     threads = 0x91e150
     rv = 1
     ts = 0x815a78
     thread_attr = 0x815a98
     start_thread_id = 0x815b08
     i = 
#4  0x0044161a in make_child (s=0x818d00, slot=slot@entry=0, 
bucket=0) at worker.c:1376

     pid = 0
#5  0x004416be in startup_children (number_to_start=3) at 
worker.c:1403

     i = 0
#6  0x004428f9 in worker_run (_pconf=, 
plog=0x81b998, s=0x818d00) at worker.c:1928

     listen_buckets = 0x875480
     num_buckets = 1
     remaining_children_to_start = 
     rv = 
     id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
     i = 
#7  0x00456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8, 
plog=0x81b998, s=0x818d00) at mpm_common.c:102

     pHook = 
     n = 0
     rv = -1
#8  0x0043350e in main (argc=<optimized out>, argv=<optimized out>) at main.c:882

     c = 102 'f'
     showcompile = 
--Type <RET> for more, q to quit, c to continue without paging--
     showdirectives = 
     confname = 
     def_server_root = 
     temp_error_log = 
     error = 
     process = 0x7ea4c8
     pconf = 0x7ec3e8
     plog = 0x81b998
     ptemp = 0x815678
     pcommands = 
     opt = 0x810ef0
     rv = 
     mod = 
     opt_arg = 0x7fff56cbcb64 
"/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"

     signal_server = 
     rc = 
(gdb)

I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the 
ap_log_error() as it is not easy to get a core otherwise.
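
For reference, that change looks roughly like this (the AH00045 logging
site in server/mpm_unix.c, presumably in ap_reclaim_child_processes();
the surrounding lines are paraphrased rather than copied, only the added
kill() is the point):

    /* existing AH00045 warning, paraphrased */
    ap_log_error(APLOG_MARK, APLOG_WARNING, 0, ap_server_conf, APLOGNO(00045)
                 "child process %" APR_PID_T_FMT " still did not exit, "
                 "sending a SIGTERM", pid);
    kill(pid, SIGABRT); /* debugging only: make the stuck child dump core */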




It's clear from the main thread's backtrace that it's waiting for the
listener in the "iter" loop, but nothing tells whether the listener has
already exited or not. The listener for instance could be waiting
indefinitely in apr_pollset_poll() at this point, and since there is no
pollset wakeup in mpm_worker I don't think that wakeup_listener() can
help here.


According to my tests using assert(0) at different locations in 
join_workers(), the listener thread is stopped by wakeup_listener(), but 
pthread_kill() doesn't report that.




So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
too, like in mpm_event.

Overall something like the attached patch?


Yes the attached patch helps




Regards;
Yann.


Re: pytest results for 2.4.59

2024-04-06 Thread jean-frederic clere

On 4/6/24 13:10, Yann Ylavic wrote:

On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere  wrote:


On 4/5/24 07:55, Ruediger Pluem wrote:


Are you able to provide a stacktrace of the hanging process (thread apply all 
bt full)?


It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
older versions of Fedora return 3, i.e. ESRCH (I have tried fc28)


If pthread_kill() does not work we probably should use the global
"dying" variable like in mpm_event.
But it's not clear from your earlier "bt full" whether there are other
threads, could you try "thread apply all bt full" instead to show all
the threads?


(gdb) thread apply all bt full

Thread 1 (Thread 0x7ffbf3f5ad40 (LWP 2891875)):
#0  0x7ffbf429b087 in __GI___select (nfds=nfds@entry=0, 
readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, 
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff56cbb0b0) at 
../sysdeps/unix/sysv/linux/select.c:69

sc_ret = -4
sc_cancel_oldtype = 0
sc_ret = 
s = 
us = 
ns = 
ts64 = {tv_sec = 0, tv_nsec = 155950744}
pts64 = 0x7fff56cbb050
r = 
#1  0x7ffbf43d9d92 in apr_sleep (t=t@entry=50) at 
time/unix/time.c:249

tv = {tv_sec = 0, tv_usec = 50}
#2  0x00440733 in join_workers (listener=0x87c170, 
threads=threads@entry=0x91e150, mode=mode@entry=2) at worker.c:1069

iter = 7
i = 
rv = 
thread_rv = 0
#3  0x004412d9 in child_main 
(child_num_arg=child_num_arg@entry=0, child_bucket=child_bucket@entry=0) 
at worker.c:1310

threads = 0x91e150
rv = 1
ts = 0x815a78
thread_attr = 0x815a98
start_thread_id = 0x815b08
i = 
#4  0x0044161a in make_child (s=0x818d00, slot=slot@entry=0, 
bucket=0) at worker.c:1376

pid = 0
#5  0x004416be in startup_children (number_to_start=3) at 
worker.c:1403

i = 0
#6  0x004428f9 in worker_run (_pconf=, 
plog=0x81b998, s=0x818d00) at worker.c:1928

listen_buckets = 0x875480
num_buckets = 1
remaining_children_to_start = 
rv = 
id = "0\000\000\000\000\000\000\000\t\000\000\000\000\000\000"
i = 
#7  0x00456930 in ap_run_mpm (pconf=pconf@entry=0x7ec3e8, 
plog=0x81b998, s=0x818d00) at mpm_common.c:102

pHook = 
n = 0
rv = -1
#8  0x0043350e in main (argc=<optimized out>, argv=<optimized out>) at main.c:882

c = 102 'f'
showcompile = 
--Type <RET> for more, q to quit, c to continue without paging--
showdirectives = 
confname = 
def_server_root = 
temp_error_log = 
error = 
process = 0x7ea4c8
pconf = 0x7ec3e8
plog = 0x81b998
ptemp = 0x815678
pcommands = 
opt = 0x810ef0
rv = 
mod = 
opt_arg = 0x7fff56cbcb64 
"/home/jfclere/httpd-trunk/test/pyhttpd/../gen/apache/conf/httpd.conf"

signal_server = 
rc = 
(gdb)

I have added a kill(pid, SIGABRT); in server/mpm_unix.c after the 
ap_log_error() as it is not easy to get a core otherwise.




It's clear from the main thread's backtrace that it's waiting for the
listener in the "iter" loop, but nothing tells whether the listener has
already exited or not. The listener for instance could be waiting
indefinitely in apr_pollset_poll() at this point, and since there is no
pollset wakeup in mpm_worker I don't think that wakeup_listener() can
help here.


According to my tests using assert(0) at different locations in 
join_workers(), the listener thread is stopped by wakeup_listener(), but 
pthread_kill() doesn't report that.




So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
too, like in mpm_event.

Overall something like the attached patch?


Yes the attached patch helps




Regards;
Yann.


--
Cheers

Jean-Frederic



Re: pytest results for 2.4.59

2024-04-06 Thread Yann Ylavic
On Sat, Apr 6, 2024 at 10:46 AM jean-frederic clere  wrote:
>
> On 4/5/24 07:55, Ruediger Pluem wrote:
> >
> > Are you able to provide a stacktrace of the hanging process (thread apply 
> > all bt full)?
>
> It seems pthread_kill(t, 0) returns 0 even if the thread t has exited...
> older versions of Fedora return 3, i.e. ESRCH (I have tried fc28)

If pthread_kill() does not work we probably should use the global
"dying" variable like in mpm_event.
But it's not clear from your earlier "bt full" whether there are other
threads, could you try "thread apply all bt full" instead to show all
the threads?
It's clear from the main thread's backtrace that it's waiting for the
listener in the "iter" loop, but nothing tells whether the listener has
already exited or not. The listener for instance could be waiting
indefinitely in apr_pollset_poll() at this point, and since there is no
pollset wakeup in mpm_worker I don't think that wakeup_listener() can
help here.
So maybe we need to add an apr_pollset_wakeup() in wakeup_listener()
too, like in mpm_event.

Overall something like the attached patch?


Regards;
Yann.
Index: server/mpm/worker/worker.c
===================================================================
--- server/mpm/worker/worker.c	(revision 1916768)
+++ server/mpm/worker/worker.c	(working copy)
@@ -125,10 +125,11 @@ static int max_workers = 0;
 static int server_limit = 0;
 static int thread_limit = 0;
 static int had_healthy_child = 0;
-static int dying = 0;
+static volatile int dying = 0;
 static int workers_may_exit = 0;
 static int start_thread_may_exit = 0;
 static int listener_may_exit = 0;
+static int listener_is_wakeable = 0; /* Pollset supports APR_POLLSET_WAKEABLE */
 static int requests_this_child;
 static int num_listensocks = 0;
 static int resource_shortage = 0;
@@ -272,6 +273,15 @@ static void close_worker_sockets(void)
 static void wakeup_listener(void)
 {
     listener_may_exit = 1;
+
+    /* Unblock the listener if it's poll()ing */
+    if (worker_pollset && listener_is_wakeable) {
+        apr_pollset_wakeup(worker_pollset);
+    }
+
+    /* unblock the listener if it's waiting for a worker */
+    ap_queue_info_term(worker_queue_info);
+
     if (!listener_os_thread) {
         /* XXX there is an obscure path that this doesn't handle perfectly:
          * right after listener thread is created but before
@@ -280,10 +290,6 @@ static void wakeup_listener(void)
          */
         return;
     }
-
-    /* unblock the listener if it's waiting for a worker */
-    ap_queue_info_term(worker_queue_info);
-
     /*
      * we should just be able to "kill(ap_my_pid, LISTENER_SIGNAL)" on all
      * platforms and wake up the listener thread since it is the only thread
@@ -716,6 +722,7 @@ static void * APR_THREAD_FUNC listener_thread(apr_
     ap_close_listeners_ex(my_bucket->listeners);
     ap_queue_info_free_idle_pools(worker_queue_info);
     ap_queue_term(worker_queue);
+
     dying = 1;
     ap_scoreboard_image->parent[process_slot].quiescing = 1;
 
@@ -861,6 +868,10 @@ static void create_listener_thread(thread_starter
 static void setup_threads_runtime(void)
 {
     ap_listen_rec *lr;
+    int pollset_flags, i;
+    const int good_methods[] = { APR_POLLSET_KQUEUE,
+                                 APR_POLLSET_PORT,
+                                 APR_POLLSET_EPOLL };
     apr_status_t rv;
 
     /* All threads (listener, workers) and synchronization objects (queues,
@@ -894,9 +905,31 @@ static void setup_threads_runtime(void)
     }
 
     /* Create the main pollset */
-    rv = apr_pollset_create(&worker_pollset, num_listensocks, pruntime,
-                            APR_POLLSET_NOCOPY);
+    pollset_flags = APR_POLLSET_NOCOPY | APR_POLLSET_NODEFAULT | APR_POLLSET_WAKEABLE;
+    for (i = 0; i < sizeof(good_methods) / sizeof(good_methods[0]); i++) {
+        rv = apr_pollset_create_ex(&worker_pollset, num_listensocks, pruntime,
+                                   pollset_flags, good_methods[i]);
+        if (rv == APR_SUCCESS) {
+            listener_is_wakeable = 1;
+            break;
+        }
+    }
     if (rv != APR_SUCCESS) {
+        pollset_flags &= ~APR_POLLSET_WAKEABLE;
+        for (i = 0; i < sizeof(good_methods) / sizeof(good_methods[0]); i++) {
+            rv = apr_pollset_create_ex(&worker_pollset, num_listensocks, pruntime,
+                                       pollset_flags, good_methods[i]);
+            if (rv == APR_SUCCESS) {
+                break;
+            }
+        }
+    }
+    if (rv != APR_SUCCESS) {
+        pollset_flags &= ~APR_POLLSET_NODEFAULT;
+        rv = apr_pollset_create(&worker_pollset, num_listensocks, pruntime,
+                                pollset_flags);
+    }
+    if (rv != APR_SUCCESS) {
         ap_log_error(APLOG_MARK, APLOG_EMERG, rv, ap_server_conf, APLOGNO(03285)
                      "Couldn't create pollset in thread;"
                      " check system or user limits");
@@ -1031,19 +1064,17 @@ static void join_workers(apr_threa
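
The join_workers() hunk is truncated above; presumably it replaces the
pthread_kill() probe with a check on the now-volatile "dying" flag that
listener_thread() sets on its way out. A rough sketch of such a loop,
not the actual hunk (names follow worker.c, but the committed change may
differ):

    iter = 0;
    while (iter < 10 && !dying) {
        /* listener not dead yet: poke it again and give it a moment */
        wakeup_listener();
        apr_sleep(apr_time_make(0, 500000));
        ++iter;
    }
    if (!dying) {
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, ap_server_conf, APLOGNO(00276)
                     "the listener thread didn't exit");
    }

The listener_is_wakeable flag added in setup_threads_runtime() records
whether a wakeup-capable pollset could be created, so wakeup_listener()
only calls apr_pollset_wakeup() when it will actually work; otherwise the
old signal-based wakeup remains the fallback.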

Re: pytest results for 2.4.59

2024-04-06 Thread jean-frederic clere

On 4/5/24 07:55, Ruediger Pluem wrote:



On 4/5/24 12:59 AM, Rainer Jung wrote:

I think I fixed all test failures, hopefully in the correct way. More eyes 
welcome.

I have a few additional sporadic ERRORS:

A] ERROR during teardown check for log file errors or warnings (twice):

04.04.2024 21:14:42.205465 ___ ERROR at teardown of 
TestStatus.test_md_920_020 
04.04.2024 21:14:42.205465 ERROR 
modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...

04.04.2024 21:14:42.205465 E   AssertionError: apache logged 1 errors and 0 
warnings:
04.04.2024 21:14:42.205465 E [Thu Apr 04 21:12:29.381511 2024] 
[md:error] [pid 4169] (22)Invalid argument: no certificates
in non-empty chain 
/path/to/gen/apache/md/staging/one.test.test-md-702-070-1712257797.org/pubcert.pem


04.04.2024 21:03:26.382051 ___ ERROR at teardown of 
TestStatus.test_md_920_020 
04.04.2024 21:03:26.382360 ERROR 
modules/md/test_920_status.py::TestStatus::test_md_920_020 - AssertionE...

04.04.2024 21:03:26.382051 E   AssertionError: apache logged 1 errors and 1 
warnings:
04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924286 2024] 
[md:error] [pid 8717:tid 139629962274560] (20014)Internal
error (specific information not available): test-md-702-041-1712256790.org: 
asked to retrieve chain, but no order in context
04.04.2024 21:03:26.382051 E [Thu Apr 04 21:00:48.924229 2024] 
[md:warn] [pid 8717:tid 139629962274560] error generate
pkey RSA 3072

B] Hanging httpd child processes

This happens only on RHEL 9 with worker MPM and can be noticed by a dramatic 
slowdown of the tests. There are a lot of messages

AH00045: child process 1067703 still did not exit, sending a SIGTERM

and

AH00276: the listener thread didn't exit

It happened in

modules/core/test_001_encoding.py::TestEncoding::test_core_001_20[test2-/10%25abnormal.txt-200]

modules/md/test_920_status.py::TestStatus::test_md_920_020

modules/proxy/test_02_unix.py::TestProxyUds::test_proxy_02_003[mixed-500]

but I don't know whether it might happen elsewhere as well, because it is 
sporadic.

What I see in the error logs for one hanging child process:

- most threads terminate with

[Thu Apr 04 22:42:59.617953 2024] [ssl:trace3] [pid 1067703:tid 
140619680433728] ssl_engine_kernel.c(2223): [client
127.0.0.1:40686] OpenSSL: Write: SSL negotiation finished successfully
[Thu Apr 04 22:42:59.617972 2024] [ssl:trace6] [pid 1067703:tid 
140619680433728] ssl_engine_io.c(154): [client 127.0.0.1:40686]
bio_filter_out_write: flush
[Thu Apr 04 22:42:59.617981 2024] [ssl:debug] [pid 1067703:tid 140619680433728] 
ssl_engine_io.c(1146): [client 127.0.0.1:40686]
AH02001: Connection closed to child 0 with standard shutdown (server 
test1.tests.httpd.apache.org:443)

- watchdog thread terminates (?) with

[Thu Apr 04 22:43:00.902666 2024] [md:debug] [pid 1067703:tid 140619697219136] 
md_reg.c(1163): test-md-810-003a-1712260944.org:
staging done
[Thu Apr 04 22:43:00.903951 2024] [md:notice] [pid 1067703:tid 140619697219136] 
AH10059: The Managed Domain
test-md-810-003a-1712260944.org has been setup and changes will be activated on 
next (graceful) server restart.
[Thu Apr 04 22:43:00.904418 2024] [md:debug] [pid 1067703:tid 140619697219136] 
mod_md_drive.c(229): AH10107: next run in 11 hours
59 minutes 58 seconds
[Thu Apr 04 22:43:01.204981 2024] [md:debug] [pid 1067703:tid 140619697219136] 
mod_md_drive.c(236): AH10058: md watchdog stopping
[Thu Apr 04 22:43:01.205094 2024] [watchdog:debug] [pid 1067703:tid 
140619697219136] mod_watchdog.c(257): AH02973: Singleton
Watchdog (_md_renew_) stopping

- one worker thread seems not to stop:

[Thu Apr 04 22:42:59.768569 2024] [core:trace5] [pid 1067703:tid 
140619672041024] protocol.c(714): [client 127.0.0.1:48748]
Request received from client: GET 
/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E HTTP/1.1
[Thu Apr 04 22:42:59.768667 2024] [md:debug] [pid 1067703:tid 140619672041024] 
mod_md.c(1385): [client 127.0.0.1:48748] loading
challenge for test-md-810-003a-1712260944.org 
(/.well-known/acme-challenge/3VAiCadJ5do2TuwIbbh3w2foMGfnCspnm0eYejBSC9E)
[Thu Apr 04 22:42:59.768698 2024] [http:trace3] [pid 1067703:tid 
140619672041024] http_filters.c(1141): [client 127.0.0.1:48748]
Response sent with status 200, headers:
[Thu Apr 04 22:42:59.768706 2024] [http:trace5] [pid 1067703:tid 
140619672041024] http_filters.c(1150): [client 127.0.0.1:48748]
Date: Thu, 04 Apr 2024 20:42:59 GMT
[Thu Apr 04 22:42:59.768712 2024] [http:trace5] [pid 1067703:tid 
140619672041024] http_filters.c(1153): [client 127.0.0.1:48748]
Server: Apache/2.4.59 (Unix) OpenSSL/3.1.5
[Thu Apr 04 22:42:59.768718 2024] [http:trace4] [pid 1067703:tid 
140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
Content-Length: 88
[Thu Apr 04 22:42:59.768724 2024] [http:trace4] [pid 1067703:tid 
140619672041024] http_filters.c(971): [client 127.0.0.1:48748]
Connection: clo