Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric,

The timestamp changed, but the content looks the same.

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric,

http://people.apache.org/~covener/trunk-proxy-segv.diff looks the same. Did you place it somewhere else?

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric,

It's in the works. This involves several other people, as the large server is owned by another group and getting our backend software ready to be called from httpd is somewhat involved. The build needs to be done by someone else as well. I've requested they build with --enable-pool-concurrency-check, as that's what allows me to see which two threads are stepping on each other.

I looked at your change and I see that you changed the logic so that the worker mutex is acquired for the first call to ap_proxy_initialize_worker(). I don't think that hurts anything and it probably makes the code easier to read, but I don't think it has any effect: in the first call (the one that allocates the pool), there is no way for any other thread to be using the pool.

You also moved the worker mutex unlock so it only happens if worker->s->hmax != NULL, and left the two calls to connection_constructor() un-serialized. I don't have definite proof in my notes that this function must be serialized, so hopefully the test will show whether that is ok. However, even in the case where worker->s->hmax == NULL, we need to unlock the worker mutex before returning. If this shows up as a problem in the test, I can make that small change and run again. We should also put back the unlock call in the error condition for init_conn_pool(), as we shouldn't leave the mutex locked if we error out of that routine.

Also, when this is finalized, all the new calls to APLOGNO need to go through your utility to assign message numbers.

> -----Original Message-----
> From: Eric Covener
> Sent: Monday, August 12, 2019 3:25 PM
> To: Apache HTTP Server Development List
> Subject: Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
>
> Hi Don, can you try this very similar patch? I applied yours to my
> sandbox to get more context and made a few minor changes (including
> pre-existing stuff that looked misleading)
>
> http://people.apache.org/~covener/trunk-proxy-segv.diff

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
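[Editor's illustration: to make the unlock-on-every-path rule discussed above concrete, here is a minimal sketch. The types and function names are simplified stand-ins, not the actual proxy_util.c code, and the patch itself may arrange the branches differently.]

/* Minimal sketch of the locking rule: whichever branch the worker
 * initialization takes, the worker mutex must be released before
 * returning, including the error path out of the pool-setup step.
 * All names here are hypothetical simplifications. */
#include <apr_pools.h>
#include <apr_thread_mutex.h>

typedef struct {
    apr_thread_mutex_t *mutex;   /* stand-in for the per-worker mutex */
    apr_pool_t *pool;            /* stand-in for worker->cp->pool     */
    int hmax;                    /* stand-in for worker->s->hmax      */
} sketch_worker;

static apr_status_t sketch_init_conn_pool(sketch_worker *w)
{
    (void)w;  /* ... allocate the connection pool from w->pool ... */
    return APR_SUCCESS;
}

static apr_status_t sketch_initialize_worker(sketch_worker *w)
{
    apr_status_t rv;

    apr_thread_mutex_lock(w->mutex);

    rv = sketch_init_conn_pool(w);
    if (rv != APR_SUCCESS) {
        /* Error path must not leave the mutex held. */
        apr_thread_mutex_unlock(w->mutex);
        return rv;
    }

    if (w->hmax) {
        /* ... create the reslist; connection_constructor() runs here ... */
    }

    /* Unlock on both the hmax and the non-hmax path. */
    apr_thread_mutex_unlock(w->mutex);
    return APR_SUCCESS;
}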
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric,

I'm not sure what you're asking. The global mutex will cover the call to init_conn_pool(), so there isn't a problem with worker->cp getting overlaid, if that's what you're thinking.

The problem is that ap_proxy_initialize_worker() gets called repeatedly with the same worker pointer. The first time it's called for a particular worker, it creates a connection pool and the worker thread mutex. This mutex appears to be used only in ap_proxy_determine_connection(), to protect the setting of worker->cp->addr by apr_sockaddr_info_get() (it also protects use of worker->cp->pool, but it's not clear that was recognized as a requirement). But _every_ use of worker->cp->pool must be serialized.

Subsequent calls to ap_proxy_initialize_worker() with the same worker pointer will use the pool and mutex created in the first call. It's only these calls after the first that are at risk of concurrent use of the pool (hence my use of the 'notfirst' flag). In the first call, I don't even bother getting the proxy mutex, as there isn't any way for other threads to be using the pool before it's created.

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
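[Editor's illustration: a minimal sketch of the first-call vs. later-calls distinction described above. The struct, the flag handling, and the choice of mutex are hypothetical simplifications, not the patch itself.]

/* First call for a worker creates its pool and mutex, so nothing else
 * can be using them yet; only calls after the first need to serialize
 * access to the shared pool. Names are stand-ins for mod_proxy types. */
#include <apr_pools.h>
#include <apr_thread_mutex.h>

typedef struct {
    int local_initialized;        /* set once per process for this worker */
    apr_pool_t *cp_pool;          /* stand-in for worker->cp->pool        */
    apr_thread_mutex_t *tmutex;   /* stand-in for the worker thread mutex */
} sketch_worker;

static void sketch_initialize_worker(sketch_worker *w, apr_pool_t *p,
                                     apr_thread_mutex_t *global_mutex)
{
    int notfirst = w->local_initialized;

    if (!notfirst) {
        /* First call: the pool and mutex do not exist yet, so no other
         * thread can be touching them; no serialization needed here. */
        apr_pool_create(&w->cp_pool, p);
        apr_thread_mutex_create(&w->tmutex, APR_THREAD_MUTEX_DEFAULT,
                                w->cp_pool);
        w->local_initialized = 1;
        return;
    }

    /* Later calls share the pool with threads already running requests,
     * so every allocation from it must be serialized. */
    apr_thread_mutex_lock(global_mutex);
    /* ... apr_palloc(w->cp_pool, ...) and other pool use ... */
    apr_thread_mutex_unlock(global_mutex);
}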
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Eric,

The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when proxy_handler() is called from a thread kicked off by a _previous_ call to ap_proxy_initialize_worker(). Turning on the pool concurrency check shows this happening:

  pool concurrency check: pool 0xa04348(proxy_worker_cp),
      thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use

  172256 . Thawed  libc:gsignal (+0x37)   <--- abort
   1  libc:gsignal (+0x37)
   2  libc:abort (+0x143)
   3  libapr-1:\apr_pools\pool_concurrency_abort 768
   4  libapr-1:\apr_pools\apr_palloc 778 (+0xA)
   5  libapr-1:\thread_cond\apr_thread_cond_create 44
   6  libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)
   7  mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
   8  mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)
   9  httpd:UNKNOWN at 0x004543A0

Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:

  172271 Thawed  libc:__GI_strncmp (+0x1680)   <--- in use
   1  libc:__GI_strncmp (+0x1680)
   2  libc:getenv (+0xBD)
   3  libc:__nscd_getai (+0x3D3)
   4  libc:gaih_inet.constprop.8 (+0x15F2)
   5  libc:getaddrinfo (+0x11F)
   6  libapr-1:\sockaddr\call_resolver 397 (+0x10)
   7  mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
   8  mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
   9  mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
   10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
   11 httpd:UNKNOWN at 0x004543A0

Another possible fix would be to use the global mutex everywhere it's needed and remove the creation and use of the worker thread lock altogether, but that would mean changing the code that currently uses the worker proxy thread lock, and it seemed cleaner to change just proxy_util.c rather than two source modules.

The two functions in ap_proxy_acquire_connection() that I serialized were found from other runs. Both apr_reslist_acquire() and connection_constructor() end up using the same worker pool that is being used by ap_proxy_determine_connection(). e.g.

  pool concurrency check: pool 0x1d93198(proxy_worker_cp),
      thread cur 7f019bfff700 in use by 7f019abfd700, state in use -> in use

  aborting thread:
   1  Thread 0x7f019bfff700 (LWP 106622) 0x7f01a735c207 in raise () from /lib64/libc.so.6
   3  libapr-1:\apr_pools\pool_concurrency_abort 768
   4  libapr-1:\apr_pools\apr_palloc 778 (+0xA)
   5  libaprutil-1:\apr_reslist\get_container 100 (+0x8)
   6  libaprutil-1:\apr_reslist\apr_reslist_acquire 121 (+0x3)
   7  mod_proxy:\proxy_util\ap_proxy_acquire_connection 2327 (+0x3)
   8  mod_proxy_http:\mod_proxy_http\proxy_http_handler 1933 (+0x19)
   9  mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
   10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
   11 httpd:UNKNOWN at 0x004543A0

And the "in-use" thread:

   5  Thread 0x7f019abfd700 (LWP 106624) 0x7f01a7424bed in connect () from /lib64/libc.so.6
   8  mod_proxy:\proxy_util\ap_proxy_determine_connection 2537 (+0x15)
   9  mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
   10 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
   11 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)

Once the workers are up and running, they use thread-specific storage, but all the code that starts them up uses a pool that needs to be serialized. worker->cp as used in ap_proxy_acquire_connection() is currently not serialized when running with the worker MPM. It can run at the same time as ap_proxy_initialize_worker(), with the same pool used in both. The problem doesn't show up when running the same test with the prefork or event MPMs.
-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
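[Editor's illustration: a minimal sketch (hypothetical, simplified names; not the actual patch) of the rule the traces above point at: every path that allocates from the shared proxy_worker_cp pool -- the resolver call reached from ap_proxy_determine_connection() and the reslist/constructor path reached from ap_proxy_acquire_connection() -- has to take the same lock, otherwise apr_palloc() can run on the pool from two threads at once, which is exactly what the pool concurrency check aborts on.]

#include <apr_pools.h>
#include <apr_thread_mutex.h>
#include <apr_network_io.h>

/* Shared state standing in for worker->cp. */
typedef struct {
    apr_pool_t *pool;            /* the proxy_worker_cp pool              */
    apr_thread_mutex_t *tmutex;  /* one lock covering every use of pool   */
    apr_sockaddr_t *addr;
} sketch_conn_pool;

/* Path A: stand-in for the resolver call made while determining the
 * backend connection. */
static apr_status_t sketch_resolve(sketch_conn_pool *cp, const char *host,
                                   apr_port_t port)
{
    apr_status_t rv;
    apr_thread_mutex_lock(cp->tmutex);
    rv = apr_sockaddr_info_get(&cp->addr, host, APR_UNSPEC, port, 0, cp->pool);
    apr_thread_mutex_unlock(cp->tmutex);
    return rv;
}

/* Path B: stand-in for the reslist/constructor path, which also
 * allocates from the same pool. */
static void *sketch_acquire(sketch_conn_pool *cp, apr_size_t sz)
{
    void *conn;
    apr_thread_mutex_lock(cp->tmutex);
    conn = apr_palloc(cp->pool, sz);  /* the real code calls apr_reslist_acquire() etc. */
    apr_thread_mutex_unlock(cp->tmutex);
    return conn;
}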
RE: [PATCH 63503] - Reverse proxy server - SIGSEGV
Hello,

I see that proxy_util.c was updated a few days ago. Would it be considered a faux pas to contact the developer directly to request a review of this patch? It's been almost two months and I'm worried it will just get lost in the noise.

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
[PATCH 63503] - Reverse proxy server - SIGSEGV
Hello.

I submitted this patch last week and I'd just like to solicit comments on how to proceed. See this for the text of the bug and fix: https://bz.apache.org/bugzilla/show_bug.cgi?id=63503

Unfortunately, I don't have a test that exhibits the bug that I can ask others to run. We were only able to see it when running a LoadRunner script emulating thousands of concurrent users.

As I see it, there are two potential problems in the fix:

1) Adding mutex calls will single-thread code that previously was able to run on multiple processors, and so imposes a performance hit.

2) There's a possibility of a "deadly embrace" if some code path uses the locks in an order not seen in our tests.

I think #2 would become evident pretty quickly if we were able to get a test run that exercises the mod_proxy code more fully than our one server. #1, I think, is just overhead required for thread safety. It would be nice if all APR pools were required to be accessed by only a single thread, but since they aren't, they must be serialized externally whenever there is the possibility of concurrent access.

The presence of the --enable-pool-concurrency-check option made fixing this problem a lot easier. I couldn't find a previous report of its use, but obviously someone thought it would be helpful.

Let me know if there are questions I can answer about the patch or the core files that the bug produced. Thanks!

-- 
Don Poitras - Host R&D    SAS Institute Inc.    SAS Campus Drive
mailto:sas...@sas.com    (919)531-5637   Fax: 677-    Cary, NC 27513
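[Editor's illustration of the "deadly embrace" concern in point 2 above: if one code path takes the global proxy lock and then a per-worker lock while another path takes them in the opposite order, the two threads can deadlock. A common guard is to impose a single lock ordering everywhere. The locks and function below are hypothetical, not taken from the patch.]

#include <apr_thread_mutex.h>

/* Assumed ordering: always the global lock first, then the per-worker lock. */
static void sketch_do_serialized_work(apr_thread_mutex_t *global_mutex,
                                      apr_thread_mutex_t *worker_mutex)
{
    apr_thread_mutex_lock(global_mutex);   /* rank 1 */
    apr_thread_mutex_lock(worker_mutex);   /* rank 2 */

    /* ... touch state protected by both locks ... */

    apr_thread_mutex_unlock(worker_mutex); /* release in reverse order */
    apr_thread_mutex_unlock(global_mutex);
}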