Re: [HACKERS] Strange hanging bug in a simple milter
* Vesa-Matti J Kari (vmk...@cc.helsinki.fi) wrote: > Many thanks to all who contributed to the fix. Great! Thanks for the report and the testing. Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
Hello, On Mon, 23 Sep 2013, Stephen Frost wrote: > I've now committed a fix for this issue. I cloned the 9.4devel branch and linked my authmilter and a test program (based on Heikki's earlier design) against the libpq that comes with it. After hours of pretty extensive stress testing using 2, 3, 4, 5, 6, 50, 100 and 500 parallel threads, I have observed no deadlocks. Everything runs smoothly. Many thanks to all who contributed to the fix. Regards, vmk -- Tietotekniikkakeskus / Helsingin yliopisto IT department / University of Helsinki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
Vesa-Matti J Kari, I've now committed a fix for this issue. If you have opportunity to, it'd be great to pull down the latest git (for whichever supported branch you'd like) and give it a try. Otherwise, the fix should be out with our next round of point releases (which I expect will be 'soon'). Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
* Andres Freund (and...@2ndquadrant.com) wrote: > That patch looks wrong to me. Note that the if (conn->ssl) branch resets > conn->ssl to NULL. huh, it figures one would overlook the simplest things. Of course it's not locking up now- we never remove the hooks (as my original patch was doing :). That's what I get for trying to go to meetings and code at the same time. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote: > Actually, I think there's a pre-existing bug there in git master. If > the SSL_set_app_data or SSL_set_fd call in pqsecure_open_client > fails for some reason, it will call close_SSL() with conn->ssl > already set, and the mutex held. close_SSL() will call > pqsecure_destroy()->destroy_SSL()->destory_ssl_system(), which will > try to acquire the mutex again. Yeah, good point, and that one looks like my fault. Moving it after the unlock should address that tho, no? Looking through the other close_SSL() calls, I don't see any other situations where it might be called with the lock held. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
On 2013-09-13 15:03:31 -0400, Stephen Frost wrote: > * Andres Freund (and...@2ndquadrant.com) wrote: > > It seems slightly cleaner to just move the pqsecure_destroy(); to the > > end of that function, based on a boolean. But if you think otherwise, I > > won't protest... > > Hmm, agreed; I had originally been concerned that the SIGPIPE madness > needed to be around the pqsecure_destroy() call, but I can't see why > that would be. > > I've run through a few times w/ the attached and haven't seen the > deadlock. Will continue testing, of course. That patch looks wrong to me. Note that the if (conn->ssl) branch resets conn->ssl to NULL. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
* Andres Freund (and...@2ndquadrant.com) wrote: > It seems slightly cleaner to just move the pqsecure_destroy(); to the > end of that function, based on a boolean. But if you think otherwise, I > won't protest... Hmm, agreed; I had originally been concerned that the SIGPIPE madness needed to be around the pqsecure_destroy() call, but I can't see why that would be. I've run through a few times w/ the attached and haven't seen the deadlock. Will continue testing, of course. Heikki, any thoughts regarding this change? Any concerns about back-patching it? Thanks, Stephen diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c new file mode 100644 index 3bd0113..ca2c5e4 100644 *** a/src/interfaces/libpq/fe-secure.c --- b/src/interfaces/libpq/fe-secure.c *** close_SSL(PGconn *conn) *** 1545,1551 SSL_shutdown(conn->ssl); SSL_free(conn->ssl); conn->ssl = NULL; - pqsecure_destroy(); /* We have to assume we got EPIPE */ REMEMBER_EPIPE(spinfo, true); RESTORE_SIGPIPE(conn, spinfo); --- 1545,1550 *** close_SSL(PGconn *conn) *** 1565,1570 --- 1564,1572 conn->engine = NULL; } #endif + + if (conn->ssl) + pqsecure_destroy(); } /* signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
On 13.09.2013 22:03, Stephen Frost wrote: * Andres Freund (and...@2ndquadrant.com) wrote: It seems slightly cleaner to just move the pqsecure_destroy(); to the end of that function, based on a boolean. But if you think otherwise, I won't protest... Hmm, agreed; I had originally been concerned that the SIGPIPE madness needed to be around the pqsecure_destroy() call, but I can't see why that would be. I've run through a few times w/ the attached and haven't seen the deadlock. Will continue testing, of course. Heikki, any thoughts regarding this change? Any concerns about back-patching it? Umm, with that patch, pqsecure_destroy() is never called. The "if (conn->ssl)" test that's now at the end of the close_SSL function is never true, because conn->ssl is set to NULL earlier. I'm afraid the "move_locks.diff" patch you posted earlier is also broken; close_SSL() is called in error scenarios from pqsecure_open_client(), while already holding the mutex. So it will deadlock with itself if the connection cannot be established. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote: > Umm, with that patch, pqsecure_destroy() is never called. The "if > (conn->ssl)" test that's now at the end of the close_SSL function is > never true, because conn->ssl is set to NULL earlier. Yeah, got ahead of myself, as Andres pointed out. > I'm afraid the "move_locks.diff" patch you posted earlier is also > broken; close_SSL() is called in error scenarios from > pqsecure_open_client(), while already holding the mutex. So it will > deadlock with itself if the connection cannot be established. Sorry, I was really just tossing it up to see if it really did avoid this particular deadlock. I'm running some tests now with the attached to see if I can get it to deadlock now. Thanks, Stephen diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c new file mode 100644 index 3bd0113..c008dcf 100644 *** a/src/interfaces/libpq/fe-secure.c --- b/src/interfaces/libpq/fe-secure.c *** open_client_SSL(PGconn *conn) *** 1537,1542 --- 1537,1544 static void close_SSL(PGconn *conn) { + bool destroy_needed = conn->ssl ? true : false; + if (conn->ssl) { DECLARE_SIGPIPE_INFO(spinfo); *** close_SSL(PGconn *conn) *** 1545,1551 SSL_shutdown(conn->ssl); SSL_free(conn->ssl); conn->ssl = NULL; - pqsecure_destroy(); /* We have to assume we got EPIPE */ REMEMBER_EPIPE(spinfo, true); RESTORE_SIGPIPE(conn, spinfo); --- 1547,1552 *** close_SSL(PGconn *conn) *** 1565,1570 --- 1566,1582 conn->engine = NULL; } #endif + + /* + * This will remove our SSL locking hooks, if this is the last SSL + * connection, which means we must wait to call it until after all + * SSL calls have been made, otherwise we can end up with a race + * condition and possible deadlocks. + * + * See comments above destroy_ssl_system(). + */ + if (destroy_needed) + pqsecure_destroy(); } /* signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
On 13.09.2013 22:26, Heikki Linnakangas wrote: I'm afraid the "move_locks.diff" patch you posted earlier is also broken; close_SSL() is called in error scenarios from pqsecure_open_client(), while already holding the mutex. So it will deadlock with itself if the connection cannot be established. Actually, I think there's a pre-existing bug there in git master. If the SSL_set_app_data or SSL_set_fd call in pqsecure_open_client fails for some reason, it will call close_SSL() with conn->ssl already set, and the mutex held. close_SSL() will call pqsecure_destroy()->destroy_SSL()->destory_ssl_system(), which will try to acquire the mutex again. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
On 2013-09-13 13:59:54 -0400, Stephen Frost wrote: > Unfortunately, while I can still easily get the deadlock to happen when > the hooks are reset, the hooks don't appear to ever get called when > ssl_open_connections is set to zero. You have a good point about the > additional SSL calls after the hooks are unloaded though, I wonder if > holding the ssl_config_mutex lock over all of close_SSL might be more > sensible.. Hm. You might want to do the check for ssl_open_connections after a memory barrier or such, it might not actually be up2date (too tired to think about the actual guarantees of x86's cache coherency guarantees). Does it burn if you move the pqsecure_destroy(); to the end of the function, after setting a boolean in the if (conn->ssl) branch? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
On 2013-09-13 14:33:25 -0400, Stephen Frost wrote: > * Stephen Frost (sfr...@snowman.net) wrote: > > * Andres Freund (and...@2ndquadrant.com) wrote: > > > Hm. close_SSL() first does pqsecure_destroy() which will unset the > > > callbacks, and the count and then goes on to do X509_free() and > > > ENGINE_finish(), ENGINE_free() if either is used. > > > > > > It's not implausible that one of those actually needs locking. I doubt > > > engines play a role here, but, without having looked at the testcase, > > > X509_free() might be a possibility. > > > > Unfortunately, while I can still easily get the deadlock to happen when > > the hooks are reset, the hooks don't appear to ever get called when > > ssl_open_connections is set to zero. You have a good point about the > > additional SSL calls after the hooks are unloaded though, I wonder if > > holding the ssl_config_mutex lock over all of close_SSL might be more > > sensible.. > > I went ahead and moved the locks to be around all of close_SSL() and > haven't been able to reproduce the deadlock, so perhaps those calls are > the issue and what's happening is that another thread is dropping or > adding the hooks in a common place while the X509_free, etc, are trying > to figure out if they should be calling the locking functions or not, > but there's a race because there's no higher-level locking happening > around those. > > Attached is a patch to move those and which doesn't deadlock for me. It seems slightly cleaner to just move the pqsecure_destroy(); to the end of that function, based on a boolean. But if you think otherwise, I won't protest... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
* Andres Freund (and...@2ndquadrant.com) wrote: > On 2013-09-13 13:15:34 -0400, Stephen Frost wrote: > > Good thought. Got sucked into a meeting but once I'm out I'll try having > > the lock/unlock routines abort if they're called while ssl_open_connections > > is zero, which should not be happening, but seems like it is. > > Hm. close_SSL() first does pqsecure_destroy() which will unset the > callbacks, and the count and then goes on to do X509_free() and > ENGINE_finish(), ENGINE_free() if either is used. > > It's not implausible that one of those actually needs locking. I doubt > engines play a role here, but, without having looked at the testcase, > X509_free() might be a possibility. Unfortunately, while I can still easily get the deadlock to happen when the hooks are reset, the hooks don't appear to ever get called when ssl_open_connections is set to zero. You have a good point about the additional SSL calls after the hooks are unloaded though, I wonder if holding the ssl_config_mutex lock over all of close_SSL might be more sensible.. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
* Stephen Frost (sfr...@snowman.net) wrote: > * Andres Freund (and...@2ndquadrant.com) wrote: > > Hm. close_SSL() first does pqsecure_destroy() which will unset the > > callbacks, and the count and then goes on to do X509_free() and > > ENGINE_finish(), ENGINE_free() if either is used. > > > > It's not implausible that one of those actually needs locking. I doubt > > engines play a role here, but, without having looked at the testcase, > > X509_free() might be a possibility. > > Unfortunately, while I can still easily get the deadlock to happen when > the hooks are reset, the hooks don't appear to ever get called when > ssl_open_connections is set to zero. You have a good point about the > additional SSL calls after the hooks are unloaded though, I wonder if > holding the ssl_config_mutex lock over all of close_SSL might be more > sensible.. I went ahead and moved the locks to be around all of close_SSL() and haven't been able to reproduce the deadlock, so perhaps those calls are the issue and what's happening is that another thread is dropping or adding the hooks in a common place while the X509_free, etc, are trying to figure out if they should be calling the locking functions or not, but there's a race because there's no higher-level locking happening around those. Attached is a patch to move those and which doesn't deadlock for me. Thoughts? Thanks, Stephen diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c new file mode 100644 index 3bd0113..e14c301 100644 *** a/src/interfaces/libpq/fe-secure.c --- b/src/interfaces/libpq/fe-secure.c *** destroy_ssl_system(void) *** 1009,1017 { #ifdef ENABLE_THREAD_SAFETY /* Mutex is created in initialize_ssl_system() */ - if (pthread_mutex_lock(&ssl_config_mutex)) - return; - if (pq_init_crypto_lib && ssl_open_connections > 0) --ssl_open_connections; --- 1009,1014 *** destroy_ssl_system(void) *** 1031,1037 */ } - pthread_mutex_unlock(&ssl_config_mutex); #endif } --- 1028,1033 *** open_client_SSL(PGconn *conn) *** 1537,1542 --- 1533,1543 static void close_SSL(PGconn *conn) { + #ifdef ENABLE_THREAD_SAFETY + if (pthread_mutex_lock(&ssl_config_mutex)) + return; + #endif + if (conn->ssl) { DECLARE_SIGPIPE_INFO(spinfo); *** close_SSL(PGconn *conn) *** 1565,1570 --- 1566,1575 conn->engine = NULL; } #endif + + #ifdef ENABLE_THREAD_SAFETY + pthread_mutex_unlock(&ssl_config_mutex); + #endif } /* signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
On 2013-09-13 13:15:34 -0400, Stephen Frost wrote: > Andres, > > On Friday, September 13, 2013, Andres Freund wrote: > > > > It'd be interesting to replace the origin callbacks with one immediately > > doing an abort() or similar to see whether they maybe are called after > > they shouldn't be and from where. > > > > Good thought. Got sucked into a meeting but once I'm out I'll try having > the lock/unlock routines abort if they're called while ssl_open_connections > is zero, which should not be happening, but seems like it is. Hm. close_SSL() first does pqsecure_destroy() which will unset the callbacks, and the count and then goes on to do X509_free() and ENGINE_finish(), ENGINE_free() if either is used. It's not implausible that one of those actually needs locking. I doubt engines play a role here, but, without having looked at the testcase, X509_free() might be a possibility. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
Andres, On Friday, September 13, 2013, Andres Freund wrote: > > It'd be interesting to replace the origin callbacks with one immediately > doing an abort() or similar to see whether they maybe are called after > they shouldn't be and from where. > Good thought. Got sucked into a meeting but once I'm out I'll try having the lock/unlock routines abort if they're called while ssl_open_connections is zero, which should not be happening, but seems like it is. Thanks, Stephen
Re: [HACKERS] Strange hanging bug in a simple milter
On 2013-09-13 12:40:11 -0400, Stephen Frost wrote: > Heikki, all, > > * Stephen Frost (sfr...@snowman.net) wrote: > > Very curious. Out of time right now to look into it, but will probably > > be back at it later tonight. > > Alright, I was back at this a bit today and decided to go with a hunch- > and it looks like I might have been right to try. > > Leaving the locking callback hooks in place appears to prevent the > deadlocks from happening, at least with this little app. > > IOW, in destroy_ssl_system(), simply arrange to not have > CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL); > called. I've done this with the very simple attached patch. Per the > comment above destroy_ssl_system(), this doesn't seem to be an > acceptable solution because libpq might get unloaded from the > application, but perhaps it points the way towards what the issue is. It'd be interesting to replace the origin callbacks with one immediately doing an abort() or similar to see whether they maybe are called after they shouldn't be and from where. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
Heikki, all, * Stephen Frost (sfr...@snowman.net) wrote: > Very curious. Out of time right now to look into it, but will probably > be back at it later tonight. Alright, I was back at this a bit today and decided to go with a hunch- and it looks like I might have been right to try. Leaving the locking callback hooks in place appears to prevent the deadlocks from happening, at least with this little app. IOW, in destroy_ssl_system(), simply arrange to not have CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL); called. I've done this with the very simple attached patch. Per the comment above destroy_ssl_system(), this doesn't seem to be an acceptable solution because libpq might get unloaded from the application, but perhaps it points the way towards what the issue is. My thought had been that there was an issue with regard to signal handling, but I've been unable to confirm that, nor is it clear why that would be the case. In any case, I'm curious what others think of these results and if anyone can reproduce the deadlock with this patch applied. Thanks! Stephen diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c new file mode 100644 index 3bd0113..e025b49 100644 *** a/src/interfaces/libpq/fe-secure.c --- b/src/interfaces/libpq/fe-secure.c *** static SSL_CTX *SSL_context = NULL; *** 102,107 --- 102,108 #ifdef ENABLE_THREAD_SAFETY static long ssl_open_connections = 0; + static int initialized_hooks = 0; #ifndef WIN32 static pthread_mutex_t ssl_config_mutex = PTHREAD_MUTEX_INITIALIZER; *** init_ssl_system(PGconn *conn) *** 947,952 --- 948,955 if (ssl_open_connections++ == 0) { + initialized_hooks = 1; + /* These are only required for threaded libcrypto applications */ CRYPTO_set_id_callback(pq_threadidcallback); CRYPTO_set_locking_callback(pq_lockingcallback); *** destroy_ssl_system(void) *** 1015,1021 if (pq_init_crypto_lib && ssl_open_connections > 0) --ssl_open_connections; ! if (pq_init_crypto_lib && ssl_open_connections == 0) { /* No connections left, unregister libcrypto callbacks */ CRYPTO_set_locking_callback(NULL); --- 1018,1024 if (pq_init_crypto_lib && ssl_open_connections > 0) --ssl_open_connections; ! if (pq_init_crypto_lib && ssl_open_connections == 0 && initialized_hooks == 0) { /* No connections left, unregister libcrypto callbacks */ CRYPTO_set_locking_callback(NULL); signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
Heikki, * Heikki Linnakangas (hlinnakan...@vmware.com) wrote: > Hmm. Are you sure you're getting an SSL connection? Run it with > something like this to make sure: sslmode=require doesn't help on Unix domain connections. :) Was able to get it to lock with both 9.2.4 and master, and with both versions of the SSL library (1.0.1c-4ubuntu8.1 and 1.0.1e-3). The lock that I got showed these stacks: Thread #2: pthread_mutex_lock init_ssl_system ; fe-secure.c:917 Thread #3: pthread_mutex_lock pq_lockingcallback ... SSL_new pqsecure_open_client ; fe-secure.c:275 Thread #2 waiting at 917 makes sense, he's waiting on the lock that the other thread has on ssl_config_mutex before moving in to set up his own SSL connection. What's odd is how is thread #3 waiting on a lock in the lock array. Both threads agree that ssl_open_connections is only 1 (thread #3's ; thread #2 hasn't gotten to incrementing it yet). Looking at the lock array, only one of the locks is taken out and it's owner is thread #3, meaning that SSL apparently caused a deadlock by trying to take a lock which it's already taken. Changing the lock type to be recursive instead masks the self-locking issue, but then I got a case where, with the same stack traces as above, the lock in the array was held by thread #2 instead, where thread #2 is in init_ssl_system- well before it's even made any calls into SSL since the previous PQdestroy happened. I've also caught it where a thread is still holding a lock when it drops into destroy_ssl_system() by simply trying to unlock all of the locks in the array. With the recursive lock type, all such attempts should simply error out (either it's locked by someone else, or it's already unlocked) and so I checked for *successful* unlocks: 2: DEBUG: database connection established 2: DEBUG: about to call PQfinish() successfully unlocked mutex! Having that happen can then cause the deadlock because the other thread can end up waiting on that lock that we're still holding while in destroy_ssl_system(), where we're waiting on the ssl_config_mutex lock that the first thread has. Even leaving that code in there, which unlocks all the locks during destroy_ssl_system(), it's still deadlocked on me with the same stack trace as above, with thread #2 holding a lock in the pq_lockarray which thread #3 is trying to get (while thread #3 holds the ssl_config_mutex lock that thread #2 is waiting on). Very curious. Out of time right now to look into it, but will probably be back at it later tonight. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
On 10.09.2013 18:10, Stephen Frost wrote: I've run your test program against both git master and 9.2.4 on a couple of Ubuntu 13.04 boxes and all I see are tons of these: 1: DEBUG: database connection established 1: DEBUG: about to call PQfinish() 1: DEBUG: database connection established 1: DEBUG: about to call PQfinish() 2: DEBUG: database connection established 2: DEBUG: about to call PQfinish() But no deadlocks; even across multiple runs, they all terminate normally. Hmm. Are you sure you're getting an SSL connection? Run it with something like this to make sure: ./threaded-connect "dbname=postgres sslmode=require host=localhost" For 9.2.4, I've got: libpq5 9.2.4-1.pgdg12.4+1 libssl1.0.0:amd64 1.0.1c-4ubuntu8.1 With master, it's: libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319 libssl1.0.0:amd64 1.0.1c-4ubuntu8.1 What versions of libssl are you working with..? Is it possible this has actually been fixed already? $ dpkg -s libssl1.0.0:amd64 | grep Vers Version: 1.0.1e-3 - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
Heikki, * Heikki Linnakangas (hlinnakan...@vmware.com) wrote: > Thanks! I tested with git master. I've run your test program against both git master and 9.2.4 on a couple of Ubuntu 13.04 boxes and all I see are tons of these: 1: DEBUG: database connection established 1: DEBUG: about to call PQfinish() 1: DEBUG: database connection established 1: DEBUG: about to call PQfinish() 2: DEBUG: database connection established 2: DEBUG: about to call PQfinish() But no deadlocks; even across multiple runs, they all terminate normally. For 9.2.4, I've got: libpq5 9.2.4-1.pgdg12.4+1 libssl1.0.0:amd64 1.0.1c-4ubuntu8.1 With master, it's: libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319 libssl1.0.0:amd64 1.0.1c-4ubuntu8.1 What versions of libssl are you working with..? Is it possible this has actually been fixed already? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
Alvaro, * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > Heikki Linnakangas wrote: > > I'll dig into that, but right now it seems like an OpenSSL or > > libcrypto bug to me. Or something in the way we use them, although I > > can't see anything obviously wrong in the libpq code at a quick > > glance. > > Can you please try with ssl_renegotiation_limit=0? > > [ looks ] Uh, actually you don't even send data in those connections in > your test program, do you? Maybe there's a problem with the mutex stuff > committed recently by Stephen. I was wondering about that also, but it was apparently an issue even before that change (it was reported against 9.1.9). Also, Heikki's analysis appears to show cases where two threads end up waiting on the same entry in the lockarray, which I don't think my changes would have impacted at all. In any case, I hope to find time this afternoon/evening to try this against libpq from before and after, just to be sure and rule out that patch. Assuming that pans out, I tend to agree w/ Heikki that we should test this outside of libpq entirely and see if we can reproduce it. Even if we're able to do that, we may need to consider ways to fix it ourselves (perhaps be holding heavier locks or something), as we have no idea how long it'll take an OpenSSL fix to happen.. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
Heikki Linnakangas wrote: > I'll dig into that, but right now it seems like an OpenSSL or > libcrypto bug to me. Or something in the way we use them, although I > can't see anything obviously wrong in the libpq code at a quick > glance. Can you please try with ssl_renegotiation_limit=0? [ looks ] Uh, actually you don't even send data in those connections in your test program, do you? Maybe there's a problem with the mutex stuff committed recently by Stephen. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
On 09.09.2013 18:20, Stephen Frost wrote: Vesa-Matti, Heikki, * Heikki Linnakangas (hlinnakan...@vmware.com) wrote: On 09.09.2013 15:36, Vesa-Matti J Kari wrote: If I interpret this correctly, threads #2 and #3 are waiting for the same lock but they make no progress. A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL enabled in my test server. As soon as I turned it on, it hung. Attached is a small stand-alone test program to reproduce it. You can pass a libpq connection string as argument to it. Interesting... Which version of libpq were you working against? I see that Vesa-Matti had the problem happen w/ 9.1.9, which should have been before the changes that I made to add locking around our usage of SSL_context, as otherwise we would end up in situations where we'd dump core, but he also had it with 9.3rc1, which should have included it. I had tested the patch w/ a pretty good amount of concurrent threads fired off from a little python script and didn't run into any deadlocks there.. Vesa-Matti, was this working previously, and are you sure you were testing with 9.3rc1's libpq? Heikki, which are you testing against and perhaps you might try before and after? I'll be able to look into it more in a few hours also. Thanks! I tested with git master. I added printf()s into the pq_lockingcallback function, and got a trace where both threads got stuck waiting for lock 10 in the pq_lockarray. It looks like someone is failing to release it. The backtrace for both threads look like this: #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x7fad49826f3c in _L_lock_974 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x7fad49826d8b in __GI___pthread_mutex_lock (mutex=0x7fad3800a260) at pthread_mutex_lock.c:64 #3 0x7fad49a53f08 in pq_lockingcallback (line=175, file=0x7fad48fa8bb3 "x_pubkey.c", n=10, mode=) at fe-secure.c:872 #4 pq_lockingcallback (mode=, n=10, file=0x7fad48fa8bb3 "x_pubkey.c", line=175) at fe-secure.c:868 #5 0x7fad48f396ab in X509_PUBKEY_get () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #6 0x7fad48f56292 in X509_get_pubkey_parameters () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #7 0x7fad48f5649c in X509_verify_cert () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #8 0x7fad4924f14a in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 #9 0x7fad4922ce1c in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 #10 0x7fad492310d2 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 #11 0x7fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at fe-secure.c:1463 #12 pqsecure_open_client (conn=conn@entry=0x7fad380397d0) at fe-secure.c:306 #13 0x7fad49a44fb6 in PQconnectPoll (conn=conn@entry=0x7fad380397d0) at fe-connect.c:2123 #14 0x7fad49a4618e in connectDBComplete (conn=conn@entry=0x7fad380397d0) at fe-connect.c:1521 #15 0x7fad49a46b47 in PQconnectdb (conninfo=) at fe-connect.c:516 #16 0x004007b6 in test_connect (threadid=2) at threaded-connect.c:25 #17 0x0040086e in run_thread (arg=0x600e04 ) at threaded-connect.c:55 #18 0x7fad49824e0e in start_thread (arg=0x7fad48203700) at pthread_create.c:311 #19 0x7fad4955993d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Sometimes the lockup happens differently, with one thread hung up in SSL_init() and another waiting for the first one on ssl_config_mutex, for example. A good next step might be to create a standalone program that doesn't use libpq at all, but just calls X509_verify_cert() concurrently in two threads. Or open plain SSL connections. If the deadlock can be reproduced with that, then we could just report the bug to the OpenSSL and hope that they can figure it out. - Heikki diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c index 3bd0113..3e70306 100644 --- a/src/interfaces/libpq/fe-secure.c +++ b/src/interfaces/libpq/fe-secure.c @@ -871,11 +871,15 @@ pq_lockingcallback(int mode, int n, const char *file, int line) { if (pthread_mutex_lock(&pq_lockarray[n])) PGTHREAD_ERROR("failed to lock mutex"); + printf("%ld locking callback: lock %d %s:%d\n", pthread_self(), n, file, line); + fflush(stdout); } else { if (pthread_mutex_unlock(&pq_lockarray[n])) PGTHREAD_ERROR("failed to unlock mutex"); + printf("%ld locking callback: unlock %d %s:%d\n", pthread_self(), n, file, line); + fflush(stdout); } } #endif /* ENABLE_THREAD_SAFETY */ lock-trace.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Strange hanging bug in a simple milter
On 09.09.2013 15:36, Vesa-Matti J Kari wrote: It looks like a deadlock situation of some kind... (gdb) thread 2 [Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))] #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x7fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0 #3 0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5 #4 0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #5 0x7fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #6 0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #7 0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #8 0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #9 0x7fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #11 0x7fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #12 0x7fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #13 0x7fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5 #18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5 #19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5 #20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5 #21 0x00401ba5 in authmilt_connect (ctx=0xe81b60, hostname=0x7fe628c0 "localhost", hostaddr=0x7fe62f7fdce0) at authmilter.c:212 #22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1 #23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1 #24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1 #25 0x7fe64c0aee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #27 0x in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 0x7fe62700 (LWP 27283))] #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x7fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0 #3 0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5 #4 0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #5 0x7fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #6 0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #7 0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #8 0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #9 0x7fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #11 0x7fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #12 0x7fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #13 0x7fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5 #18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5 #19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5 #20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5 #21 0x00401ba5 in authmilt_connect (ctx=0xe818e0, hostname=0x7fe6280008c0 "localhost", hostaddr=0x7fe62fffece0) at authmilter.c:212 #22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1 #23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1 #24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1 #25 0x7fe64c0aee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #27 0x in ?? () If I interpret this correctly, threads #2 and #3 are waiting for the same lock but they make no progress. A-ha, the deadlock happens while doi
Re: [HACKERS] Strange hanging bug in a simple milter
Vesa-Matti, Heikki, * Heikki Linnakangas (hlinnakan...@vmware.com) wrote: > On 09.09.2013 15:36, Vesa-Matti J Kari wrote: > >If I interpret this correctly, threads #2 and #3 are waiting for the same > >lock but they make no progress. > > A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL > enabled in my test server. As soon as I turned it on, it hung. > > Attached is a small stand-alone test program to reproduce it. You > can pass a libpq connection string as argument to it. Interesting... Which version of libpq were you working against? I see that Vesa-Matti had the problem happen w/ 9.1.9, which should have been before the changes that I made to add locking around our usage of SSL_context, as otherwise we would end up in situations where we'd dump core, but he also had it with 9.3rc1, which should have included it. I had tested the patch w/ a pretty good amount of concurrent threads fired off from a little python script and didn't run into any deadlocks there.. Vesa-Matti, was this working previously, and are you sure you were testing with 9.3rc1's libpq? Heikki, which are you testing against and perhaps you might try before and after? I'll be able to look into it more in a few hours also. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Strange hanging bug in a simple milter
Hello, On Mon, 9 Sep 2013, Heikki Linnakangas wrote: > I managed to set that up and got it running. Many thanks for taking the time. > But it works fine for me, does not hang. Okay. Have you tried increasing the iterations for the smtp sender scripts? And could you please specify what is your test environment like (i.e. OS and the related library versions)? > I'd suggest poking around with gdb, to see where it hangs. I have actually done that, but it only show the main listener thread from the libmilter library: (gdb) bt #0 0x7fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7fe64c4f7b46 in mi_listener () from /usr/lib/libmilter.so.1.0.1 #2 0x7fe64c4f8707 in smfi_main () from /usr/lib/libmilter.so.1.0.1 #3 0x00402c8f in main (argc=15, argv=0x7fffa6560e68) at authmilter.c:699 Hmmm. The man page mentioned no threads, but Google was helpful and suggested "info threads" so here goes: (I hope alpine will not wrap these long lines) (gdb) info threads Id Target Id Frame 9Thread 0x7fe64700c700 (LWP 14362) "authmilter" 0x7fe64c0b69f7 in do_sigwait () from /lib/x86_64-linux-gnu/libpthread.so.0 8Thread 0x7fe64680b700 (LWP 14363) "authmilter" 0x7fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6 7Thread 0x7fe645809700 (LWP 14365) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 6Thread 0x7fe645008700 (LWP 22404) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 5Thread 0x7fe64600a700 (LWP 27263) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 4Thread 0x7fe644807700 (LWP 27264) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 3Thread 0x7fe62700 (LWP 27283) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 2Thread 0x7fe62f7fe700 (LWP 27284) "authmilter" 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 * 1Thread 0x7fe64c8fd740 (LWP 14361) "authmilter" 0x7fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6 It looks like a deadlock situation of some kind... (gdb) thread 2 [Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))] #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7fe64c0b1065 in _L_lock_858 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x7fe64c0b0eba in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0 #3 0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5 #4 0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #5 0x7fe64b77a915 in RSA_new_method () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #6 0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #7 0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #8 0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #9 0x7fe64b7bd0c4 in ASN1_item_d2i () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #11 0x7fe64b7b461a in X509_PUBKEY_get () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #12 0x7fe64b7d119a in X509_get_pubkey_parameters () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #13 0x7fe64b7d1398 in X509_verify_cert () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5 #18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5 #19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5 #20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5 #21 0x00401ba5 in authmilt_connect (ctx=0xe81b60, hostname=0x7fe628c0 "localhost", hostaddr=0x7fe62f7fdce0) at authmilter.c:212 #22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1 #23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1 #24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1 #25 0x7fe64c0aee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #27 0x in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 0x7fe62700 (LWP 27283))] #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 (gdb) bt #0 0x7fe64c0b589c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x7fe64c0b1065 in _L_lock_858 () from /lib/
Re: [HACKERS] Strange hanging bug in a simple milter
On 09.09.2013 09:34, Vesa-Matti J Kari wrote: Basically all that the authmilter now does is to connect to PostgreSQL in authmilt_connect() and close the connection in authmilt_close(). Based on the authmilter debug logging it seems to me that when the hanging occurs, the authmilter never completes PQsetdbLogin(). Based on the document http://www.postgresql.org/docs/9.1/interactive/libpq-threading.html I am sure libpq should be thread safe and on startup the authmilter verifies that the libpq is indeed thread safe. If you think you could set up a test enviroment using: - Sendmail (or maybe Postfix) - authmilter - PostgreSQL here is an authmilter-simplied.tar.gz package for you: http://www.helsinki.fi/~vmkari/authmilter-simplified.tar.gz The README file contains a rough instructions outline on how to setup things in order to reproduce the strange hanging bug. Please note that when running two test message sender scripts in parallel, the bug does not occur immediately, but only after between 1 to 5 minutes of processing. Sometimes it may take even longer. I managed to set that up and got it running. But it works fine for me, does not hang. I'd suggest poking around with gdb, to see where it hangs. Also, run "select * from pg_stat_activity" from a psql session to see what's happening inside the database. log_connections=on and log_disconnections=on would also be a good idea. PS. You'll need to escape the strings in the queries, to avoid SQL injection. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Strange hanging bug in a simple milter
Hello PostgreSQL gurus, (I have already posted a very similar message to comp.mail.sendmail newsgroup on August 22nd, but I haven't received any responses there. I have also tried pgsql-interfa...@postgresql.org but to no avail. Solving this problem requires some Sendmail/Postfix experience because the MTA needs to be configured to use the authmilter. Also some basic Milter API knowledge is needed: https://www.milter.org/developers/api/index) I have come across a very strange bug and I do not understand why it is occurring. Fortunately it is reproducible in a pretty deterministic manner. The goal: to create a milter for limiting how many nrcpts an authenticated user can send in a 24 hour time interval. The related data is stored in a PostgreSQL database. The problem: my alpha-stage code for authmilter simply hangs when processing two concurrent connections from Sendmail. For authmilter-simplified, I have removed the callbacks envfrom and envrcpt, because they are not needed in order to demonstrate the bug. Basically all that the authmilter now does is to connect to PostgreSQL in authmilt_connect() and close the connection in authmilt_close(). Based on the authmilter debug logging it seems to me that when the hanging occurs, the authmilter never completes PQsetdbLogin(). Based on the document http://www.postgresql.org/docs/9.1/interactive/libpq-threading.html I am sure libpq should be thread safe and on startup the authmilter verifies that the libpq is indeed thread safe. If you think you could set up a test enviroment using: - Sendmail (or maybe Postfix) - authmilter - PostgreSQL here is an authmilter-simplied.tar.gz package for you: http://www.helsinki.fi/~vmkari/authmilter-simplified.tar.gz The README file contains a rough instructions outline on how to setup things in order to reproduce the strange hanging bug. Please note that when running two test message sender scripts in parallel, the bug does not occur immediately, but only after between 1 to 5 minutes of processing. Sometimes it may take even longer. I have tested authmilter on Ubuntu Linux 12.04 having packages: libmilter-dev 8.14.4-2ubuntu2 libmilter1.0.1 8.14.4-2ubuntu2 libpq-dev 9.1.9-0ubuntu12.04 libpq5 9.1.9-0ubuntu12.04 postgresql-9.1 9.1.9-0ubuntu12.04 postgresql-client-9.1 9.1.9-0ubuntu12.04 postgresql-client-common 129ubuntu1 postgresql-common 129ubuntu1 postgresql-contrib-9.1 9.1.9-0ubuntu12.04 postgresql-doc-9.1 9.1.9-0ubuntu12.04 postgresql-server-dev-9.1 9.1.9-0ubuntu12.04 postgresql-server-dev-all 129ubuntu1 sendmail 8.14.4-2ubuntu2 sendmail-base 8.14.4-2ubuntu2 sendmail-bin 8.14.4-2ubuntu2 sendmail-cf 8.14.4-2ubuntu2 sendmail-doc 8.14.4-2ubuntu2 I suspect there could be something wrong with libpq and libmilter working together, but I am not sure. Many thanks for any help you can provide. PS. I installed PostgreSQL 9.3rc1 and linked my milter against the libpq that comes with that version. The hanging bug still occurs. Regards, vmk -- Tietotekniikkakeskus / Helsingin yliopisto IT department / University of Helsinki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers