Re: [HACKERS] jacana hung after failing to acquire random number

Heikki Linnakangas Mon, 12 Dec 2016 06:02:53 -0800

On 12/12/2016 03:40 PM, Andrew Dunstan wrote:



On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:

On 12/12/2016 05:58 AM, Michael Paquier wrote:

On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <[email protected]>
wrote:


jacana (mingw, 64 bit compiler, no openssl) is currently hung on "make
check". After starting the autovacuum launcher there are 120 messages in
its log about "Could not acquire random number". Then nothing.


So I suspect the problem here is commit
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
detail.


Shouldn't we want the postmaster to shut down if it's not going to go
further? Note that frogmouth, also mingw, which builds with openssl,
doesn't have this issue.


Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15

And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem specific to MinGW. I am starting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context every time is definitely not good performance-wise though.


Actually, looking at the config.log on jacana, it's trying to use
/dev/urandom:

configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom

And looking closer at configure.in, I can see why:

  elif test "$PORTNAME" = x"win32" ; then
    USE_WIN32_RANDOM=1

That test is broken. It looks like the x"$VAR" = x"constant" idiom,
but the left side of the comparison doesn't have the 'x'. Oops.
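
For anyone who wants to see the effect in isolation, here is a minimal
sketch of the mismatch described above. The PORTNAME value is set by hand
here as a stand-in for what configure would set on a Windows build, and
the variable names broken_result and fixed_result are made up for the
illustration. The corrected form shown puts the 'x' on both sides;
dropping it from both sides would work equally well, and this sketch does
not claim to be the exact change that was committed.

```shell
#!/bin/sh
# Stand-in for the value configure assigns on a Windows build (assumption).
PORTNAME="win32"

# Broken form: only the right side carries the 'x' prefix, so this
# compares "win32" against "xwin32" and can never be true.
if test "$PORTNAME" = x"win32" ; then
  broken_result="matched"
else
  broken_result="not matched"
fi

# Corrected form: both sides carry the prefix, so the strings compare equal.
if test x"$PORTNAME" = x"win32" ; then
  fixed_result="matched"
else
  fixed_result="not matched"
fi

echo "broken test: $broken_result"
echo "fixed test:  $fixed_result"
```

Because the broken branch is never taken, configure falls through to the
next alternative, which is consistent with the /dev/urandom result in the
config.log excerpt above.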

Fixed that, let's see if it made jacana happy again.

This makes me wonder if we should work a bit harder to get a good
error message if acquiring a random number fails for any reason. This
needs to work in the frontend as well as the backend, but we could still
have an elog(LOG, ...) there, inside an #ifndef FRONTEND block.



I see you have now improved the messages in postmaster.c, which is good.

Well, I only wordsmithed them a bit; they still don't give much of a clue as to why it failed. We should add more details to them.

However, the bigger problem (ISTM) is that when this failed I had a
system which was running but where every connection immediately failed:

    ============== creating temporary instance            ==============
    ============== initializing database system           ==============
    ============== starting postmaster                    ==============

    pg_regress: postmaster did not respond within 120 seconds
    Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason
    make: *** [check] Error 2

Should one or more of these errors be fatal? Or should we at least get
pg_regress to try to shut down the postmaster if it can't connect after
120 seconds?

Making it fatal, i.e. bringing down the server, doesn't seem like an improvement. If the failure is transient, you don't want to kill the whole server when one connection attempt fails.

It would be nice to fail earlier if it's permanently failing, though. Like, if someone does "rm /dev/urandom". Perhaps we should perform one pg_strong_random() call at postmaster startup, and if that fails, refuse to start up.


- Heikki


