I wrote:
> ... lookee what we have here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-24%2004:00:07

Modifying the code to print the actual timestamps resulted in a wealth
of information that can be dredged from the postmaster log here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-25%2004:00:04
Just grep for log messages like
[4baafb13.cd2:1] LOG:  last_statrequest 2010-03-25 06:58:42.118734+01 is later 
than collector's time 2010-03-25 06:58:40.817027+01

What this shows is that system timekeeping on jaguar is just utterly
broken.  gettimeofday() sometimes returns answers that are 5 seconds or
more skewed in different processes, and there are episodes where its
value fails to move at all in the stats collector while multiple seconds
elapse in the opinion of another process.

I wonder whether this doesn't also explain jaguar's propensity to show
some other unexplainable failures, like these recent examples:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-20%2004:00:03
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-15%2004:00:04
which are not at all unique in that machine's buildfarm history.

I had always supposed that these were caused by the extreme overall
slowness induced by CLOBBER_CACHE_ALWAYS, but now I think they are
more likely explained by system timekeeping issues.  In particular the
last-cited regression failure can be explained if now() fails to change
value across pg_sleep(0.1), and that's isomorphic to what seems to be
happening in the stats collector in some of the episodes captured today.

In short: either this machine has got broken clock hardware, or there's
a very serious bug in the kernel code for reading the clock.

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to