I wrote: > ... lookee what we have here: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-24%2004:00:07
Modifying the code to print the actual timestamps resulted in a wealth of information that can be dredged from the postmaster log here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-25%2004:00:04 Just grep for log messages like [4baafb13.cd2:1] LOG: last_statrequest 2010-03-25 06:58:42.118734+01 is later than collector's time 2010-03-25 06:58:40.817027+01 What this shows is that system timekeeping on jaguar is just utterly broken. gettimeofday() sometimes returns answers that are 5 seconds or more skewed in different processes, and there are episodes where its value fails to move at all in the stats collector while multiple seconds elapse in the opinion of another process. I wonder whether this doesn't also explain jaguar's propensity to show some other unexplainable failures, like these recent examples: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-20%2004:00:03 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2010-03-15%2004:00:04 which are not at all unique in that machine's buildfarm history. I had always supposed that these were caused by the extreme overall slowness induced by CLOBBER_CACHE_ALWAYS, but now I think they are more likely explained by system timekeeping issues. In particular the last-cited regression failure can be explained if now() fails to change value across pg_sleep(0.1), and that's isomorphic to what seems to be happening in the stats collector in some of the episodes captured today. In short: either this machine has got broken clock hardware, or there's a very serious bug in the kernel code for reading the clock. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers