[HACKERS] curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

2007-09-19 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>>> ! ERROR:  could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>
>> Is that repeatable?  What sort of filesystem are you testing on?
>> (soft-mounted NFS by any chance?)
>
> doesn't seem to be repeatable :-(

Hmm ... 
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01

Exact same error --- is it at the same place in the tests where you saw it?

Now that I think about it, there have been similar transient failures
(read only 0 of 8192 bytes) in the buildfarm before.  It would be
helpful to collect a list of exactly which build reports contain
that string, but AFAIK there's no very easy way to do that; Andrew,
any suggestions?

regards, tom lane
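
One possible way to collect such a list, sketched here under assumptions: if the build report logs were available as plain text, a regular expression for the error string could be applied to each one. The record layout (dicts with `sysname`, `stage`, `snapshot`, `log` keys), the function name, and the sample animals below are hypothetical and are not the buildfarm's actual schema or data; only the error text is taken from the thread.

```python
import re

# Matches the transient error discussed above, with any byte counts.
SHORT_READ = re.compile(r"read only \d+ of \d+ bytes")


def reports_with_short_read(reports):
    """Return (sysname, stage, snapshot) for every report whose log matches.

    `reports` is a hypothetical iterable of dicts; this layout is an
    assumption made for illustration, not the buildfarm's real schema.
    """
    return [
        (r["sysname"], r["stage"], r["snapshot"])
        for r in reports
        if SHORT_READ.search(r["log"])
    ]


# Illustrative records only: made-up animal names and timestamps.
sample = [
    {
        "sysname": "animal_a",
        "stage": "InstallCheck",
        "snapshot": "2007-01-01 00:00:00",
        "log": "ERROR:  could not read block 2 of relation "
               "1663/16384/2606: read only 0 of 8192 bytes",
    },
    {
        "sysname": "animal_b",
        "stage": "Check",
        "snapshot": "2007-01-02 00:00:00",
        "log": "All tests passed.",
    },
]

matches = reports_with_short_read(sample)
```

Matching on the digits rather than the literal "0 of 8192" keeps the search from missing short reads that returned a different byte count.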

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

2007-09-19 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
> $$read only \d+ of \d+ bytes$$;
>   sysname   |    stage     |      snapshot
> ------------+--------------+---------------------
>  zebra      | InstallCheck | 2007-09-11 10:25:03
>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>  baiji      | InstallCheck | 2007-09-12 22:39:24
>  luna_moth  | InstallCheck | 2007-09-19 13:10:01
> (4 rows)

Fascinating.  So I would venture that (1) it's definitely our bug,
not something we could blame on NFS or whatever, and (2) we introduced
it fairly recently.  That specific error message wording exists only
in HEAD, but it's been there since 2007-01-03, so if there were a
pre-existing problem you'd think there would be some more matches.

The patterns I notice here are (1) they're all InstallCheck not Check
failures; (2) though not all at the same place in the tests, it's
a fairly short range; (3) it's all references to system catalogs,
though not all the same one.

My gut feeling is that we're seeing autovacuum truncate off an empty end
block and then a backend tries to reference that block again.  But there
should be enough interlocks in place to prevent such references.  Any
ideas out there?

regards, tom lane
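
The symptom in the hypothesis above can be reproduced at the plain file level, which may help show why a truncation would surface as "read only 0 of 8192 bytes". This is a minimal sketch using an ordinary temporary file, not PostgreSQL's storage code; `BLCKSZ`, `read_block`, and `demo_short_read` are names chosen for illustration, and it says nothing about which interlock, if any, is actually missing.

```python
import os
import tempfile

BLCKSZ = 8192  # matches the "8192 bytes" in the error message


def read_block(path, blkno):
    """Read block `blkno` and return whatever bytes the file still holds."""
    with open(path, "rb") as f:
        f.seek(blkno * BLCKSZ)
        return f.read(BLCKSZ)


def demo_short_read():
    """Return the byte counts read for block 2 before and after truncation."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * (3 * BLCKSZ))  # blocks 0, 1 and 2
        before = len(read_block(path, 2))
        os.truncate(path, 2 * BLCKSZ)      # drop the last block, as a truncation would
        after = len(read_block(path, 2))   # a stale reference to block 2
        return before, after
    finally:
        os.remove(path)
```

A reader that still believes block 2 exists gets zero bytes back rather than an I/O error, which is the shape of the failure reported in the thread.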
