We have seen a few reports (e.g., from Hervé Piedvache) of VACUUM FULL
in 7.2 producing messages like

dbfr=# VACUUM FULL VERBOSE ANALYZE pg_class ;
NOTICE:  --Relation pg_class--
NOTICE:  Rel pg_class: Uninitialized page 9 - fixing
NOTICE:  Rel pg_class: Uninitialized page 10 - fixing
NOTICE:  Rel pg_class: Uninitialized page 11 - fixing
NOTICE:  Rel pg_class: Uninitialized page 12 - fixing
NOTICE:  Rel pg_class: Uninitialized page 13 - fixing
NOTICE:  Rel pg_class: Uninitialized page 14 - fixing
NOTICE:  Rel pg_class: Uninitialized page 15 - fixing
NOTICE:  Rel pg_class: Uninitialized page 16 - fixing
NOTICE:  Rel pg_class: Uninitialized page 17 - fixing
NOTICE:  Rel pg_class: Uninitialized page 18 - fixing
NOTICE:  Rel pg_class: Uninitialized page 19 - fixing
NOTICE:  Rel pg_class: Uninitialized page 20 - fixing
NOTICE:  Rel pg_class: Uninitialized page 21 - fixing
NOTICE:  Rel pg_class: Uninitialized page 22 - fixing
NOTICE:  Rel pg_class: Uninitialized page 23 - fixing
...

I had originally suspected hardware problems, but Hervé told me today
that he was still seeing this behavior after moving to a new machine.
So I went digging for an explanation --- and I found one.  I've been
able to reproduce the above behavior by issuing repeated table creations
in one backend while another backend does occasional VACUUM FULLs on
pg_class.

The fundamental problem is that for nailed-in-cache relations like
pg_class, RelationClearRelation() does not want to release the cache
entry.  In 7.2 it doesn't do anything except close the smgr file for
the relation and return.  But RelationClearRelation is what gets called
to implement a relcache flush from an SI message.  This means that
nothing much happens in other backends when a VACUUM transmits a
relcache flush message for a nailed-in-cache relation.  In particular,
they fail to update their rd_targblock and rd_nblocks fields.  So the
scenario goes like this:

1. Backend A has done a lot of inserts/deletes in pg_class.  Its
rd_targblock field points out somewhere near the end of the table.

2. Backend B does a VACUUM FULL, gets rid of lots of space, and shrinks
pg_class.

3. Backend A does nothing in response to B's SI message, so its
rd_targblock field now points past the end of the table.

4. Backend A now tries to insert another pg_class row.  In
RelationGetBufferForTuple(), it reads the rd_targblock page, locks it,
checks it for free space.  md.c will allow the read to occur even though
it's past the current EOF of the table; it will return a zeroed page.  The
check for free space will act as though there is zero free space
available, so RelationGetBufferForTuple releases the buffer and goes to
find another page where there's space.  No problem ... yet.

5. The trouble is that the bufmgr now has a live buffer for a page
that's past the end of pg_class.  What's more, it thinks the page is
dirty (because the mere act of obtaining an exclusive buffer lock on
the page sets cntxDirty).  Eventually, the bufmgr will want to recycle
that buffer for some other use, and at that point it writes out the
buffer.  Presto, a page of zeroes.  In fact possibly many pages of
zeroes --- if the rd_targblock was more than one block past the new
actual EOF, standard Unixen will accept the write and will silently
fill the intervening file space with zeroes (or make it look like 
they did, anyway).
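
The zero-filling in step 5 can be reproduced outside the backend.  The
following is a minimal sketch, not PostgreSQL code: it assumes a POSIX
system with a writable /tmp, and the helper names (write_block,
file_nblocks, block_is_zero, simulate_stale_write) are made up for
illustration.  It writes one zeroed page at a stale target block past EOF
and then counts how many all-zero pages the file contains.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* for pread, pwrite, mkstemp */
#endif

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */

/* Write one page at block blkno, whether or not that is past EOF. */
static int
write_block(int fd, long blkno, const char *page)
{
    assert(page != NULL);
    return pwrite(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ ? 0 : -1;
}

/* File length in blocks, or -1 on error. */
static long
file_nblocks(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);

    return (len < 0) ? -1 : (long) (len / BLCKSZ);
}

/* 1 if the block is all zeroes, 0 if not, -1 on read error. */
static int
block_is_zero(int fd, long blkno)
{
    char    page[BLCKSZ];
    int     i;

    if (pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
        return -1;
    for (i = 0; i < BLCKSZ; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/*
 * Hypothetical demo: build a file of nblocks_after_vacuum non-zero pages,
 * then flush one zeroed page at stale_targblock, as the bufmgr would.
 * Returns the number of all-zero pages in the file afterwards, or -1.
 */
long
simulate_stale_write(long nblocks_after_vacuum, long stale_targblock)
{
    char    path[] = "/tmp/stale_targblock_XXXXXX";
    char    page[BLCKSZ];
    long    blk, total, zeroes = 0;
    int     fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* file goes away when fd is closed */

    memset(page, 'x', sizeof(page));
    for (blk = 0; blk < nblocks_after_vacuum; blk++)
        if (write_block(fd, blk, page) != 0)
            goto fail;

    memset(page, 0, sizeof(page));
    if (write_block(fd, stale_targblock, page) != 0)
        goto fail;

    total = file_nblocks(fd);
    if (total < 0)
        goto fail;
    for (blk = 0; blk < total; blk++)
    {
        int r = block_is_zero(fd, blk);

        if (r < 0)
            goto fail;
        zeroes += r;
    }
    close(fd);
    return zeroes;

fail:
    close(fd);
    return -1;
}
```

With a table shrunk to 9 blocks and a stale target block of 23,
simulate_stale_write(9, 23) should report 15 zero pages, i.e. blocks 9
through 23, which is the same range the VACUUM output above complains
about.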

There isn't any serious consequence of this problem, other than that the
next VACUUM will issue some "Uninitialized page" messages, so I'm not
feeling that we need a 7.2.4 to fix it in the 7.2 series.  But it needs
to be fixed.

The good news is that it is partly fixed already in 7.3, because in 7.3
RelationClearRelation does reset rd_targblock for nailed-in relations.
So I believe the problem cannot occur in this form anymore.  But I am
also thinking that it's a really bad idea for mdread to allow reads from
beyond EOF --- that's just asking for trouble.  Can anyone see a reason
not to remove the special-case at line 440 in md.c?
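
To make the proposal concrete, here is a toy in-memory model of the two
read policies.  It is only a sketch: ToyRelFile, toy_read_lenient and
toy_read_strict are invented names, the page size is shrunk for
convenience, and none of this is the actual md.c code.  The lenient
variant mimics the behavior described above (hand back a zeroed page for
a block past EOF); the strict variant refuses the read.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLCKSZ      64      /* toy page size, not the real BLCKSZ */
#define TOY_MAX_BLOCKS  32

enum { TOY_OK = 0, TOY_ERR_PAST_EOF = -1 };

/* Hypothetical stand-in for a relation's file on disk. */
typedef struct ToyRelFile
{
    long    nblocks;                            /* current EOF, in blocks */
    char    pages[TOY_MAX_BLOCKS][TOY_BLCKSZ];
} ToyRelFile;

/* Create a toy file with nblocks pages of non-zero data. */
void
toy_init(ToyRelFile *rel, long nblocks)
{
    assert(nblocks >= 0 && nblocks <= TOY_MAX_BLOCKS);
    rel->nblocks = nblocks;
    memset(rel->pages, 'x', sizeof(rel->pages));
}

/* Lenient policy: a read past EOF quietly yields a page of zeroes. */
int
toy_read_lenient(const ToyRelFile *rel, long blkno, char *buf)
{
    if (blkno < 0 || blkno >= rel->nblocks)
    {
        memset(buf, 0, TOY_BLCKSZ);
        return TOY_OK;
    }
    memcpy(buf, rel->pages[blkno], TOY_BLCKSZ);
    return TOY_OK;
}

/* Strict policy: a read past EOF is reported as an error. */
int
toy_read_strict(const ToyRelFile *rel, long blkno, char *buf)
{
    if (blkno < 0 || blkno >= rel->nblocks)
        return TOY_ERR_PAST_EOF;
    memcpy(buf, rel->pages[blkno], TOY_BLCKSZ);
    return TOY_OK;
}
```

Under the strict policy a caller holding a stale block number finds out
immediately, instead of getting a plausible-looking empty page that can
later be written back.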

It'd probably also be a good idea to decouple setting cntxDirty from
acquiring exclusive buffer lock.  As things stand, when
RelationGetBufferForTuple finds there's not enough space on a target
page, it has still set cntxDirty, thereby triggering an unnecessary write
of that page.  In many cases the page would be dirty already, but it's
ugly nonetheless ... and it is a contributing factor in this bug.
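
A rough sketch of what the decoupling might look like, using a toy buffer
rather than the real bufmgr structures.  ToyBuffer and every function
below are hypothetical; the only point is that in the decoupled variant
the dirty flag is set when the page is actually modified, not when the
exclusive lock is taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared buffer holding one page. */
typedef struct ToyBuffer
{
    int     free_space;     /* bytes free on the page */
    bool    locked;         /* exclusive lock held? */
    bool    dirty;          /* must be written before recycling? */
} ToyBuffer;

/* Coupled behavior: taking the exclusive lock also marks the page dirty. */
void
toy_lock_exclusive_coupled(ToyBuffer *buf)
{
    buf->locked = true;
    buf->dirty = true;
}

/* Decoupled behavior: locking alone says nothing about dirtiness. */
void
toy_lock_exclusive(ToyBuffer *buf)
{
    buf->locked = true;
}

void
toy_unlock(ToyBuffer *buf)
{
    buf->locked = false;
}

/* The caller states explicitly that it has changed the page. */
void
toy_mark_dirty(ToyBuffer *buf)
{
    assert(buf->locked);
    buf->dirty = true;
}

/* Try to place a tuple of len bytes; page ends up dirty either way. */
bool
toy_try_insert_coupled(ToyBuffer *buf, int len)
{
    bool    fits;

    toy_lock_exclusive_coupled(buf);
    fits = (buf->free_space >= len);
    if (fits)
        buf->free_space -= len;
    toy_unlock(buf);
    return fits;
}

/* Same, but the page is marked dirty only if the tuple was stored. */
bool
toy_try_insert_decoupled(ToyBuffer *buf, int len)
{
    bool    fits;

    toy_lock_exclusive(buf);
    fits = (buf->free_space >= len);
    if (fits)
    {
        buf->free_space -= len;
        toy_mark_dirty(buf);
    }
    toy_unlock(buf);
    return fits;
}
```

In this model a failed free-space check on a zeroed (full-looking) page
leaves the buffer clean in the decoupled case, so there would be nothing
to write out when the buffer is recycled.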

                        regards, tom lane
