On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:
> Jeff Janes wrote:
> 
> > 12538  2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
> > 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 
> > This problem was initially fairly easy to reproduce, but since I
> > started adding instrumentation specifically to catch it, it has become
> > devilishly hard to reproduce.
> 
> I think I see the problem here now, after letting this test rig run for
> a while.
> 
> First, the fact that there are holes in members/ files because of the
> random order in deletion, in itself, seems harmless, because the files
> remaining in between will be deleted by a future vacuum.
> 
> Now, the real problem is that we delete files during vacuum, but the
> state that marks those files as safely deletable is written as part of a
> checkpoint record, not by vacuum itself (vacuum writes its state in
> pg_database, but a checkpoint derives its info from a shared memory
> variable.)  Taken together, this means that if there's a crash between
> the vacuum that does a deletion and the next checkpoint, we might
> attempt to read an offset file that is not supposed to be part of the
> live range -- but we forgot that because we didn't reach the point where
> we save the shmem state to disk.

> It seems to me that we need to keep the offsets files around until a
> checkpoint has written the "oldest" number to WAL.  In other words we
> need additional state in shared memory: (a) what we currently store
> which is the oldest number as computed by vacuum (not safe to delete,
> but it's the number that the next checkpoint must write), and (b) the
> oldest number that the last checkpoint wrote (the safe deletion point).
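
(To make sure I'm reading that right: a purely illustrative sketch of the
two counters against MultiXactStateData; oldestMultiXactId exists today,
lastCheckpointedOldest is a made-up name.)

/*
 * Illustrative sketch only -- not the actual MultiXactStateData layout.
 */
typedef struct MultiXactStateData
{
	/* ... existing fields such as nextMXact, nextOffset, ... */

	/*
	 * (a) Oldest multixact still needed, as computed by vacuum.  This is
	 * what the next checkpoint must write out; it is not yet safe to
	 * delete segments based on it.
	 */
	MultiXactId oldestMultiXactId;

	/*
	 * (b) Hypothetical: the oldest value the last completed checkpoint
	 * actually wrote.  Segments older than this are safe to delete,
	 * because after a crash we can never be asked to read them again.
	 */
	MultiXactId lastCheckpointedOldest;
} MultiXactStateData;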

Why not just WAL-log truncations? If we emit the WAL record after
determining the offsets page, we should be safe, I think. That seems
like an easier and more robust fix, and it's what e.g. the clog does.
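
(A rough sketch of what that record could look like with the current
rdata-based XLogInsert() API, modeled on clog.c's WriteTruncateXlogRec();
the XLOG_MULTIXACT_TRUNCATE info value and the function name are made up
here, since multixact has no truncation record today -- which is exactly
the point:)

/*
 * Sketch: WAL-log a multixact truncation before unlinking any segments,
 * the same way clog logs CLOG_TRUNCATE.
 */
static void
WriteMTruncateXlogRec(MultiXactId oldestMulti, MultiXactOffset oldestOffset)
{
	XLogRecData rdata[2];

	rdata[0].data = (char *) &oldestMulti;
	rdata[0].len = sizeof(MultiXactId);
	rdata[0].buffer = InvalidBuffer;
	rdata[0].next = &(rdata[1]);

	rdata[1].data = (char *) &oldestOffset;
	rdata[1].len = sizeof(MultiXactOffset);
	rdata[1].buffer = InvalidBuffer;
	rdata[1].next = NULL;

	(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE, rdata);
}

The redo routine would just re-perform the truncation, so a crash between
vacuum's unlink and the next checkpoint can no longer leave us thinking
that already-removed segments are part of the live range.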

> Another thing I noticed is that more than one vacuum process can try to
> run deletion simultaneously, at least if they crash frequently while
> they are doing deletion.  I don't see that this is troublesome, even
> though they might attempt to delete the same files.

That actually seems bad to me. It might happen not to fail in practice,
but still.

> Finally, I noticed that we first read the oldest offset file, then
> determine the member files to delete; then delete offset files, then
> delete member files.  Which means that we would delete offset files that
> correspond to member files that we keep (assuming there is a crash in
> the middle of deletion).  It seems to me we should first delete members,
> then delete offsets, if we wanted to be very safe about it.  I don't
> think this really would matter much, if we were to do things safely as
> described above.

Part of that seems to be solvable by WAL-logging truncations. But I
also think that we need more robust tracking of the oldest member
offset - we still aren't safe against member wraparounds. And I don't
see how we can be without explicitly tracking the oldest member offset
instead of the ugly 'search for the oldest offsets segment and map that
to members' thing we're doing now.
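
(Again just a sketch with made-up names of what I mean by explicit
tracking; find_multixact_start() here is only a placeholder for "read the
starting member offset of that multi from the offsets SLRU":)

/*
 * Sketch: remember the oldest members offset in shared memory whenever
 * vacuum advances the oldest multixact, instead of reverse-engineering
 * it from whatever offsets segments happen to exist on disk.
 */
static void
SetOldestMemberOffset(MultiXactId oldestMulti)
{
	MultiXactOffset oldestOffset;

	/* where do the members of the oldest still-needed multi start? */
	oldestOffset = find_multixact_start(oldestMulti);

	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
	MultiXactState->oldestOffset = oldestOffset;
	LWLockRelease(MultiXactGenLock);
}

Member-wraparound checks could then compare nextOffset against that value
directly.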

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

