Re: [HACKERS] [bug fix] PITR corrupts the database cluster

Heikki Linnakangas Wed, 24 Jul 2013 06:15:48 -0700

Andres Freund <and...@2ndquadrant.com> wrote:
>On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote:
>> Andres Freund <and...@2ndquadrant.com> wrote:
>> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote:
>> >> > <Approach 2>
>> >> What we imo could do would be to drop the tablespaces in a *separate*
>> >> transaction *after* the transaction that removed the pg_tablespace
>> >> entry. Then an "incomplete actions" logic similar to btree and gin could
>> >> be used to remove the database directory if we crashed between the two
>> >> transactions.
>> >> 
>> >> SO:
>> >> TXN1 does:
>> >> * remove catalog entries
>> >> * drop buffers
>> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN)
>> >> 
>> >> TXN2:
>> >> * remove_dbtablespaces
>> >> * XLogInsert(XLOG_DBASE_DROP_FINISH)
>> >> 
>> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup callback
>> >> (which would perform TXN2 if we failed in between) and a
>> >> rm_safe_restartpoint which would prevent restartpoints from occurring on
>> >> standby between both.
>> >> 
>> >> The same should probably be done for CREATE DATABASE because that currently
>> >> can result in partially copied databases lying around.
>> >
>> >And CREATE/DROP TABLESPACE.
>> >
>> >Not really related, but CREATE DATABASE's implementation makes me itch
>> >every time I read parts of it...
>> 
>> I've been hoping that we could get rid of the rm_cleanup mechanism
>> entirely. I eliminated it for gist a while back, and I've been thinking
>> of doing the same for gin and btree. The way it works currently is
>> buggy - while we have rm_safe_restartpoint to avoid creating a
>> restartpoint at a bad moment, there is nothing to stop you from running
>> a checkpoint while incomplete actions are pending. It's possible that
>> there are page locks or something that prevent it in practice, but it
>> feels shaky.
>>
>> So I'd prefer a solution that doesn't rely on rm_cleanup.
>> Piggybacking on commit record seems ok to me, though if we're going to
>> have a lot of different things to attach there, maybe we need to
>> generalize it somehow. Like, allow resource managers to attach
>> arbitrary payload to the commit record, and provide a new
>> rm_redo_commit function to replay them.
>
>The problem is that piggybacking on the commit record doesn't really fix
>the problem that we end up with a bad state if we crash in a bad moment.
>
>For CREATE DATABASE you will have to copy the template database *before*
>you commit the pg_database insert. Which means if we abort before that
>we have old data in the datadir.
>
>For DROP DATABASE, without something like incomplete actions,
>piggybacking on the commit record doesn't solve the issue of CHECKPOINTS
>either, because the commit record you piggybacked on could have
>committed before a checkpoint, while you still were busy deleting all
>the files.


That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. if
you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's
just a waste of space, everything still works.

It would be nice to fix that leak, for tables and indexes too...
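
To make the leak concrete, here is a toy model in Python. It is only a
sketch and not PostgreSQL code: the ToyCluster class, its catalog set, and
the crash_before_commit flag are all made up for illustration. The one
assumption it encodes is that the relation file is created on disk before
the catalog entry becomes visible at commit, so a crash in between leaves
a file that nothing references.

```python
import os
import tempfile


class ToyCluster:
    """Hypothetical model: files are created eagerly, catalog rows only at commit."""

    def __init__(self, datadir):
        self.datadir = datadir
        self.catalog = set()  # names of relations whose creation committed

    def create_table(self, name, crash_before_commit=False):
        # The file is created right away, before the transaction commits.
        path = os.path.join(self.datadir, name)
        with open(path, "w"):
            pass
        if crash_before_commit:
            # Simulated crash: the catalog entry never becomes visible,
            # and nothing removes the file either.
            return
        self.catalog.add(name)

    def orphans(self):
        # Files on disk that no committed catalog entry refers to.
        return sorted(set(os.listdir(self.datadir)) - self.catalog)


def demo():
    with tempfile.TemporaryDirectory() as datadir:
        cluster = ToyCluster(datadir)
        cluster.create_table("t_ok")
        cluster.create_table("t_leaked", crash_before_commit=True)
        return cluster.orphans(), sorted(cluster.catalog)
```

In this model demo() reports t_leaked as an orphan while t_ok stays
usable, which is the "waste of space, everything still works" case.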


- Heikki


