Andres Freund <and...@2ndquadrant.com> wrote: >On 2013-07-24 15:45:52 +0300, Heikki Linnakangas wrote: >> Andres Freund <and...@2ndquadrant.com> wrote: >> >On 2013-07-24 12:59:43 +0200, Andres Freund wrote: >> >> > <Approach 2> >> >> What we imo could do would be to drop the tablespaces in a >*separate* >> >> transaction *after* the transaction that removed the pg_tablespace >> >> entry. Then an "incomplete actions" logic similar to btree and gin >> >could >> >> be used to remove the database directory if we crashed between the >> >two >> >> transactions. >> >> >> >> SO: >> >> TXN1 does: >> >> * remove catalog entries >> >> * drop buffers >> >> * XLogInsert(XLOG_DBASE_DROP_BEGIN) >> >> >> >> TXN2: >> >> * remove_dbtablespaces >> >> * XLogInsert(XLOG_DBASE_DROP_FINISH) >> >> >> >> The RM_DBASE_ID resource manager would then grow a rm_cleanup >> >callback >> >> (which would perform TXN2 if we failed inbetween) and a >> >> rm_safe_restartpoint which would prevent restartpoints from >occuring >> >on >> >> standby between both. >> >> >> >> The same should probably done for CREATE DATABASE because that >> >currently >> >> can result in partially copied databases lying around. >> > >> >And CREATE/DROP TABLESPACE. >> > >> >Not really related, but CREATE DATABASE's implementation makes me >itch >> >everytime I read parts of it... >> >> I've been hoping that we could get rid of the rm_cleanup mechanism >entirely. I eliminated it for gist a while back, and I've been thinking >of doing the same for gin and btree. The way it works currently is >buggy - while we have rm_safe_restartpoint to avoid creating a >restartpoint at a bad moment, there is nothing to stop you from running >a checkpoint while incomplete actions are pending. It's possible that >there are page locks or something that prevent it in practice, but it >feels shaky. >> >> So I'd prefer a solution that doesn't rely on rm_cleanup. >Piggybacking on commit record seems ok to me, though if we're going to >have a lot of different things to attach there, maybe we need to >generalize it somehow. Like, allow resource managers to attach >arbitrary payload to the commit record, and provide a new >rm_redo_commit function to replay them. > >The problem is that piggybacking on the commit record doesn't really >fix >the problem that we end up with a bad state if we crash in a bad >moment. > >For CREATE DATABASE you will have to copy the template database >*before* >you commit the pg_database insert. Which means if we abort before that >we have old data in the datadir. > >For DROP DATABASE, without something like incomplete actions, >piggybacking on the commit record doesn't solve the issue of >CHECKPOINTS >either, because the commit record you piggybacked on could have >committed before a checkpoint, while you still were busy deleting all >the files.
That's no different from CREATE TABLE / INDEX and DROP TABLE / INDEX. E.g. If you crash after CREATE TABLE but before COMMIT, the file is leaked. But it's just a waste of space, everything still works. It would be nice to fix that leak, for tables and indexes too... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers