Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

MauMau Mon, 10 Jun 2013 07:05:43 -0700

From: "Craig Ringer" <cr...@2ndquadrant.com>

The problem is that WAL for all tablespaces is mixed together in the
archives. If you lose your tablespace then you have to keep *all* WAL
around and replay *all* of it again when the tablespace comes back
online. This would be very inefficient, would require a lot of tricks to
cope with applying WAL to a database that has an on-disk state in the
future as far as the archives are concerned. It's not as simple as just
replaying all WAL all over again - as I understand it, things like
CLUSTER or TRUNCATE will result in relfilenodes not being where they're
expected to be as far as old WAL archives are concerned. Selective
replay would be required, and that leaves the door open to all sorts of
new and exciting bugs in areas that'd hardly ever get tested.

Although I still lack understanding of PostgreSQL implementation, I have anoptimistic feeling that such complexity would not be required. While atablespace is offline, subsequent access from new or existing transactionsis rejected with an error "tablespace is offline". So new WAL records wouldnot be generated for the offline tablespace. To take the tablespace backonline, the DBA performs per-tablespace archive recovery. Per-tablespacearchive recovery restores tablespace data files from the backup, then readthrough archive and pg_xlog/ WAL as usual, and selectively applies WALrecords for the tablespace.

I don't think it's a "must-be-fixed" problem that the WAL for alltablespaces is mixed in one location. I suppose we can tolerate thatarchive recovery takes a long time.

To solve the massive disk space explosion problem I imagine we'd have to
have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
and loss of the rather nice property that WAL writes are nice sequential
writes. It'd be complicated and probably cause nightmares during
recovery, for archive-based replication, etc.

Per-tablespace WAL is very interesting for another reason -- massive-scaleOLTP for database consolidation. This feature would certainly be abreakthrough for amazing performance, because WAL is usually the lastbottleneck in OLTP. Yes, I can imagine recovery would be much, much morecomplicated,.

None of these options seem exactly simple or pretty, especially given
the additional complexities that'd be involved in allowing WAL records
to be applied out-of-order, something that AFAIK _never_h happens at the
moment.

As I mentioned above, in my shallow understanding, it seems that theadditional complexities can be controlled.

The key problem, of course, is that this all sounds like a lot of
complicated work for a case that's not really supposed to happen. Right
now, the answer is "your database is unrecoverable, switch to your
streaming warm standby and re-seed it from the standby". Not pretty, but

at least there's the option of using a sync standby and avoiding dataloss.


Sync standby... maybe.  Let me consider this.

How would you approach this?

Thanks Craig, you gave me some interesting insights. All of these topicsare interesting, and I'd like to work on them when I have acquired enoughknowledge and experience in PostgreSQL development.


Regards
MauMau





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

Reply via email to