Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-22 18:19:25 -0600, Jim Nasby wrote: On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. That'd be noticeably expensive. Opening/stat'ing a file isn't cheap, especially if you do it via filename and not via fd, which we'd have to do. I am pretty sure it would be noticeable in single-client workloads, but it damned sure will be noticeable on busy multi-socket workloads. I still think doing the checks in the WAL writer is the best bet, setting a flag that can then cheaply be tested in shared memory. When set, it will cause any further action that would write xlog to error out unless it's already in progress. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
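A minimal sketch of the flag scheme Andres describes: the WAL writer does the expensive filesystem check in its own loop and publishes a flag; backends only test the flag. All names here (walSpaceLow, UpdateWalSpaceFlag, WalWriteAllowed) are invented for illustration and the flag stands in for real shared-memory state, not actual PostgreSQL code:

```c
#include <stdbool.h>

/* Sketch only: in PostgreSQL this flag would live in shared memory,
 * with the usual memory-ordering conventions. */
static volatile bool walSpaceLow = false;

/* Called periodically from the WAL writer's main loop: the expensive
 * free-space check happens here, out of the backends' hot path. */
static void
UpdateWalSpaceFlag(long free_bytes, long threshold)
{
    walSpaceLow = (free_bytes < threshold);
}

/* Cheap test a backend would make before starting any new action that
 * writes xlog; actions already in progress are allowed to finish.
 * A "false" result would become an ereport(ERROR) in the real thing. */
static bool
WalWriteAllowed(bool action_in_progress)
{
    return action_in_progress || !walSpaceLow;
}
```

The point of the split is that the stat()/statfs() cost is paid once per WAL-writer cycle rather than once per WAL record.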
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 23 January 2014 01:19, Jim Nasby j...@nasby.net wrote: On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. I grant that it does sound simple enough for a partial stop-gap. My concern is that it provides only a short delay before the eventual disk-full situation, which it doesn't actually prevent. IMHO the main issue now is how we clear down old WAL files. We need to perform a checkpoint to do that - and as has been pointed out in relation to my proposal, we cannot complete that because of locks that will be held for some time when we do eventually lock up. That issue is not solved by having ballast file(s). IMHO we need to resolve the deadlock inherent in the disk-full / WAL lock-up / checkpoint situation. My view is that it can be solved in a similar way to the way the buffer pin deadlock was resolved for Hot Standby. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-23 13:56:49 +0100, Simon Riggs wrote: IMHO we need to resolve the deadlock inherent in the disk-full / WAL lock-up / checkpoint situation. My view is that it can be solved in a similar way to the way the buffer pin deadlock was resolved for Hot Standby. I don't think that approach works here. We're not talking about mere buffer pins but the big bad exclusively locked buffer which is held by a backend in a critical section. Killing such a backend costs you a PANIC. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:23, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:30, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either, that writes abort records which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills, WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
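As a toy model of the two-step proposal above (all names invented, nothing like the real xlog code): treat writing WAL as consuming free bytes, treat the immediate checkpoint as reclaiming the bytes held by no-longer-needed segments, and PANIC only if the retry after the checkpoint still fails:

```c
/* Toy simulation of the proposal; DiskState and write_wal are
 * hypothetical stand-ins, not PostgreSQL functions. */
typedef struct
{
    long free_bytes;        /* space left on the pg_xlog filesystem */
    long reclaimable_bytes; /* old WAL a checkpoint would let us delete */
} DiskState;

typedef enum
{
    WAL_WRITTEN,                  /* normal path */
    WAL_WRITTEN_AFTER_CHECKPOINT, /* step 2 freed enough space */
    WAL_PANIC                     /* even the retry failed */
} WalOutcome;

static WalOutcome
write_wal(DiskState *d, long wal_needed)
{
    if (wal_needed <= d->free_bytes)
    {
        d->free_bytes -= wal_needed;
        return WAL_WRITTEN;
    }
    /* Instead of PANICing: an immediate checkpoint, completed via the
     * control file (step 1) so it needs no WAL itself, lets us delete
     * old segments. */
    d->free_bytes += d->reclaimable_bytes;
    d->reclaimable_bytes = 0;
    if (wal_needed <= d->free_bytes)
    {
        d->free_bytes -= wal_needed;
        return WAL_WRITTEN_AFTER_CHECKPOINT;
    }
    return WAL_PANIC;
}
```

The simulation makes the failure ordering explicit: PANIC becomes the last resort rather than the first response to ENOSPC.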
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Tom Lane t...@sss.pgh.pa.us wrote: Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote in between replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. 
Which I don't think exists today, since we either read the commit record in its entirety or not. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no behavior that wasn't there, unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). 
So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 14:25, Simon Riggs si...@2ndquadrant.com wrote: On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the Redesigning checkpoint_segments thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6s4-ppd+6jiy4oq78i...@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments. The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. That normally requires a WAL insertion to complete, but it seems easy enough to make that happen by simply rewriting the control file, after which ALL WAL files are superfluous for crash recovery and can be deleted. Once that checkpoint is complete, we can begin deleting WAL files that are archived/replicated and continue as normal. The previously failing WAL write can now be made again and may succeed this time - if it does, we continue, if not - now we PANIC. Note that this would not require in-progress transactions to be aborted. 
They can continue normally once wal_buffers re-opens. We don't really want anything too drastic, because if this situation happens once it may happen many times - I'm imagining a flaky network etc.. So we want the situation to recover quickly and easily, without too many consequences. The above appears to be very minimal change from existing code and doesn't introduce lots of new points of breakage. I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. Doing shutdown checkpoints via the control file would exactly solve that issue. We already depend upon the readability of the control file anyway, so this changes nothing. (And if you regard it does, then we can have multiple control files, or at least a backup control file at shutdown). We can make the shutdown checkpoint happen always at EOF of a WAL segment, so at shutdown we don't need any WAL files to remain at all. So we need to somehow stop new WAL insertions from happening, before it's too late. I don't think we do. What might be sensible is to have checkpoints speed up as WAL volume approaches a predefined limit, so that we minimise the delay caused when wal_buffers locks up. Not suggesting anything here for 9.4, since we're mid-CF. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. And if that fails for lack of disk space? In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 21 January 2014 18:35, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. Lock up momentarily, until the situation clears. But my proposal would allow the situation to fully clear, i.e. all WAL files could be deleted as soon as replication/archiving has caught up. The current behaviour doesn't automatically correct itself as this proposal would. My proposal is also fully in line with synchronous replication, and has zero performance overhead for mainline processing. My preference would be that we simply start failing writes with ERRORs rather than PANICs. Yes, that is what I am proposing, amongst other points. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? Yes, I think so. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. And if that fails for lack of disk space? I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space. In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. It's the smallest amount of change proposed so far... I agree on the danger of untested code. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Fwiw I think all transactions lock up until space appears is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Greg Stark st...@mit.edu writes: Fwiw I think all transactions lock up until space appears is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted. Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Jeff Janes jeff.ja...@gmail.com writes: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to reserve that space, such that we always have at least sum(outstanding reservations) of available future WAL space; (3) release our reservation as part of the actual XLogInsert call. The problem here is that the reserve function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. In the past I've thought that one approach that would eliminate concerns about portably and efficiently knowing how much space is left on the pg_xlog filesystem is to have a ballast file. Under this scheme, perhaps XLogInsert() could differentiate between a soft and hard failure. Hopefully the reserve function you mentioned, which is still called at the same place, just before each critical section, thereby becomes cheap. Perhaps I'm just restating what you said, though. -- Peter Geoghegan
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 18:24:39 -0500, Tom Lane wrote: Jeff Janes jeff.ja...@gmail.com writes: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. Well, I think that's already hard for the heapam.c stuff, but doing that in xact.c seems fracking hairy. We can't simply stop in the middle of a commit and not continue, that'd often grind the system to a halt preventing cleanup. Additionally the size of the inserted record for commits is essentially unbounded, which makes it an especially fun case. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to reserve that space, such that we always have at least sum(outstanding reservations) available future WAL space; (3) release our reservation as part of the actual XLogInsert call. I think that's not necessarily enough. In a COW filesystem like btrfs or ZFS you really cannot give much guarantees about writes succeeding or failing, even if we were able to create (and zero) a new segment.
Even if you disregard that, we'd need to keep up with lots of concurrent reservations, looking a fair bit into the future. E.g. during a smart shutdown in a workload with lots of subtransactions, trying to reserve space might make the situation actually worse because we might end up trying to reserve the combined size of records. The problem here is that the reserve function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Yea, I don't think anything really safe is going to work without significant penalties. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:24:39 -0500, Tom Lane wrote: Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for. I too have got doubts about whether a completely bulletproof solution is practical. (And as you say, even if our internal logic was bulletproof, a COW filesystem defeats all guarantees in this area anyway.) But perhaps a 99% solution would be a useful compromise. Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. Which cases of essentially unbounded record sizes do we currently have? The only ones that I remember right now are commit and abort records (including when wrapped in a prepared xact), containing a) the list of committing/aborting subtransactions, b) the list of files to drop, and c) cache invalidations. Hm. There's also xl_standby_locks, but that'd be easily splittable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. If we'd truncate it in concord with pg_clog, we'd only need to log the subxids which haven't been explicitly assigned. Unfortunately pg_subtrans will be bigger than pg_clog, making such a scheme likely to be painful. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. We could do something similar for the cache invalidations, but that seems likely to get rather ugly, as we'd need to hold ProcArrayLock till the next record is read, or until a shutdown or end-of-recovery record is read. If one of the latter is found before the corresponding invalidations, we'd need to invalidate the entire syscache.
Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 21 January 2014 23:01, Jeff Janes jeff.ja...@gmail.com wrote: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are: 1. waiting; 2. aborting transactions; 3. some kind of release-locks-then-wait-and-retry. (3) is a step too far for me, even though it is easier than you say, since we write WAL before changing the data block, so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1), waiting for up to checkpoint_timeout, then (2), if we think that is a problem. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-22 01:18:36 +0100, Simon Riggs wrote: My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are: 1. waiting; 2. aborting transactions; 3. some kind of release-locks-then-wait-and-retry. (3) is a step too far for me, even though it is easier than you say, since we write WAL before changing the data block, so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1), waiting for up to checkpoint_timeout, then (2), if we think that is a problem. How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this kind of thing just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. I do not believe that the two acceptable outcomes here are either that we continue to always PANIC shutdown (i.e. do nothing), or promise to never PANIC shutdown. There is likely to be a third way, which is that the probability of a PANIC shutdown is, at the macro level, reduced somewhat from the present probability of 1.0. People are not going to develop a lackadaisical attitude about running out of disk space on the pg_xlog partition if we do so. They still have plenty of incentive to make sure that that doesn't happen. -- Peter Geoghegan
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 19:23:57 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen.
Well, the reason I suggested doing it the other way round is that we wouldn't need to reassemble anything (outside of cache invalidations, which I don't know how to handle that way), which I think is a significant increase in robustness and decrease in complexity. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Well, most of those could be done outside of a critical section, possibly just FATALing out. Beats PANICing. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:23:57 -0500, Tom Lane wrote: I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this kind of thing just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. Yes, realistically the boundary will be relatively low, but I don't think that that means that we can disregard issues like the possibility that a record might be bigger than wal_buffers. Not because it'd allow theoretical issues, but because it rules out several tempting approaches like e.g. extending the in-memory reservation scheme of Heikki's scalability work to handle this. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 19:45:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:23:57 -0500, Tom Lane wrote: I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote: (I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below's just my understanding): The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. The only other thing I can think of is: when a tablespace is offline, write WAL records to a separate tablespace recovery log as they're encountered. Replay this log when the tablespace is restored, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen.
Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby." Not pretty, but at least there's the option of using a sync standby and avoiding data loss. How would you approach this? Sorry to be replying late. You are right that you could record/apply WAL separately for offline tablespaces. The problem is that you could have logical ties from the offline tablespace to online tablespaces. For example, what happens if data in an online tablespace references a primary key in an offline tablespace? What if the system catalogs are stored in an offline tablespace? Right now, we allow logical bindings across physical tablespaces. To do what you want, you would really need to store each database in its own tablespace. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Peter Eisentraut pete...@gmx.net writes: I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases. And provide either fully reliable contrib scripts or internal archive commands ready to use for those common cases. I can't think of other common use cases, by the way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk - is very difficult. I know there are the following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either. 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas san told me here, is to do something like "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". 2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. The archive_command script needs to catch SIGQUIT and exit. 3. You cannot know the reason of an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
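The copy-then-rename pattern and the SIGQUIT point from the message above can be sketched as a small script. Everything here (paths, file names, the `archive_one` helper) is illustrative, not a recommended implementation:

```shell
#!/bin/sh
# Sketch of a safer archive_command: copy to a temporary name first, then
# rename, so a crash mid-copy never leaves a short file under the final
# name (rename within one filesystem is atomic).
# Trap SIGQUIT (sent by postmaster on pg_ctl stop -mi) so we exit cleanly
# instead of dumping core into the data directory.
trap 'exit 1' QUIT

archive_one() {
    src=$1 dir=$2 file=$3
    cp "$src" "$dir/$file.tmp" || return 1
    mv "$dir/$file.tmp" "$dir/$file"
}

# demo: a scratch directory stands in for %p and the archive area
work=$(mktemp -d)
mkdir "$work/archive"
printf 'fake WAL segment' > "$work/seg"
archive_one "$work/seg" "$work/archive" 000000010000000000000001
ls "$work/archive"
```

In a real archive_command this would be invoked as something like `archive_one %p /archive/dir %f`, returning non-zero on failure so the server retries.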
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. That, or provide a standard archive command that takes the directory as argument? I bet we have tons of those available among us users...
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools.
And there's another example of why we need an archive command: I'm just setting up pgpool replication on Amazon AWS. I'm sending WAL archives to an S3 bucket, which doesn't appear as a directory on the server. From: http://www.pgpool.net/pipermail/pgpool-general/2013-June/001851.html -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Jun 12, 2013 4:56 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Wouldn't that encourage people to do local archiving, which is almost always a bad idea? I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools.
But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. /Magnus
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 7:20 PM, Jeff Janes jeff.ja...@gmail.com wrote: If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple minutes to give archiving a chance to do its things? Doing that while holding onto locks and a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable. Blech. I think that's setting our standards pretty low. It would neither be possible to use the system nor to shut it down cleanly; I think the effect would be to turn an immediate PANIC into a slightly-delayed PANIC, possibly accompanied by some DBA panic. It seems to me that there are two general ways of approaching this problem. 1. Discover sooner that we're out of space. Once we've modified the buffer and entered the critical section, it's too late to have second thoughts about completing the operation. If we could guarantee prior to modifying the buffers that enough WAL space was present to store the record we're about to write, then we'd be certain not to fail for this reason. In theory, this is simple: keep track of how much uncommitted WAL space we have. Increment the value when we create new WAL segments and decrement it by the size of the WAL record we plan to write. In practice, it's not so simple. We don't know whether we're going to emit FPIs until after we enter the critical section, so the size of the record can't be known precisely early enough. We could think about estimating the space needed conservatively and truing it up occasionally. However, there's a second problem: the available-WAL-space counter would surely become a contention point. Here's a sketch of a possible solution. Suppose we know that an individual WAL record can't be larger than, uh, 64kB. I'm not sure there is a limit on the size of a WAL record, but let's say there is, or we can install one, at around that size limit. 
Before we enter a critical section that's going to write a WAL record, we verify that the amount of WAL space remaining is at least 64kB * MaxBackends. If it's not, we embark on a series of short sleeps, rechecking after each one; if we hit some time limit, we ERROR out. As long as every backend checks this before every WAL record, we can always be sure there will be at least 64kB left for us. With this approach, the shared variable that stores the amount of WAL space remaining only needs to be updated under WALInsertLock; the reservation step only involves a read. That might be cheap enough not to matter. 2. Recover from the fact that we ran out of space by backing out the changes to shared buffers. Initially, I thought this might be a promising approach: if we've modified any shared buffers and discover that we can't log the changes, just invalidate the buffers! Of course, it doesn't work, because the buffer might have already been dirty when we locked it. So we'd actually need a way to reverse out all the changes we were about to log. That's probably too expensive to contemplate, in general; and the code would likely get so little testing as to invite bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 11:32 AM, Magnus Hagander mag...@hagander.net wrote: Wouldn't that encourage people to do local archiving, which is almost always a bad idea? Maybe, but refusing to improve the UI because people might then use the feature seems wrong-headed. I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Right now you have to be a rocket scientist no matter what configuration you're running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 6/12/13 10:55 AM, Robert Haas wrote: But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. Doesn't that just move the problem to managing NFS or batch jobs? Do we want to encourage that? I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases.
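One way to read "fully specify and/or script what to do in each of these cases" is a single wrapper with a small set of named methods. The method names and argument layout below are assumptions for illustration only:

```shell
#!/bin/sh
# Hypothetical wrapper covering a few common archive methods. Only the
# branch actually selected needs its tool installed.
archive_via() {
    method=$1 src=$2 dest=$3
    case "$method" in
        local) cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest" ;;
        rsync) rsync --quiet --archive "$src" "$dest" ;;
        scp)   scp -q "$src" "$dest" ;;
        *)     echo "unknown archive method: $method" >&2; return 1 ;;
    esac
}

# demo of the local method
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_via local "$work/seg" "$work/archived" && echo ok
```

An S3 or NFS branch would slot into the same `case`; the point is that each branch can encode the full copy-then-rename discipline once, instead of every site reinventing it.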
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 12:07 PM, Peter Eisentraut pete...@gmx.net wrote: I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases. Go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/12/2013 08:49 AM, Robert Haas wrote: Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. Correct. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. Certainly not by default, but it is also something that can be easy to set up reliably if you know what you are doing. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. No. It suggests that people have more than one requirement that the project WILL NEVER be able to solve. Granted we have solved some of them, for example pg_basebackup. However, pg_basebackup isn't really useful for a large database. Multithreaded rsync is much more efficient. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Agreed. Right now you have to be a rocket scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is OK because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it.
When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 6:03 PM, Joshua D. Drake j...@commandprompt.com wrote: Right now you have to be a rocket scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is OK because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. More than easy, it should be obvious. Obvious doesn't mean easy; it just means that what you have to do to get it right is clearly in front of you. When you give people the freedom of an archive command, you also take away any guidance more restrictive options give. I think the point here is that a default would guide people in how to make this work reliably, without having to rediscover it every time. A good, *obvious* (not easy) default. Even cp blah to an NFS mount is obvious, while not easy (setting up NFS through firewalls is never easy). So, having archive utilities in place of cp would ease the burden of administration, because it'd be based on collective knowledge. Some pg_cp (or more likely pg_archive_wal) could check there's enough space, and whatever else collective knowledge decided is necessary.
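A hypothetical `pg_archive_wal`-style helper of the sort suggested above might look like this. The name, the free-space floor, and the argument layout are all made up for illustration:

```shell
#!/bin/sh
# Check free space in the archive area before copying, so we fail with a
# clear message instead of leaving a truncated segment behind.
archive_checked() {
    src=$1 dest_dir=$2 file=$3
    min_free_kb=1024                   # illustrative floor; tune in practice
    avail_kb=$(df -Pk "$dest_dir" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$min_free_kb" ]; then
        echo "archive area low on space (${avail_kb}kB free)" >&2
        return 1
    fi
    cp "$src" "$dest_dir/$file.tmp" && mv "$dest_dir/$file.tmp" "$dest_dir/$file"
}

# demo against a scratch directory
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_checked "$work/seg" "$work" seg.done && echo archived
```

Because the check happens before the copy, a full archive area produces a non-zero exit (so the server keeps retrying) rather than a short file.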
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Craig Ringer cr...@2ndquadrant.com The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. Although I still lack understanding of PostgreSQL implementation, I have an optimistic feeling that such complexity would not be required. While a tablespace is offline, subsequent access from new or existing transactions is rejected with an error tablespace is offline. So new WAL records would not be generated for the offline tablespace. To take the tablespace back online, the DBA performs per-tablespace archive recovery. Per-tablespace archive recovery restores tablespace data files from the backup, then read through archive and pg_xlog/ WAL as usual, and selectively applies WAL records for the tablespace. I don't think it's a must-be-fixed problem that the WAL for all tablespaces is mixed in one location. I suppose we can tolerate that archive recovery takes a long time. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. 
Per-tablespace WAL is very interesting for another reason -- massive-scale OLTP for database consolidation. This feature would certainly be a breakthrough for amazing performance, because WAL is usually the last bottleneck in OLTP. Yes, I can imagine recovery would be much, much more complicated. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. As I mentioned above, in my shallow understanding, it seems that the additional complexities can be controlled. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby. Not pretty, but at least there's the option of using a sync standby and avoiding data loss. Sync standby... maybe. Let me consider this. How would you approach this? Thanks Craig, you gave me some interesting insights. All of these topics are interesting, and I'd like to work on them when I have acquired enough knowledge and experience in PostgreSQL development. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Josh, Daniel, Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance for which there are no packages, or you might accidentally cause your database to shut down in the middle of the night. This is an outright falsehood. We are telling them, You better know what you are doing or You should call a consultant. This is no different than, You better know what you are doing or You should take driving lessons. What I'm pointing out is that there is no simple case for archiving the way we have it set up. That is, every possible way to deploy PITR for Postgres involves complex, error-prone configuration, setup, and monitoring. I don't think that's necessary; simple cases should have simple solutions. If you do a quick survey of pgsql-general, you will see that the issue of databases shutting down unexpectedly due to archiving running them out of disk space is a very common problem. People shouldn't be afraid of their backup solutions. I'd agree that one possible answer for this is to just get one of the external tools simplified, well-packaged, distributed, instrumented for common monitoring systems, and referenced in our main documentation. I'd say Barman is the closest to a simple solution for the simple common case, at least for PITR. I've been able to give some clients Barman and have them deploy it themselves. This isn't true of the other tools I've tried. Too bad it's GPL, and doesn't do archiving-for-streaming. I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves?
And where are these archivers? If WAL archiving is your *second* level of redundancy, you will generally be willing to have it break rather than interfere with the production workload. This is particularly the case if you're using archiving just as a backup for streaming replication. Heck, I've had one client where archiving was being used *only* to spin up staging servers, and not for production at all; do you think they wanted production to shut down if they ran out of archive space (which it did)? I'll also point out that archiving can silently fail for a number of reasons having nothing to do with safety options, such as an NFS mount in Linux silently going away (I've also had this happen), or network issues causing file corruption. Which just points out that we need better ways to detect gaps/corruption in archiving. Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users about what to do when we run out of WAL space. And the stop archiving option needs to be there for users, as well as the shut down option, *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:07 AM, Joshua D. Drake j...@commandprompt.com wrote: On 06/08/2013 07:36 AM, MauMau wrote: 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. Should that be changed? If the file is 16MB but it turns to gibberish after 3MB, recovery proceeds up to the gibberish. Given that, why should it refuse to start if the file is only 3MB to start with? The solution, which IIRC Tomas-san told me here, is to do something like: cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f. This will overwrite /archive/dir/%f if it already exists, which is usually recommended against. Although I don't know that it necessarily should be. One common problem with archiving is for a network glitch to occur during the archive command, so the archive command fails and tries again later. But the later tries will always fail, because the target was created before/during the glitch. Perhaps a more full-featured archive command would detect and rename an existing file, rather than either overwriting it or failing. If we have no compunction about overwriting the file, then I don't see a reason to use the cp + mv combination. If the simple cp fails to copy the entire file, it will be tried again until it succeeds. Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation. 3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.
Wait, what? Is this true (someone else?) It is kind of true. PostgreSQL does not automatically arrange for the stderr of the archive_command to be sent to syslog. But archive_command can do whatever it wants, including arranging for its own failure messages to go to syslog. Cheers, Jeff
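Two of the ideas above, tolerating a retry after a glitch without blindly overwriting, and having archive_command report its own failures to syslog, can be combined in a sketch like this (names, paths, and the logger tag are illustrative):

```shell
#!/bin/sh
# If the target already exists with identical bytes, a previous attempt
# actually succeeded, so report success; if it differs, refuse to clobber
# it and say so via logger(1) ourselves, since the server won't log to
# syslog on our behalf.
archive_one() {
    src=$1 dest=$2
    if [ -e "$dest" ]; then
        cmp -s "$src" "$dest" && return 0   # earlier attempt already landed
        logger -t archive_command "refusing to overwrite differing $dest" || true
        return 1
    fi
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}

# demo: the second call simulates a retry after a network glitch
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_one "$work/seg" "$work/archived"   # first attempt
archive_one "$work/seg" "$work/archived"   # retry: same bytes, succeeds
echo "retry status: $?"
```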
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 11:59 AM, Josh Berkus j...@agliodbs.com wrote: Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users what to do when we run out of WAL space. And that the stop archiving option needs to be there for users, as well as the shut down option. *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. I don't doubt this; that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-WAL-from-archiving-whenever is not a good one. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad hoc and all over, so it's not as though I wanted to suggest that a Postgres feature to this effect was an anti-feature. And, as I wrote before, it's much easier to teach an external system to drop WAL than it is to teach Postgres to attenuate, hence the repeated correspondence from my fellows and myself about the attenuation side of the equation. Hope that clears things up about where I stand on the matter.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Daniel, Jeff, I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad-hoc and all over, so it's not as though I wanted to suggest a Postgres feature to this effect was an anti-feature. Yep. Drake was saying it was an anti-feature, though, so I was arguing with him. Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation. If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 4:42 PM, Josh Berkus j...@agliodbs.com wrote: Daniel, Jeff, I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. I like your suggestion, save one thing: it's not a 'failure' of archiving if it cannot keep up, provided one subscribes to the view that archiving is not elective. I nitpick at this because one might think this has something to do with a non-zero return code from the archiving program, which already has a pretty alarmist message in the event of transient failures (I think someone brought this up on -hackers a few months ago... can't remember if that resulted in a change). I don't have a better suggestion that is less jargonrific though, but I wanted to express my general appreciation as to the shape of the suggestion.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/10/2013 04:42 PM, Josh Berkus wrote:
>> Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation.
>
> If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts.

Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support.

JD

-- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
> Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support.

Example script:

=

#!/usr/bin/env bash
# Simple script to copy WAL archives from one server to another
# to be called as archive_command (call this as wal_archive %p %f)

# Settings. Please change the below to match your configuration.

# holding directory for the archived log files on the replica
# this is NOT pg_xlog:
WALDIR=/var/lib/pgsql/archive_logs

# touch file to shut off archiving in case of filling up the disk:
NOARCHIVE=/var/lib/pgsql/NOARCHIVE

# replica IP, IPv6 or DNS address:
REPLICA=192.168.1.3

# put any special SSH options here,
# and the location of RSYNC:
export RSYNC_RSH=ssh
RSYNC=/usr/bin/rsync

## DO NOT CHANGE THINGS BELOW THIS LINE ##

SOURCE=$1   # %p
FILE=$2     # %f
DEST=${WALDIR}/${FILE}

# See whether we want all archiving off
test -f ${NOARCHIVE} && exit 0

# Copy the file to the spool area on the replica, error out if
# the transfer fails
${RSYNC} --quiet --archive --rsync-path=${RSYNC} ${SOURCE} \
    ${REPLICA}:${DEST}
if [ $? -ne 0 ]; then
    exit 1
fi

=

-- Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 08:32 AM, MauMau wrote:
> - Failure of a disk containing data directory or tablespace
>
> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown.

... which is why tablespaces aren't disposable, and why creating a tablespace in a RAM disk is such an awful idea.

I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.

I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html that adds a prominent warning like:

WARNING: Every tablespace must be present before the database can be started. There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage, etc. *Do not put a tablespace on a RAM disk*; instead just use UNLOGGED tables.

(Opinions on the above?)

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 03:02 AM, Jeff Janes wrote:
> It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.

Smarter syslog flavours like rsyslog certainly do this.

No alert system triggered by events within Pg will ever be fully sufficient. "Oops, the postmaster crashed with stack corruption, I'll just exec whatever's in this on_panic_exec GUC (if I can still read it and it's still valid) to hopefully tell the sysadmin about my death." Hmm, sounds as reliable and safe as a bicycle powered by a home-made rocket engine.

External monitoring is IMO always necessary. Something like Icinga with check_postgres can trivially poke Pg to make sure it's alive. It can also efficiently check the 'pg_error.log' from rsyslog that contains only severe errors and raise alerts if it doesn't like what it sees.

If I'm already doing external monitoring (which is necessary as outlined above) then I see much less point having Pg able to raise alerts for problems, and am more interested in better built-in functions and views for exposing Pg's state. Easier monitoring of WAL build-up, ways to slow the master if async replicas or archiving are getting too far behind, etc.

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
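The severe-errors-only log file described above could be arranged with ordinary syslog selectors, roughly like this (illustrative only; it assumes Pg is configured with log_destination = 'syslog' and syslog_facility = 'LOCAL0', and the file paths are made up):

```
# Hypothetical rsyslog rules matching the setup described above:
# route only PostgreSQL messages of severity err or worse to a
# dedicated file that the monitoring system watches.
local0.err      /var/log/pg_error.log
# Everything else from postgres goes to the normal log:
local0.*        /var/log/postgresql.log
```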
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote:
>> At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.
>
> Does this preclude (sorry, I don't know this part of the code very well) my suggestion of "on log create"?

Well, yes. We create log segments some layers below XLogInsert() if necessary, and as I said above, we're in a critical section at that point, so just rolling back isn't one of the options.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Craig Ringer cr...@2ndquadrant.com
> On 06/09/2013 08:32 AM, MauMau wrote:
>> - Failure of a disk containing data directory or tablespace
>>
>> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown.
>
> I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html that adds a prominent warning like:
>
> WARNING: Every tablespace must be present before the database can be started. There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage, etc. *Do not put a tablespace on a RAM disk*; instead just use UNLOGGED tables.
>
> (Opinions on the above?)

Yes, I'm sure this is useful for DBAs to know how postgres behaves and take some preparations. However, this does not apply to my case, because I'm using tablespaces for I/O distribution across multiple disks and simply for database capacity. The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown.

> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.

I'm in favor of taking a tablespace offline when I/O failure is encountered, and continuing to run the database server. But WAL must not be discarded, because committed transactions must be preserved for durability of ACID. Postgres needs to take these steps when it encounters an I/O error:

1. Take the tablespace offline, so that subsequent reads/writes against it return an error without actually issuing read/write against data files.
2. Discard shared buffers containing data in the tablespace.

WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

Regards
MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/10/2013 06:39 AM, MauMau wrote:
> The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown.

More specifically, with more independent sets of disks / file systems.

>> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.
>
> I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. But WAL must not be discarded because committed transactions must be preserved for durability of ACID.

[snip]

> WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below is just my understanding.)

The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested.

To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc.

The only other thing I can think of is: when a tablespace is offline, write WAL records to a separate tablespace recovery log as they're encountered. Replay this log when the tablespace is restored, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL.

None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment.

The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby." Not pretty, but at least there's the option of using a sync standby and avoiding data loss.

How would you approach this?

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Daniel Farina dan...@heroku.com
> On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote:
>> Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night." At which point most sensible users say "no thanks, I'll use something else."
>
> Inverted and just as well supported: if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features. This being but the hypothetical first one. Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.

Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk - is very difficult. I know there are the following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either.

1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas-san told me here, is to do something like:

cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f

2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. The archive_command script needs to catch SIGQUIT and exit.

3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.

I hope PostgreSQL will provide a reliable archiving facility that is ready to use.

Regards
MauMau
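The copy-then-rename pattern from point 1 can be wrapped in a small helper. This is a sketch only: the function name (archive_one) and the ARCHIVE_DIR variable are invented for illustration, and a real archive_command script would add the SIGQUIT handling from point 2:

```shell
# Hypothetical helper illustrating the copy-then-rename pattern above.
# A script built around it would be called from postgresql.conf as e.g.:
#   archive_command = 'safe_archive.sh %p %f'
archive_one() {
    src="$1"                    # %p: path to the WAL segment
    dest="$ARCHIVE_DIR/$2"      # %f: file name only

    # Refuse to overwrite an already-archived segment.
    [ -e "$dest" ] && return 1

    # Copy under a temp name, then rename, so a crash mid-copy never
    # leaves a truncated file under the final segment name.
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

The rename is what makes this safe: a partially copied `.tmp` file is never mistaken for a complete segment during archive recovery.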
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/07/2013 12:14 PM, Josh Berkus wrote:
> Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night."

This is an outright falsehood. We are telling them, "You better know what you are doing" or "You should call a consultant". This is no different than, "You better know what you are doing" or "You should take driving lessons".

> At which point most sensible users say "no thanks, I'll use something else."

Josh, I have always admired your flair for dramatics; it almost rivals mine. Users are free to use what they want; some will choose lesser databases. I am ok with that because eventually, if PostgreSQL is the right tool, they will come back to us and PgExperts or CMD or OmniTI, or they will know what they are doing and thus don't need us.

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 07:36 AM, MauMau wrote:
> 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas-san told me here, is to do something like: cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f

Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that.

> 3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.

Wait, what? Is this true (someone else?)

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
> I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench.

Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks, and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past, but that doesn't mean that their usefulness has ended.

The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid, but I also don't think it is necessary.

To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

Sincerely,
Joshua D. Drake
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote:
> To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid.

At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-07 12:02:57 +0300, Heikki Linnakangas wrote:
> On 07.06.2013 00:38, Andres Freund wrote:
>> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
>>> * Heikki Linnakangas wrote:
>>>> The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can [...] So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in [...]
>>>
>>> There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency.
>>
>> That's not a bad technique. I wonder how reliable it would be in postgres.
>
> That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section.

Well, if you have 16 or 32MB of reserved WAL space available you don't need to judge all that precisely how much space is available. So we can just sprinkle some EnsureXLogHasSpace() on XLogInsert() callsites like heap_insert(), but we can do that outside of the critical sections, and we can do it without locks since there needs to happen quite some write activity to overrun the reserved space. Anything that desperately needs to write stuff, like the end of recovery checkpoint, can just not call EnsureXLogHasSpace() and rely on the reserved space.

Seems like 90% of the solution for 30% of the complexity or so.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
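The shape being proposed can be sketched in a few lines of C. All names here are illustrative - EnsureXLogHasSpace() is not an actual PostgreSQL function, and a static atomic stands in for what would really be a flag in shared memory updated by the WAL writer:

```c
#include <stdbool.h>
#include <stdatomic.h>

/* Stand-in for a shared-memory flag; set by the WAL writer when the
 * amount of preallocated WAL space drops below the reserve threshold. */
static atomic_bool xlog_space_low;

/* Called by backends at XLogInsert() callsites, outside any critical
 * section: a single lock-free shared-memory read, no filesystem access. */
static bool
EnsureXLogHasSpace(void)
{
    return !atomic_load_explicit(&xlog_space_low, memory_order_relaxed);
}

/* Called periodically by the WAL writer after checking free space
 * (hypothetical threshold logic). */
static void
UpdateXLogSpaceFlag(long free_bytes, long reserve_bytes)
{
    atomic_store_explicit(&xlog_space_low,
                          free_bytes < reserve_bytes,
                          memory_order_relaxed);
}
```

The point of the design is the division of labor: the expensive space check runs in one background process, while the per-insert cost for every backend is a relaxed atomic load, which is why it can sit on hot paths like heap_insert() without measurable overhead.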
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote:
>> The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people.
>
> You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of.

I think the database engine is about the last thing which would have full knowledge of the best way to contact the DBA, especially during events which by definition mean things are already going badly. I certainly don't see having core code which knows how to talk to every PBX, SMS, email system, or twitter feed that anyone might wish to use for logging.

> PostgreSQL already supports two formats of text logs, plus syslog and eventlog. Is there some additional logging management tool that we could support which is widely used, doesn't require an expensive consultant to set up and configure correctly (or even to decide what "correctly" means for the given situation), and which solves 80% of the problems?

It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.

> While that's the only solution we have for now, it's hardly a worthy design goal. Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night." At which point most sensible users say "no thanks, I'll use something else."

What does the something else do? Hopefully it is not to silently invalidate your backups.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:15 AM, Joshua D. Drake j...@commandprompt.com wrote:
> On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
>> I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench.
>
> Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended. The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.

Given that the system is going to become unusable, I don't see why PANIC is an awful, stupid way of doing it. And if it can only be used for things that don't generate WAL, that is pretty much unusable, as even read-only transactions often need to do clean-up tasks that generate WAL.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 11:27 AM, Andres Freund wrote:
> On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote:
>> To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.
>
> You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid.

Yes, I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICs. I apologize if that came out wrong.

> At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

Does this preclude (sorry, I don't know this part of the code very well) my suggestion of "on log create"?

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 7 June 2013 10:02, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 07.06.2013 00:38, Andres Freund wrote:
>> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
>>> * Heikki Linnakangas wrote:
>>>> The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can [...] So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in [...]
>>>
>>> There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency.
>>
>> That's not a bad technique. I wonder how reliable it would be in postgres.
>
> That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section.

I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now.

Christian's idea seems good to me. Looks like you could be dismissing this too early, especially since no better idea has emerged. I doubt that we're going to think of something others didn't already face. It's pretty clear that most other DBMSs do this with their logs.

For your fast WAL insert patch to work well, we need a simple and fast technique to detect out-of-space.

-- Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:27 AM, Andres Freund and...@2ndquadrant.com wrote:
> You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid. At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple of minutes to give archiving a chance to do its thing? Doing that while holding onto locks and a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable.

The problem is that even if the file is only being kept so it can be archived, once archiving succeeds I think the file is not removed immediately, but rather not until the next checkpoint, which will never happen while the locks are still held.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Joshua D. Drake j...@commandprompt.com On 06/08/2013 07:36 AM, MauMau wrote: 3. You cannot know the reason for archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. Wait, what? Is this true (someone else?) I'm sorry, please let me correct myself. What I meant is: the exact reason for archive_command failure (e.g. cp's output) is not logged in syslog/eventlog. The fact that archive_command failed is itself recorded. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Joshua D. Drake j...@commandprompt.com On 06/08/2013 11:27 AM, Andres Freund wrote: You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid. Yes, I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICs. I apologize if that came out wrong. I've experienced PANIC shutdowns several times due to a full WAL disk during development. As a user, one problem with PANIC is that it dumps core. I think core files should be dumped only when a can't-happen event has occurred, i.e. a PostgreSQL bug. I didn't expect postgres to dump core simply because the disk is full. I want postgres to shut down with a FATAL message in that exact case. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Josh Berkus j...@agliodbs.com There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I think there is one more case. Is this correct? - Failure of a disk containing the data directory or a tablespace If a checkpoint can't write buffers to disk because of disk failure, the checkpoint cannot complete, and thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to a postgres shutdown. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. This sounds very scary. Is it possible to complete recovery and start up the postmaster with either or both of the following modifications? [Idea 1] During recovery, force archiving of a WAL file and delete/recycle it in pg_xlog/ as soon as its contents are applied. [Idea 2] During recovery, when disk full is encountered at the end-of-recovery checkpoint, force archiving of all unarchived WAL files and delete/recycle them at that time. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 10:00 PM, Heikki Linnakangas wrote: I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. There were a couple of dba.stackexchange.com reports along the same lines recently, too. Both involved an antivirus vendor's VM appliance with a canned (stupid) configuration that set wal_keep_segments too high for the disk space allocated and stored WAL on a separate partition. People are having issues with WAL space management in the real world, and I think it and autovacuum are the two hardest things for most people to configure and understand in Pg right now. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 10:57 AM, Daniel Farina wrote: At which point most sensible users say no thanks, I'll use something else. [snip] I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. It sounds like between you both you've come up with a pretty solid argument by exclusion for throttling WAL writing as space grows critical. Dropping archive segments seems pretty unsafe from the solid arguments Daniel presents, and Josh makes a good case for why shutting the DB down when archiving can't keep up isn't any better. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 07.06.2013 00:38, Andres Freund wrote: On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. That's not a bad technique. I wonder how reliable it would be in postgres. That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section. I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
--On 6. Juni 2013 16:25:29 -0700 Josh Berkus j...@agliodbs.com wrote: Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. Slightly OT, but I always wondered whether we could create a function, say pg_last_xlog_removed() for example, returning a value suitable for calculating the distance to the current position. An increasing value could be used to instruct monitoring to throw a warning if a certain threshold is exceeded. I've also seen people creating monitoring scripts by looking into archive_status and doing simple counts of the .ready files, giving a warning if that exceeds an expected maximum value. I haven't looked at the code very deeply, but I think we already store the position of the last removed xlog in shared memory; maybe this can be used somehow. Afaik, we do cleanup only during checkpoints, so this all has too much delay... -- Thanks Bernd
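The .ready-counting approach Bernd mentions can be scripted entirely outside the server; a minimal sketch of such a monitoring probe (the warning threshold is arbitrary, and the directory layout assumed is pg_xlog/archive_status as of this era):

```python
import os

def archiver_backlog(archive_status_dir):
    """Count .ready files in pg_xlog/archive_status; each one marks a
    segment still waiting to be archived. A growing count means the
    archiver is falling behind."""
    return sum(1 for name in os.listdir(archive_status_dir)
               if name.endswith(".ready"))

def check_backlog(archive_status_dir, warn_at=10):
    """Return a (status, count) pair in the style of a Nagios-like probe."""
    n = archiver_backlog(archive_status_dir)
    return ("WARNING" if n >= warn_at else "OK", n)
```

A cron job running this against `$PGDATA/pg_xlog/archive_status` and alerting on WARNING gives exactly the kind of early falling-behind signal the thread is asking for.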
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. That leaves a few miscellaneous XLogInsert calls that need to be guarded, but it leaves a lot less room for bugs of omission, and keeps the code cleaner. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Heikki Linnakangas hlinnakan...@vmware.com writes: On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 07.06.2013 19:33, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Heikki Linnakangas hlinnakan...@vmware.com writes: On 07.06.2013 19:33, Tom Lane wrote: Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. What happens when several updates are occurring concurrently? regards, tom lane
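Tom's concurrency question is the crux: with many backends in flight, each reservation must be taken atomically from a shared counter, so concurrent reservations simply add up. A toy, lock-based model of the idea being debated (hypothetical names; the real thing would be a counter in PostgreSQL shared memory in C, updated under a spinlock or atomically):

```python
import threading

RESERVE_BYTES = 32 * 1024  # the conservative per-operation guess above

class WalReservation:
    """Toy model of a shared-memory WAL space reservation counter.
    A threading.Lock stands in for a shared-memory spinlock."""

    def __init__(self, free_bytes):
        self._free = free_bytes
        self._lock = threading.Lock()

    def reserve(self, nbytes=RESERVE_BYTES):
        """Called before entering the critical section. Returning False
        lets the caller raise a plain ERROR instead of later hitting a
        PANIC inside the critical section."""
        with self._lock:
            if self._free < nbytes:
                return False
            self._free -= nbytes
            return True

    def release(self, reserved, used):
        """After XLogInsert: give back the unused part of the guess."""
        with self._lock:
            self._free += reserved - used
```

Because every concurrent backend holds at most one in-flight reservation, the 32KB guess only needs to cover a single record per backend, not the sum of all concurrent activity.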
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Agreed, and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. People have different configurations. Most of my clients use archiving for backup or replication; they would rather have archiving cease (and send a CRITICAL alert) than have the master go offline. That's pretty common, probably more common than the if I don't have redundancy shut down case. Certainly anyone making the decision that their master database should shut down rather than cease archiving should make it *consciously*, instead of finding out the hard way. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of. While that's the only solution we have for now, it's hardly a worthy design goal. Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance for which there are no packages, or you might accidentally cause your database to shut down in the middle of the night. At which point most sensible users say no thanks, I'll use something else. -- Josh Berkus PostgreSQL Experts Inc. 
http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote: Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire and expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night. Inverted and just as well supported: if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features. This being but the hypothetical first one. Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Contrast: JSON, where the stable OID in the core distribution helps pragmatically punt on a particularly sticky problem (extension dependencies and non-system OIDs), I can't think of a reason an external archiver couldn't do its job well right now. At which point most sensible users say no thanks, I'll use something else. Oh, I lost some disks, well, no big deal, I'll use the archives. Surprise! forensic analysis ensues So, as it turns out, it has been dropping segments at times because of systemic backlog for months/years. Alternative ending: Hey, I restored the database. later Why is the state so old? Why are customers getting warnings that their (thought paid) invoices are overdue? Oh crap, the restore was cut short by this stupid option and this database lives in the past! Fin. I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves? 
And where are these archivers that enjoy even a modicum of long-term success that are not reliable? If one wants to casually drop archives, how is someone going to find out and freak out a bit? Per experience, logs are pretty clearly hazardous to the purpose. Basically, I think the default that opts one into danger is not good, especially since the system is starting from a position of do too much stuff and you'll crash. Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. I am rather doubtful that this won't end up as a bunch of complex code that won't prevent the situation in all circumstances but which will produce bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I am much more excited about getting the soft limit case right and then seeing how many problems remain in reality. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06.06.2013 17:17, Andres Freund wrote: On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. I am rather doubtful that this won't end up as a bunch of complex code that won't prevent the situation in all circumstances but which will produce bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I also have a feeling that we'll likely miss some corner cases in the first cut, so that you can still run out of disk space if you try hard enough / are unlucky. But I think it would still be a big improvement if it only catches, say, 90% of the cases. I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
* Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. -- Christian
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
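Christian's ballast technique hinges on exactly the rename() question Andres raises. A sketch of the mechanics (file names are hypothetical; whether the final rename is ENOSPC-proof on copy-on-write filesystems such as btrfs, where even metadata updates need new blocks, is the open question):

```python
import os

SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size

def create_ballast(wal_dir, name="ballast.seg"):
    """At startup: preallocate an emergency segment with real data
    blocks (zero-filled and fsynced), never used in normal operation."""
    path = os.path.join(wal_dir, name)
    with open(path, "wb") as f:
        f.write(b"\0" * SEG_SIZE)
        f.flush()
        os.fsync(f.fileno())
    return path

def claim_ballast(wal_dir, next_segment_name, name="ballast.seg"):
    """On ENOSPC while creating the next segment: rename the ballast
    into place. rename() within one filesystem allocates no new data
    blocks, though possibly new metadata, which is the btrfs worry."""
    dst = os.path.join(wal_dir, next_segment_name)
    os.rename(os.path.join(wal_dir, name), dst)
    return dst
```

The rename is needed precisely because, as Andres notes, the file must carry a name derived from the LSN and timeline, which is unknown at preallocation time.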
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund and...@2ndquadrant.com wrote: That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... I suppose we could just always do the rename at the same time as setting up the current log file. That is, when we start wal log x also set up wal file x+1 at that time. This isn't actually guaranteed to be enough btw. It's possible that the record we're actively about to write will require all of both those files... But that should be very unlikely. -- greg
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Let's talk failure cases. There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three. Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records. One Volume -- This is the most complicated case, because we wouldn't necessarily run out of space because of WAL using it up. Anything could cause us to run out of disk space, including activity logs, swapping, pgsql_tmp files, database growth, or some other process which writes files. This means that the DBA getting out of disk-full manually is in some ways easier; there's usually stuff she can delete. However, it's much harder -- maybe impossible -- for PostgreSQL to prevent this kind of space outage. 
There should be things we can do to make it easier for the DBA to troubleshoot this, but I'm not sure what. We could use a hard limit for WAL to prevent WAL from contributing to out-of-space, but that'll only prevent a minority of cases. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
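The 90%-full trigger Josh proposes needs nothing more than a statvfs on the WAL directory; a minimal sketch (the threshold and function names are illustrative, not existing GUCs or server hooks):

```python
import os

def wal_volume_usage(wal_dir):
    """Fraction of the filesystem holding wal_dir that is in use."""
    st = os.statvfs(wal_dir)
    return 1.0 - (st.f_bavail / st.f_blocks)

def should_force_checkpoint(wal_dir, threshold=0.90):
    """True when the WAL volume crosses the threshold, i.e. when the
    server would force an immediate (non-spread) checkpoint in the
    scheme described above."""
    return wal_volume_usage(wal_dir) >= threshold
```

Note that using `f_bavail` (space available to unprivileged processes) rather than `f_bfree` matches what the DBA actually has to work with once the root reserve is excluded.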
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich ch...@chrullrich.net wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. fwiw, informix (at least until IDS 2000, not sure after that) had the same thing. only this was a parameter to set, and bad things happened if you forgot about it :D -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thursday, June 6, 2013, Josh Berkus wrote: Let's talk failure cases. There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three. Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Of course maybe whatever causes the archive to fail will also cause the delivery of the message to fail, but I don't see a real solution to this that doesn't start down an infinite regress. If it is not failing outright, but merely falling behind, then I don't really know how to go about detecting that, either in archive_command, or through tailing the PostgreSQL log. 
I guess archive_command, each time it is invoked, could count the files in the pg_xlog directory and warn if it thinks the number is unreasonable. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. Although the DBA probably wouldn't know it from reading the manual, it is almost always safe to delete the oldest WAL file (after copying it to a different partition just in case something goes wrong--it should be possible to do that, since if WAL is on its own partition, it is hard to imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps two complete checkpoints' worth of WAL around. I think the only reason you would not be able to recover after removing the oldest file is if the controldata file is damaged such that the most recent checkpoint record cannot be found and so it has to fall back to the previous one. Or at least, this is my understanding. On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records. I think that last sentence could also be applied without modification to the one volume case as well. So what would that look like? Before accepting a (non-checkpoint) WAL insert that fills up the current segment to a high enough level that a checkpoint record will no longer fit, it must first verify that a recycled file exists, or if not it must successfully init a new file. If that init fails, then it must do what? Signal for a checkpoint, release its locks, and then ERROR out? That would be better than a PANIC, but can it do better? 
Enter a retry loop so that once the checkpoint has finished, and assuming it has freed up enough WAL files for recycling/removal, it can try the original WAL insert again? Cheers, Jeff
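Jeff's point that archive_command can carry its own alerting is easy to demonstrate; a sketch of such a wrapper (the alert hook is a placeholder: mail, pager, monitoring push, whatever fits the site):

```python
import shutil
import sys

def archive_segment(src, dst, alert=lambda msg: print(msg, file=sys.stderr)):
    """Copy one WAL segment to the archive, as a local-copy
    archive_command would. Returns 0 on success; any nonzero status
    tells PostgreSQL the segment was NOT archived, so the file stays
    in pg_xlog and the command is retried later."""
    try:
        shutil.copy2(src, dst)
        return 0
    except OSError as exc:
        alert("archive of %s failed: %s" % (src, exc))
        return 1
```

The exit-status contract is the important part: the wrapper must only return 0 after the segment is durably archived, because that is PostgreSQL's sole signal that the file may eventually be recycled.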
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 09:30 PM, Jeff Janes wrote: Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Agreed, and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Yep, that is what PITRTools does. You can make it do whatever you want. JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes jeff.ja...@gmail.com wrote: I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Same boat. My archives are the real storage. The disks are write-back caching. That's because the storage of my archives is probably three to five orders of magnitude more reliable.