Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-22 18:19:25 -0600, Jim Nasby wrote: On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. That'd be noticeably expensive. Opening/stat'ing a file isn't cheap, especially if you do it via filename and not via fd, which we'd have to do. I am pretty sure it would be noticeable in single-client workloads, but it damned sure will be noticeable on busy multi-socket workloads. I still think doing the checks in the WAL writer is the best bet, setting a flag that can then cheaply be tested in shared memory. When set, it will cause any further action that would write xlog to error out unless it's already in progress. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
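A minimal sketch of the flag scheme Andres describes: the WAL writer does the expensive filesystem check in its own loop and publishes a flag; backends only test the flag. All names here (walSpaceLow, UpdateWalSpaceFlag, WalWriteAllowed) are invented for illustration and the flag stands in for real shared-memory state, not actual PostgreSQL code:

```c
#include <stdbool.h>

/* Sketch only: in PostgreSQL this flag would live in shared memory,
 * with the usual memory-ordering conventions. */
static volatile bool walSpaceLow = false;

/* Called periodically from the WAL writer's main loop: the expensive
 * free-space check happens here, out of the backends' hot path. */
static void
UpdateWalSpaceFlag(long free_bytes, long threshold)
{
    walSpaceLow = (free_bytes < threshold);
}

/* Cheap test a backend would make before starting any new action that
 * writes xlog; actions already in progress are allowed to finish.
 * A "false" result would become an ereport(ERROR) in the real thing. */
static bool
WalWriteAllowed(bool action_in_progress)
{
    return action_in_progress || !walSpaceLow;
}
```

The point of the split is that the stat()/statfs() cost is paid once per WAL-writer cycle rather than once per WAL record.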
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 23 January 2014 01:19, Jim Nasby j...@nasby.net wrote: On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. I grant that it does sound simple enough for a partial stop-gap. My concern is that it provides only a short delay before the eventual disk-full situation, which it doesn't actually prevent. IMHO the main issue now is how we clear down old WAL files. We need to perform a checkpoint to do that - and as has been pointed out in relation to my proposal, we cannot complete that because of locks that will be held for some time when we do eventually lock up. That issue is not solved by having ballast file(s). IMHO we need to resolve the deadlock inherent in the disk-full / WAL lock-up / checkpoint situation. My view is that it can be solved in a similar way to the way the buffer pin deadlock was resolved for Hot Standby. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-23 13:56:49 +0100, Simon Riggs wrote: IMHO we need to resolve the deadlock inherent in the disk-full / WAL lock-up / checkpoint situation. My view is that it can be solved in a similar way to the way the buffer pin deadlock was resolved for Hot Standby. I don't think that approach works here. We're not talking about mere buffer pins but the big bad exclusively locked buffer which is held by a backend in a critical section. Killing such a backend costs you a PANIC. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:23, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. I think removing excess subxacts from commit and abort records could be a good idea. Not sure anybody considered it before. We already emit xid allocation records when we overflow, so we already know which subxids have been allocated. We also issue subxact abort records for anything that aborted. So in theory we should be able to reconstruct an arbitrarily long chain of subxacts. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 01:30, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either, that writes abort records which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. Right, which is why we don't need to lock ProcArrayLock. As soon as we try to write a commit or abort it goes through the normal XLogInsert route. As soon as wal_buffers fills, WALWriteLock will be held continuously until we free some space. Since ProcArrayLock isn't held, read-only users can continue. As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Note that my proposal would not require aborting any in-flight transactions; they would continue to completion as soon as space is cleared. My proposal again, so we can review how simple it was... 1. Allow a checkpoint to complete by updating the control file, rather than writing WAL. The control file is already there and is fixed size, so we can be more confident it will accept the update. We could add a new checkpoint mode for that, or we could do that always for shutdown checkpoints (my preferred option). EFFECT: Since a checkpoint can now be called and complete without writing WAL, we are able to write dirty buffers and then clean out WAL files to reduce space. 2. If we fill the disk when writing WAL we do not PANIC, we signal the checkpointer process to perform an immediate checkpoint and then wait for its completion. EFFECT: Since we are holding WALWriteLock, all other write users will soon either wait for that lock directly or indirectly. Both of those points are relatively straightforward to implement and this proposal minimises seldom-tested code paths. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
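As a toy model of the two-step proposal above (all names invented, nothing like the real xlog code): treat writing WAL as consuming free bytes, treat the immediate checkpoint as reclaiming the bytes held by no-longer-needed segments, and PANIC only if the retry after the checkpoint still fails:

```c
/* Toy simulation of the proposal; DiskState and write_wal are
 * hypothetical stand-ins, not PostgreSQL functions. */
typedef struct
{
    long free_bytes;        /* space left on the pg_xlog filesystem */
    long reclaimable_bytes; /* old WAL a checkpoint would let us delete */
} DiskState;

typedef enum
{
    WAL_WRITTEN,                  /* normal path */
    WAL_WRITTEN_AFTER_CHECKPOINT, /* step 2 freed enough space */
    WAL_PANIC                     /* even the retry failed */
} WalOutcome;

static WalOutcome
write_wal(DiskState *d, long wal_needed)
{
    if (wal_needed <= d->free_bytes)
    {
        d->free_bytes -= wal_needed;
        return WAL_WRITTEN;
    }
    /* Instead of PANICing: an immediate checkpoint, completed via the
     * control file (step 1) so it needs no WAL itself, lets us delete
     * old segments. */
    d->free_bytes += d->reclaimable_bytes;
    d->reclaimable_bytes = 0;
    if (wal_needed <= d->free_bytes)
    {
        d->free_bytes -= wal_needed;
        return WAL_WRITTEN_AFTER_CHECKPOINT;
    }
    return WAL_PANIC;
}
```

The simulation makes the failure ordering explicit: PANIC becomes the last resort rather than the first response to ENOSPC.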
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. The alternative of adding a check for WAL space to the normal path is a non-starter, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Tom Lane t...@sss.pgh.pa.us wrote: Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. Dusting off my DBA hat for a moment, I would say that I would be happy if each process either generated an ERROR or went into a wait state. They would not all need to do the same thing; whichever is easier in each process's context would do. It would be nice if a process which went into a long wait state until disk space became available would issue a WARNING about that, where that is possible. I feel that anyone running a database that matters to them should be monitoring disk space and getting an alert from the monitoring system in advance of actually running out of disk space; but we all know that poorly managed systems are out there. To accommodate them we don't want to add risk or performance hits for those who do manage their systems well. The attempt to make more space by deleting WAL files scares me a bit. The dynamics of that seem like they could be counterproductive if the pg_xlog directory is on the same filesystem as everything else and there is a rapidly growing file. Every time you cleaned up, the fast-growing file would eat more of the space pg_xlog had been using, until perhaps it had no space to keep even a single segment, no? And how would that work with a situation where the archive_command was failing, causing a build-up of WAL files? It just seems like too much mechanism and incomplete coverage of the problem space. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Imagine we promote in between replaying the list of subxacts (only storing it in memory) and the main commit record. Either we have something like the incomplete action stuff doing something with the in-memory data, or we are in a situation where there can be xids bigger than TransactionXmin that are not in pg_subtrans and not in the procarray. 
Which I don't think exists today, since we either read the commit record in its entirety or not. We'd also need to use the MyPgXact->delayChkpt mechanism to prevent checkpoints from occurring in between those records, but we do that already, so that seems rather uncontroversial. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 21:42:19 -0500, Tom Lane wrote: Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. I (think) I get what you want to do, but splitting the record like that nonetheless opens up behaviour that previously wasn't there. Obviously we are not on the same page yet. In my vision, the WAL writer is dumping the same data it would have dumped, though in a different layout, and it's working from process-local state same as it does now. The WAL replayer is taking the same actions at the same time using the same data as it does now. There is no behavior that wasn't there, unless you're claiming that there are *existing* race conditions in commit/abort WAL processing. The only thing that seems mildly squishy about this is that it's not clear how long the WAL replayer ought to hang onto subsidiary records for a commit or abort it hasn't seen yet. In the case where we change our minds and abort a transaction after already having written some subsidiary records for the commit, it's not really a problem; the replayer can throw away any saved data related to the commit of xid N as soon as it sees an abort for xid N. However, what if the session crashes and never writes either a final commit or abort record? I think we can deal with this fairly easily though, because that case should end with a crash recovery cycle writing a shutdown checkpoint to the log (we do do that no?). 
So the rule can be "discard any unmatched subsidiary records if you see a shutdown checkpoint". This makes sense on its own terms since there are surely no active transactions at that point in the log. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 22 January 2014 14:25, Simon Riggs si...@2ndquadrant.com wrote: On 22 January 2014 13:14, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/22/2014 02:10 PM, Simon Riggs wrote: As Jeff points out, the blocks being modified would be locked until space is freed up. Which could make other users wait. The code required to avoid that wait would be complex and not worth any overhead. Checkpoint also acquires the content lock of every dirty page in the buffer cache... Good point. We would need to take special action for any dirty blocks that we cannot obtain content lock for, which should be a smallish list, to be dealt with right at the end of the checkpoint writes. We know that anyone waiting for the WAL lock will not be modifying the block and so we can copy it without obtaining the lock. We can inspect the lock queue on the WAL locks and then see which buffers we can skip the lock for. This could be handled similarly to the way we handle buffer pin deadlocks in Hot Standby. So I don't see any blockers from that angle. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 1/21/14, 6:46 PM, Andres Freund wrote: On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freundand...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. If we're not looking for perfection, what's wrong with Peter's idea of a ballast file? Presumably the check to see if that file still exists would be cheap so we can do that before entering the appropriate critical section. There's still a small chance that we'd end up panicking, but it's better than today. I'd argue that even if it doesn't work for CoW filesystems it'd still be a win. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the Redesigning checkpoint_segments thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6s4-ppd+6jiy4oq78i...@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments. The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. That normally requires a WAL insertion to complete, but it seems easy enough to make that happen by simply rewriting the control file, after which ALL WAL files are superfluous for crash recovery and can be deleted. Once that checkpoint is complete, we can begin deleting WAL files that are archived/replicated and continue as normal. The previously failing WAL write can now be made again and may succeed this time - if it does, we continue, if not - now we PANIC. Note that this would not require in-progress transactions to be aborted. 
They can continue normally once wal_buffers re-opens. We don't really want anything too drastic, because if this situation happens once it may happen many times - I'm imagining a flaky network etc.. So we want the situation to recover quickly and easily, without too many consequences. The above appears to be very minimal change from existing code and doesn't introduce lots of new points of breakage. I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. Doing shutdown checkpoints via the control file would exactly solve that issue. We already depend upon the readability of the control file anyway, so this changes nothing. (And if you regard it does, then we can have multiple control files, or at least a backup control file at shutdown). We can make the shutdown checkpoint happen always at EOF of a WAL segment, so at shutdown we don't need any WAL files to remain at all. So we need to somehow stop new WAL insertions from happening, before it's too late. I don't think we do. What might be sensible is to have checkpoints speed up as WAL volume approaches a predefined limit, so that we minimise the delay caused when wal_buffers locks up. Not suggesting anything here for 9.4, since we're mid-CF. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. And if that fails for lack of disk space? In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 21 January 2014 18:35, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. Lock up momentarily, until the situation clears. But my proposal would allow the situation to fully clear, i.e. all WAL files could be deleted as soon as replication/archiving has caught up. The current behaviour doesn't automatically correct itself as this proposal would. My proposal is also fully in line with synchronous replication, and has zero performance overhead for mainline processing. My preference would be that we simply start failing writes with ERRORs rather than PANICs. Yes, that is what I am proposing, amongst other points. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? Yes, I think so. Instead of PANICing, we should simply signal the checkpointer to perform a shutdown checkpoint. And if that fails for lack of disk space? I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space. In any case, what you're proposing sounds like a lot of new complication in a code path that is necessarily never going to be terribly well tested. It's the smallest amount of change proposed so far... I agree on the danger of untested code. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Fwiw I think all transactions lock up until space appears is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Greg Stark st...@mit.edu writes: Fwiw I think all transactions lock up until space appears is *much* better than PANICing. Often disks fill up due to other transient storage use, or people may have options to manually increase the amount of space. It's much better if the database just continues to function after that rather than needing to be restarted. Well, PANIC is certainly bad, but what I'm suggesting is that we just focus on getting that down to ERROR and not worry about trying to get out of the disk-shortage situation automatically. Nor do I believe that it's such a good idea to have the database freeze up until space appears rather than reporting errors. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Jeff Janes jeff.ja...@gmail.com writes: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to reserve that space, such that we always have at least sum(outstanding reservations) of available future WAL space; (3) release our reservation as part of the actual XLogInsert call. The problem here is that the reserve function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. In the past I've thought that one approach that would eliminate concerns about portably and efficiently knowing how much space is left on the pg_xlog filesystem is to have a ballast file. Under this scheme, perhaps XLogInsert() could differentiate between a soft and hard failure. Hopefully the reserve function you mentioned, which is still called at the same place, just before each critical section, thereby becomes cheap. Perhaps I'm just restating what you said, though. -- Peter Geoghegan
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 18:24:39 -0500, Tom Lane wrote: Jeff Janes jeff.ja...@gmail.com writes: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Well, the point is we'd have to somehow push detection of the problem to a point before the critical section that does the buffer changes and WAL insertion. Well, I think that's already hard for the heapam.c stuff, but doing that in xact.c seems fracking hairy. We can't simply stop in the middle of a commit and not continue, that'd often grind the system to a halt preventing cleanup. Additionally the size of the inserted record for commits is essentially unbounded, which makes it an especially fun case. The first idea that comes to mind is (1) estimate the XLOG space needed (an overestimate is fine here); (2) just before entering the critical section, call some function to reserve that space, such that we always have at least sum(outstanding reservations) available future WAL space; (3) release our reservation as part of the actual XLogInsert call. I think that's not necessarily enough. In a COW filesystem like btrfs or ZFS you really cannot give much guarantees about writes succeeding or failing, even if we were able to create (and zero) a new segment.
Even if you disregard that, we'd need to keep up with lots of concurrent reservations, looking a fair bit into the future. E.g. during a smart shutdown in a workload with lots of subtransactions, trying to reserve space might make the situation actually worse because we might end up trying to reserve the combined size of records. The problem here is that the reserve function would presumably need an exclusive lock, and would be about as much of a hot spot as XLogInsert itself is. Plus we'd be paying a lot of extra cycles to solve a corner case problem that, with all due respect, comes up pretty darn seldom. So probably we need a better idea than that. Yea, I don't think anything really safe is going to work without significant penalties. Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:24:39 -0500, Tom Lane wrote: Maybe we could get some mileage out of the fact that very approximate techniques would be good enough. For instance, I doubt anyone would bleat if the system insisted on having 10MB or even 100MB of future WAL space always available. But I'm not sure exactly how to make use of that flexibility. If we'd be more aggressive with preallocating WAL files and doing so in the WAL writer, we could stop accepting writes in some common codepaths (e.g. nodeModifyTable.c) as soon as preallocating failed but continue to accept writes in other locations (e.g. TRUNCATE, DROP TABLE). That'd still fail if you write a *very* large commit record using up all the reserve though... I personally think this isn't worth complicating the code for. I too have got doubts about whether a completely bulletproof solution is practical. (And as you say, even if our internal logic was bulletproof, a COW filesystem defeats all guarantees in this area anyway.) But perhaps a 99% solution would be a useful compromise. Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. Which cases of essentially unbounded record sizes do we currently have? The only ones that I remember right now are commit and abort records (including when wrapped in a prepared xact), containing a) the list of committing/aborting subtransactions, b) the list of files to drop, and c) cache invalidations. Hm. There's also xl_standby_locks, but that'd be easily splittable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. If we'd truncate it in concord with pg_clog, we'd only need to log the subxids which haven't been explicitly assigned. Unfortunately pg_subtrans will be bigger than pg_clog, making such a scheme likely to be painful. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. We could do something similar for the cache invalidations, but that seems likely to get rather ugly, as we'd need to hold ProcArrayLock till the next record is read, or until a shutdown or end-of-recovery record is read. If one of the latter is found before the corresponding invalidations, we'd need to invalidate the entire syscache.
Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 21 January 2014 23:01, Jeff Janes jeff.ja...@gmail.com wrote: On Tue, Jan 21, 2014 at 9:35 AM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 6 June 2013 16:00, Heikki Linnakangas hlinnakan...@vmware.com wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. I don't see we need to prevent WAL insertions when the disk fills. We still have the whole of wal_buffers to use up. When that is full, we will prevent further WAL insertions because we will be holding the WALWriteLock to clear more space. So the rest of the system will lock up nicely, like we want, apart from read-only transactions. I'm not sure that all writing transactions lock up hard is really so much better than the current behavior. My preference would be that we simply start failing writes with ERRORs rather than PANICs. I'm not real sure ATM why this has to be a PANIC condition. Probably the cause is that it's being done inside a critical section, but could we move that? My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are: 1. waiting; 2. aborting transactions; 3. some kind of release-locks-then-wait-and-retry. (3) is a step too far for me, even though it is easier than you say, since we write WAL before changing the data block, so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1), waiting for up to checkpoint_timeout, then (2), if we think that is a problem. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-22 01:18:36 +0100, Simon Riggs wrote: My understanding is that if it runs out of buffer space while in an XLogInsert, it will be holding one or more buffer content locks exclusively, and unless it can complete the xlog (or scrounge up the info to return that buffer to its previous state), it can never release that lock. There might be other paths where it could get by with an ERROR, but if no one can write xlog anymore, all of those paths must quickly converge to the one that cannot simply ERROR. Agreed. You don't say it but I presume you intend to point out that such long-lived contention could easily have a knock-on effect on other read-only statements. I'm pretty sure other databases work the same way. Our choices are: 1. waiting; 2. aborting transactions; 3. some kind of release-locks-then-wait-and-retry. (3) is a step too far for me, even though it is easier than you say, since we write WAL before changing the data block, so a failure to insert WAL could just result in a temporary drop lock, sleep and retry. I would go for (1), waiting for up to checkpoint_timeout, then (2), if we think that is a problem. How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: How are we supposed to wait while holding e.g. ProcArrayLock? Aborting transactions doesn't work either; that writes abort records, which can get significantly large. Yeah, that's an interesting point ;-). We can't *either* commit or abort without emitting some WAL, possibly quite a bit of WAL. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this kind of thing just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. I do not believe that the two acceptable outcomes here are either that we continue to always PANIC shutdown (i.e. do nothing), or promise to never PANIC shutdown. There is likely to be a third way, which is that the probability of a PANIC shutdown is, at the macro level, reduced somewhat from the present probability of 1.0. People are not going to develop a lackadaisical attitude about running out of disk space on the pg_xlog partition if we do so. They still have plenty of incentive to make sure that that doesn't happen. -- Peter Geoghegan
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 19:23:57 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 18:59:13 -0500, Tom Lane wrote: Another thing to think about is whether we couldn't put a hard limit on WAL record size somehow. Multi-megabyte WAL records are an abuse of the design anyway, when you get right down to it. So for example maybe we could split up commit records, with most of the bulky information dumped into separate records that appear before the real commit. This would complicate replay --- in particular, if we abort the transaction after writing a few such records, how does the replayer realize that it can forget about those records? But that sounds probably surmountable. I think removing the list of subtransactions from commit records would essentially require not truncating pg_subtrans after a restart anymore. I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. We could relatively easily split off logging the dropped files from commit records and log them in groups afterwards; we already have several races that allow leaking files. I was thinking the other way around: emit the subsidiary records before the atomic commit or abort record, indeed before we've actually committed. Part of the point is to reduce the risk that lack of WAL space would prevent us from fully committing. Replay would then involve either accumulating the subsidiary records in memory, or being willing to go back and re-read them when the real commit or abort record is seen.
Well, the reason I suggested doing it the other way round is that we wouldn't need to reassemble anything (outside of cache invalidations, which I don't know how to handle that way), which I think is a significant increase in robustness and decrease in complexity. Also, writing those records afterwards increases the risk of a post-commit failure, which is a bad thing. Well, most of those could be done outside of a critical section, possibly just FATALing out. Beats PANICing. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:23:57 -0500, Tom Lane wrote: I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 16:34:45 -0800, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 3:43 PM, Andres Freund and...@2ndquadrant.com wrote: I personally think this isn't worth complicating the code for. You're probably right. However, I don't see why the bar has to be very high when we're considering the trade-off between taking some emergency precaution against having a PANIC shutdown, and an assured PANIC shutdown. Well, the problem is that the tradeoff would very likely include making already complex code even more complex. None of the proposals, even the one just decreasing the likelihood of a PANIC, look like they'd end up being simple implementation-wise. And that additional complexity would hurt robustness and prevent things I find much more important than this. Heikki said somewhere upthread that he'd be happy with a solution that only catches 90% of the cases. That is probably a conservative estimate. The schemes discussed here would probably be much more effective than that in practice. Sure, you can still poke holes in them. For example, there has been some discussion of arbitrarily large commit records. However, this kind of thing just isn't that relevant in the real world. I believe that in practice the majority of commit records are all about the same size. Yes, realistically the boundary will be relatively low, but I don't think that that means that we can disregard issues like the possibility that a record might be bigger than wal_buffers. Not because it'd allow theoretical issues, but because it rules out several tempting approaches like e.g. extending the in-memory reservation scheme of Heikki's scalability work to handle this. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2014-01-21 19:45:19 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:23:57 -0500, Tom Lane wrote: I'm not suggesting that we stop providing that information! I'm just saying that we perhaps don't need to store it all in one WAL record, if instead we put the onus on WAL replay to be able to reconstruct what it needs from a series of WAL records. That'd likely require something similar to the incomplete actions used in btrees (and until recently in more places). I think that is/was a disaster I really don't want to extend. I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Greetings, Andres Freund
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Andres Freund and...@2ndquadrant.com writes: On 2014-01-21 19:45:19 -0500, Tom Lane wrote: I don't think that's a comparable case. Incomplete actions are actions to be taken immediately, and which the replayer then has to complete somehow if it doesn't find the rest of the action in the WAL sequence. The only thing to be done with the records I'm proposing is to remember their contents (in some fashion) until it's time to apply them. If you hit end of WAL you don't really have to do anything. Would that work for the promotion case as well? Afair there's the assumption that everything >= TransactionXmin can be looked up in pg_subtrans or in the procarray - which afaics wouldn't be the case with your scheme? And TransactionXmin could very well be below such an incomplete commit's xids afaics. Uh, what? The behavior I'm talking about is *exactly the same* as what happens now. The only change is that the data sent to the WAL file is laid out a bit differently, and the replay logic has to work harder to reassemble it before it can apply the commit or abort action. If anything outside replay can detect a difference at all, that would be a bug. Once again: the replayer is not supposed to act immediately on the subsidiary records. It's just supposed to remember their contents so it can reattach them to the eventual commit or abort record, and then do what it does today to replay the commit or abort. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote: (I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below's just my understanding): The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. The only other thing I can think of is: when a tablespace is offline, write WAL records to a separate tablespace recovery log as they're encountered. Replay this log when the tablespace is restored, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen.
Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby." Not pretty, but at least there's the option of using a sync standby and avoiding data loss. How would you approach this? Sorry to be replying late. You are right that you could record/apply WAL separately for offline tablespaces. The problem is that you could have logical ties from the offline tablespace to online tablespaces. For example, what happens if data in an online tablespace references a primary key in an offline tablespace? What if the system catalogs are stored in an offline tablespace? Right now, we allow logical bindings across physical tablespaces. To do what you want, you would really need to store each database in its own tablespace. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Peter Eisentraut pete...@gmx.net writes: I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases. And provide either fully reliable contrib scripts or internal archive commands ready to use for those common cases. I can't think of other common use cases, by the way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk - is very difficult. I know there are the following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either. 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas san told me here, is to do something like "cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f". 2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. The archive_command script needs to catch SIGQUIT and exit. 3. You cannot know the reason of an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
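The copy-then-rename pattern and the SIGQUIT point from the message above can be sketched as a small script. Everything here (paths, file names, the `archive_one` helper) is illustrative, not a recommended implementation:

```shell
#!/bin/sh
# Sketch of a safer archive_command: copy to a temporary name first, then
# rename, so a crash mid-copy never leaves a short file under the final
# name (rename within one filesystem is atomic).
# Trap SIGQUIT (sent by postmaster on pg_ctl stop -mi) so we exit cleanly
# instead of dumping core into the data directory.
trap 'exit 1' QUIT

archive_one() {
    src=$1 dir=$2 file=$3
    cp "$src" "$dir/$file.tmp" || return 1
    mv "$dir/$file.tmp" "$dir/$file"
}

# demo: a scratch directory stands in for %p and the archive area
work=$(mktemp -d)
mkdir "$work/archive"
printf 'fake WAL segment' > "$work/seg"
archive_one "$work/seg" "$work/archive" 000000010000000000000001
ls "$work/archive"
```

In a real archive_command this would be invoked as something like `archive_one %p /archive/dir %f`, returning non-zero on failure so the server retries.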
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools. But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. That, or provide a standard archive command that takes the directory as argument? I bet we have tons of those available among us users...
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools.
And there's another example of why we need an archive command: I'm just setting up pgpool replication on Amazon AWS. I'm sending WAL archives to an S3 bucket, which doesn't appear as a directory on the server. From: http://www.pgpool.net/pipermail/pgpool-general/2013-June/001851.html -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Jun 12, 2013 4:56 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jun 8, 2013 at 10:36 AM, MauMau maumau...@gmail.com wrote: I hope PostgreSQL will provide a reliable archiving facility that is ready to use. +1. I think we should have a way to set an archive DIRECTORY, rather than an archive command. And if you set it, then PostgreSQL should just do all of that stuff correctly, without any help from the user. Wouldn't that encourage people to do local archiving, which is almost always a bad idea? I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... Of course, some users will want to archive to a remote machine via ssh or rsync or what-have-you, and those users will need to provide their own tools.
But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. /Magnus
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 7:20 PM, Jeff Janes jeff.ja...@gmail.com wrote: If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple minutes to give archiving a chance to do its things? Doing that while holding onto locks and a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable. Blech. I think that's setting our standards pretty low. It would neither be possible to use the system nor to shut it down cleanly; I think the effect would be to turn an immediate PANIC into a slightly-delayed PANIC, possibly accompanied by some DBA panic. It seems to me that there are two general ways of approaching this problem. 1. Discover sooner that we're out of space. Once we've modified the buffer and entered the critical section, it's too late to have second thoughts about completing the operation. If we could guarantee prior to modifying the buffers that enough WAL space was present to store the record we're about to write, then we'd be certain not to fail for this reason. In theory, this is simple: keep track of how much uncommitted WAL space we have. Increment the value when we create new WAL segments and decrement it by the size of the WAL record we plan to write. In practice, it's not so simple. We don't know whether we're going to emit FPIs until after we enter the critical section, so the size of the record can't be known precisely early enough. We could think about estimating the space needed conservatively and truing it up occasionally. However, there's a second problem: the available-WAL-space counter would surely become a contention point. Here's a sketch of a possible solution. Suppose we know that an individual WAL record can't be larger than, uh, 64kB. I'm not sure there is a limit on the size of a WAL record, but let's say there is, or we can install one, at around that size limit. 
Before we enter a critical section that's going to write a WAL record, we verify that the amount of WAL space remaining is at least 64kB * MaxBackends. If it's not, we embark on a series of short sleeps, rechecking after each one; if we hit some time limit, we ERROR out. As long as every backend checks this before every WAL record, we can always be sure there will be at least 64kB left for us. With this approach, the shared variable that stores the amount of WAL space remaining only needs to be updated under WALInsertLock; the reservation step only involves a read. That might be cheap enough not to matter. 2. Recover from the fact that we ran out of space by backing out the changes to shared buffers. Initially, I thought this might be a promising approach: if we've modified any shared buffers and discover that we can't log the changes, just invalidate the buffers! Of course, it doesn't work, because the buffer might have already been dirty when we locked it. So we'd actually need a way to reverse out all the changes we were about to log. That's probably too expensive to contemplate, in general; and the code would likely get so little testing as to invite bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 11:32 AM, Magnus Hagander mag...@hagander.net wrote: Wouldn't that encourage people to do local archiving, which is almost always a bad idea? Maybe, but refusing to improve the UI because people might then use the feature seems wrong-headed. I'd rather improve the experience with pg_receivexlog or another way that does remote archiving... Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Right now you have to be a rocket scientist no matter what configuration you're running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 6/12/13 10:55 AM, Robert Haas wrote: But it's got to be pretty common to archive to a local path that happens to be a remote mount, or to a local directory whose contents are subsequently copied off by a batch job. Making that work nicely with near-zero configuration would be a significant advance. Doesn't that just move the problem to managing NFS or batch jobs? Do we want to encourage that? I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases.
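One way to read "fully specify and/or script what to do in each of these cases" is a single wrapper with a small set of named methods. The method names and argument layout below are assumptions for illustration only:

```shell
#!/bin/sh
# Hypothetical wrapper covering a few common archive methods. Only the
# branch actually selected needs its tool installed.
archive_via() {
    method=$1 src=$2 dest=$3
    case "$method" in
        local) cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest" ;;
        rsync) rsync --quiet --archive "$src" "$dest" ;;
        scp)   scp -q "$src" "$dest" ;;
        *)     echo "unknown archive method: $method" >&2; return 1 ;;
    esac
}

# demo of the local method
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_via local "$work/seg" "$work/archived" && echo ok
```

An S3 or NFS branch would slot into the same `case`; the point is that each branch can encode the full copy-then-rename discipline once, instead of every site reinventing it.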
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 12:07 PM, Peter Eisentraut pete...@gmx.net wrote: I suspect that there are actually only about 5 or 6 common ways to do archiving (say, local, NFS, scp, rsync, S3, ...). There's no reason why we can't fully specify and/or script what to do in each of these cases. Go for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/12/2013 08:49 AM, Robert Haas wrote: Sure, remote archiving is great, and I'm glad you've been working on it. In general, I think that's a cleaner approach, but there are still enough people using archive_command that we can't throw them under the bus. Correct. I guess archiving to an NFS mount or so isn't too bad, but archiving and using a cronjob to get the files off is typically a great way to lose data, and we really shouldn't encourage that by default, IMO. Certainly not by default, but it is also something that can be easy to set up reliably if you know what you are doing. Well, I think what we're encouraging right now is for people to do it wrong. The proliferation of complex tools to manage this process suggests that it is not easy to manage without a complex tool. No. It suggests that people have more than one requirement that the project WILL NEVER be able to solve. Granted we have solved some of them, for example pg_basebackup. However, pg_basebackup isn't really useful for a large database. Multithreaded rsync is much more efficient. That's a problem. And we regularly have users who discover, under a variety of circumstances, that they've been doing it wrong. If there's a better solution than hard-wiring some smarts about local directories, I'm all ears - but making the simple case just work would still be better than doing nothing. Agreed. Right now you have to be a rocket scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is OK because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it.
When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Wed, Jun 12, 2013 at 6:03 PM, Joshua D. Drake j...@commandprompt.com wrote: Right now you have to be a rocket scientist no matter what configuration you're running. This is quite a bit overblown. Assuming your needs are simple, archiving, as it is now, is a relatively simple process to set up, even without something like PITRTools. Where we run into trouble is when they aren't, and that is OK because we can't solve every problem. We can only provide tools for others to solve their particular issue. What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? Yes, we need to make it reliable. We don't need to be the Nanny database. More than easy, it should be obvious. Obvious doesn't mean easy; it just means that what you have to do to get it right is clearly in front of you. When you give people the freedom of an archive command, you also take away any guidance more restrictive options give. I think the point here is that a default would guide people in how to make this work reliably, without having to rediscover it every time. A good, *obvious* (not easy) default. Even cp blah to an NFS mount is obvious, while not easy (setting up NFS through firewalls is never easy). So, having archive utilities in place of cp would ease the burden of administration, because it'd be based on collective knowledge. Some pg_cp (or more likely pg_archive_wal) could check there's enough space, and whatever else collective knowledge decided is necessary.
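A hypothetical `pg_archive_wal`-style helper of the sort suggested above might look like this. The name, the free-space floor, and the argument layout are all made up for illustration:

```shell
#!/bin/sh
# Check free space in the archive area before copying, so we fail with a
# clear message instead of leaving a truncated segment behind.
archive_checked() {
    src=$1 dest_dir=$2 file=$3
    min_free_kb=1024                   # illustrative floor; tune in practice
    avail_kb=$(df -Pk "$dest_dir" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$min_free_kb" ]; then
        echo "archive area low on space (${avail_kb}kB free)" >&2
        return 1
    fi
    cp "$src" "$dest_dir/$file.tmp" && mv "$dest_dir/$file.tmp" "$dest_dir/$file"
}

# demo against a scratch directory
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_checked "$work/seg" "$work" seg.done && echo archived
```

Because the check happens before the copy, a full archive area produces a non-zero exit (so the server keeps retrying) rather than a short file.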
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Craig Ringer cr...@2ndquadrant.com The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested. Although I still lack understanding of PostgreSQL implementation, I have an optimistic feeling that such complexity would not be required. While a tablespace is offline, subsequent access from new or existing transactions is rejected with an error tablespace is offline. So new WAL records would not be generated for the offline tablespace. To take the tablespace back online, the DBA performs per-tablespace archive recovery. Per-tablespace archive recovery restores tablespace data files from the backup, then read through archive and pg_xlog/ WAL as usual, and selectively applies WAL records for the tablespace. I don't think it's a must-be-fixed problem that the WAL for all tablespaces is mixed in one location. I suppose we can tolerate that archive recovery takes a long time. To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc. 
Per-tablespace WAL is very interesting for another reason -- massive-scale OLTP for database consolidation. This feature would certainly be a breakthrough for amazing performance, because WAL is usually the last bottleneck in OLTP. Yes, I can imagine recovery would be much, much more complicated. None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment. As I mentioned above, in my shallow understanding, it seems that the additional complexities can be controlled. The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby. Not pretty, but at least there's the option of using a sync standby and avoiding data loss. Sync standby... maybe. Let me consider this. How would you approach this? Thanks Craig, you gave me some interesting insights. All of these topics are interesting, and I'd like to work on them when I have acquired enough knowledge and experience in PostgreSQL development. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Josh, Daniel, Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance for which there are no packages, or you might accidentally cause your database to shut down in the middle of the night. This is an outright falsehood. We are telling them, You better know what you are doing or You should call a consultant. This is no different than, You better know what you are doing or You should take driving lessons. What I'm pointing out is that there is no simple case for archiving the way we have it set up. That is, every possible way to deploy PITR for Postgres involves complex, error-prone configuration, setup, and monitoring. I don't think that's necessary; simple cases should have simple solutions. If you do a quick survey of pgsql-general, you will see that the issue of databases shutting down unexpectedly due to archiving running them out of disk space is a very common problem. People shouldn't be afraid of their backup solutions. I'd agree that one possible answer for this is to just get one of the external tools simplified, well-packaged, distributed, instrumented for common monitoring systems, and referenced in our main documentation. I'd say Barman is the closest to a simple solution for the simple common case, at least for PITR. I've been able to give some clients Barman and have them deploy it themselves. This isn't true of the other tools I've tried. Too bad it's GPL, and doesn't do archiving-for-streaming. I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves?
And where are these archivers? If WAL archiving is your *second* level of redundancy, you will generally be willing to have it break rather than interfere with the production workload. This is particularly the case if you're using archiving just as a backup for streaming replication. Heck, I've had one client where archiving was being used *only* to spin up staging servers, and not for production at all; do you think they wanted production to shut down if they ran out of archive space (which it did)? I'll also point out that archiving can silently fail for a number of reasons having nothing to do with safety options, such as an NFS mount in Linux silently going away (I've also had this happen), or network issues causing file corruption. Which just points out that we need better ways to detect gaps/corruption in archiving. Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users about what to do when we run out of WAL space. And the stop archiving option needs to be there for users, as well as the shut down option, *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:07 AM, Joshua D. Drake j...@commandprompt.com wrote: On 06/08/2013 07:36 AM, MauMau wrote: 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. Should that be changed? If the file is 16MB but it turns to gibberish after 3MB, recovery proceeds up to the gibberish. Given that, why should it refuse to start if the file is only 3MB to start with? The solution, which IIRC Tomas-san told me here, is to do something like: cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f. This will overwrite /archive/dir/%f if it already exists, which is usually recommended against. Although I don't know that it necessarily should be. One common problem with archiving is for a network glitch to occur during the archive command, so the archive command fails and tries again later. But the later tries will always fail, because the target was created before/during the glitch. Perhaps a more full-featured archive command would detect and rename an existing file, rather than either overwriting it or failing. If we have no compunction about overwriting the file, then I don't see a reason to use the cp + mv combination. If the simple cp fails to copy the entire file, it will be tried again until it succeeds. Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation. 3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.
Wait, what? Is this true (someone else?) It is kind of true. PostgreSQL does not automatically arrange for the stderr of the archive_command to be sent to syslog. But archive_command can do whatever it wants, including arranging for its own failure messages to go to syslog. Cheers, Jeff
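Two of the ideas above, tolerating a retry after a glitch without blindly overwriting, and having archive_command report its own failures to syslog, can be combined in a sketch like this (names, paths, and the logger tag are illustrative):

```shell
#!/bin/sh
# If the target already exists with identical bytes, a previous attempt
# actually succeeded, so report success; if it differs, refuse to clobber
# it and say so via logger(1) ourselves, since the server won't log to
# syslog on our behalf.
archive_one() {
    src=$1 dest=$2
    if [ -e "$dest" ]; then
        cmp -s "$src" "$dest" && return 0   # earlier attempt already landed
        logger -t archive_command "refusing to overwrite differing $dest" || true
        return 1
    fi
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}

# demo: the second call simulates a retry after a network glitch
work=$(mktemp -d)
printf 'segment' > "$work/seg"
archive_one "$work/seg" "$work/archived"   # first attempt
archive_one "$work/seg" "$work/archived"   # retry: same bytes, succeeds
echo "retry status: $?"
```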
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 11:59 AM, Josh Berkus j...@agliodbs.com wrote: Anyway, what I'm pointing out is that this is a business decision, and there is no way that we can make a decision for the users what to do when we run out of WAL space. And that the stop archiving option needs to be there for users, as well as the shut down option. *without* requiring users to learn the internals of the archiving system to implement it, or to know the implied effects of non-obvious PostgreSQL settings. I don't doubt this; that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-WAL-from-archiving-whenever is not a good one. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad hoc and all over, so it's not as though I wanted to suggest that a Postgres feature to this effect was an anti-feature. And, as I wrote before, it's much easier to teach an external system to drop WAL than it is to teach Postgres to attenuate, hence the repeated correspondence from my fellows and myself about the attenuation side of the equation. Hope that clears things up about where I stand on the matter.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Daniel, Jeff, I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. You may have noticed I also wrote that a neater, common way to drop WAL when under pressure might be nice, to avoid having it ad-hoc and all over, so it's not as though I wanted to suggest a Postgres feature to this effect was an anti-feature. Yep. Drake was saying it was an anti-feature, though, so I was arguing with him. Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that. Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation. If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 4:42 PM, Josh Berkus j...@agliodbs.com wrote: Daniel, Jeff, I don't doubt this, that's why I do have a no-op fallback for emergencies. The discussion was about defaults. I still think that drop-wal-from-archiving-whenever is not a good one. Yeah, we can argue defaults for a long time. What would be better is some way to actually determine what the user is trying to do, or wants to happen. That's why I'd be in favor of an explicit setting; if there's a setting which says: on_archive_failure=shutdown ... then it's a LOT clearer to the user what will happen if the archive runs out of space, even if we make no change to the defaults. And if that setting is changeable on reload, it even becomes a way for users to get out of tight spots. I like your suggestion, save one thing: it's not a 'failure' of archiving if it cannot keep up, provided one subscribes to the view that archiving is not elective. I nitpick at this because one might think this has something to do with a non-zero return code from the archiving program, which already has a pretty alarmist message in the event of transient failures (I think someone brought this up on -hackers a few months ago... can't remember if that resulted in a change). I don't have a better suggestion that is less jargonrific though, but I wanted to express my general appreciation as to the shape of the suggestion.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/10/2013 04:42 PM, Josh Berkus wrote:
>> Actually we describe what archive_command needs to fulfill, and tell them to use something that accomplishes that. The example with cp is explicitly given as an example, not a recommendation.
>
> If we offer cp as an example, we *are* recommending it. If we don't recommend it, we shouldn't have it as an example. In fact, if we don't recommend cp, then PostgreSQL should ship with some example shell scripts for archive commands, just as we do for init scripts.

Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support.

JD

-- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
> Not a bad idea. One that supports rsync and another that supports robocopy. That should cover every platform we support.

Example script:

=

#!/usr/bin/env bash
# Simple script to copy WAL archives from one server to another
# to be called as archive_command (call this as wal_archive %p %f)

# Settings. Please change the below to match your configuration.

# holding directory for the archived log files on the replica
# this is NOT pg_xlog:
WALDIR=/var/lib/pgsql/archive_logs

# touch file to shut off archiving in case of filling up the disk:
NOARCHIVE=/var/lib/pgsql/NOARCHIVE

# replica IP, IPv6 or DNS address:
REPLICA=192.168.1.3

# put any special SSH options here,
# and the location of RSYNC:
export RSYNC_RSH=ssh
RSYNC=/usr/bin/rsync

## DO NOT CHANGE THINGS BELOW THIS LINE ##

SOURCE=$1   # %p
FILE=$2     # %f
DEST=${WALDIR}/${FILE}

# See whether we want all archiving off
test -f ${NOARCHIVE} && exit 0

# Copy the file to the spool area on the replica, error out if
# the transfer fails
${RSYNC} --quiet --archive --rsync-path=${RSYNC} ${SOURCE} \
    ${REPLICA}:${DEST}
if [ $? -ne 0 ]; then
    exit 1
fi

=

-- Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 08:32 AM, MauMau wrote:
> - Failure of a disk containing data directory or tablespace
>
> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown.

... which is why tablespaces aren't disposable, and why creating a tablespace in a RAM disk is such an awful idea.

I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.

I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html that adds a prominent warning like:

WARNING: Every tablespace must be present before the database can be started. There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage, etc. *Do not put a tablespace on a RAM disk*; instead just use UNLOGGED tables.

(Opinions on the above?)

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/09/2013 03:02 AM, Jeff Janes wrote:
> It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.

Smarter syslog flavours like rsyslog certainly do this.

No alert system triggered by events within Pg will ever be fully sufficient. "Oops, the postmaster crashed with stack corruption, I'll just exec whatever's in this on_panic_exec GUC (if I can still read it and it's still valid) to hopefully tell the sysadmin about my death." Hmm, sounds as reliable and safe as a bicycle powered by a home-made rocket engine.

External monitoring is IMO always necessary. Something like Icinga with check_postgres can trivially poke Pg to make sure it's alive. It can also efficiently check the 'pg_error.log' from rsyslog that contains only severe errors and raise alerts if it doesn't like what it sees.

If I'm already doing external monitoring (which is necessary as outlined above) then I see much less point having Pg able to raise alerts for problems, and am more interested in better built-in functions and views for exposing Pg's state. Easier monitoring of WAL build-up, ways to slow the master if async replicas or archiving are getting too far behind, etc.

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
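The severe-errors-only log file described above could be arranged with ordinary syslog selectors, roughly like this (illustrative only; it assumes Pg is configured with log_destination = 'syslog' and syslog_facility = 'LOCAL0', and the file paths are made up):

```
# Hypothetical rsyslog rules matching the setup described above:
# route only PostgreSQL messages of severity err or worse to a
# dedicated file that the monitoring system watches.
local0.err      /var/log/pg_error.log
# Everything else from postgres goes to the normal log:
local0.*        /var/log/postgresql.log
```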
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-08 13:26:56 -0700, Joshua D. Drake wrote:
>> At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.
>
> Does this preclude (sorry, I don't know this part of the code very well) my suggestion of "on log create"?

Well, yes. We create log segments some layers below XLogInsert() if necessary, and as I said above, we're in a critical section at that point, so just rolling back isn't one of the options.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Craig Ringer cr...@2ndquadrant.com
> On 06/09/2013 08:32 AM, MauMau wrote:
>> - Failure of a disk containing data directory or tablespace
>>
>> If checkpoint can't write buffers to disk because of disk failure, checkpoint cannot complete, thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to postgres shutdown.
>
> I've seen a couple of people bitten by the misunderstanding that tablespaces are a way to split up your data based on different reliability requirements, and I really need to write a docs patch for http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html that adds a prominent warning like:
>
> WARNING: Every tablespace must be present before the database can be started. There is no easy way to recover the database if a tablespace is lost to disk failure, deletion, use of volatile storage, etc. *Do not put a tablespace on a RAM disk*; instead just use UNLOGGED tables.
>
> (Opinions on the above?)

Yes, I'm sure this is useful for DBAs to know how postgres behaves and take some preparations. However, this does not apply to my case, because I'm using tablespaces for I/O distribution across multiple disks and simply for database capacity. The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown.

> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.

I'm in favor of taking a tablespace offline when I/O failure is encountered, and continuing to run the database server. But WAL must not be discarded, because committed transactions must be preserved for durability of ACID. Postgres needs to take these steps when it encounters an I/O error:

1. Take the tablespace offline, so that subsequent reads/writes against it return an error without actually issuing read/write against data files.
2. Discard shared buffers containing data in the tablespace.

WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

Regards
MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/10/2013 06:39 AM, MauMau wrote:
> The problem is that the reliability of the database system decreases with more disks, because failure of any one of those disks would result in a database PANIC shutdown.

More specifically, with more independent sets of disks / file systems.

>> I'd rather like to be able to recover from this by treating the tablespace as dead, so any attempt to get a lock on any table within it fails with an error and already-in-WAL writes to it just get discarded. It's the sort of thing that'd only be reasonable to do as a recovery option (like zero_damaged_pages) since if applied by default it'd lead to potentially severe and unexpected data loss.
>
> I'm in favor of taking a tablespace offline when I/O failure is encountered, and continue running the database server. But WAL must not be discarded because committed transactions must be preserved for durability of ACID.

[snip]

> WAL is not affected by the offlining of tablespaces. WAL records already written on the WAL buffer will be written to pg_xlog/ and archived as usual. Those WAL records will be used to recover committed transactions during archive recovery.

(I'm still learning the details of Pg's WAL, WAL replay and recovery, so the below is just my understanding.)

The problem is that WAL for all tablespaces is mixed together in the archives. If you lose your tablespace then you have to keep *all* WAL around and replay *all* of it again when the tablespace comes back online. This would be very inefficient, and would require a lot of tricks to cope with applying WAL to a database that has an on-disk state in the future as far as the archives are concerned. It's not as simple as just replaying all WAL all over again - as I understand it, things like CLUSTER or TRUNCATE will result in relfilenodes not being where they're expected to be as far as old WAL archives are concerned. Selective replay would be required, and that leaves the door open to all sorts of new and exciting bugs in areas that'd hardly ever get tested.

To solve the massive disk space explosion problem I imagine we'd have to have per-tablespace WAL. That'd cause a *huge* increase in fsync costs and loss of the rather nice property that WAL writes are nice sequential writes. It'd be complicated and probably cause nightmares during recovery, for archive-based replication, etc.

The only other thing I can think of is: when a tablespace is offline, write WAL records to a separate tablespace recovery log as they're encountered. Replay this log when the tablespace is restored, before applying any other new WAL to the tablespace. This wouldn't affect archive-based recovery since it'd already have the records from the original WAL.

None of these options seem exactly simple or pretty, especially given the additional complexities that'd be involved in allowing WAL records to be applied out-of-order, something that AFAIK _never_ happens at the moment.

The key problem, of course, is that this all sounds like a lot of complicated work for a case that's not really supposed to happen. Right now, the answer is "your database is unrecoverable, switch to your streaming warm standby and re-seed it from the standby." Not pretty, but at least there's the option of using a sync standby and avoiding data loss.

How would you approach this?

-- Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Daniel Farina dan...@heroku.com
> On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote:
>> Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night." At which point most sensible users say "no thanks, I'll use something else."
>
> Inverted and just as well supported: if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features. This being but the hypothetical first one. Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.

Yes, I feel designing reliable archiving, even for the simplest case - copy WAL to disk - is very difficult. I know there are the following three problems if you just follow the PostgreSQL manual. Average users won't notice them. I guess even professional DBAs migrating from other DBMSs won't, either.

1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas-san told me here, is to do something like:

cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f

2. archive_command dumps core when you run pg_ctl stop -mi. This is because postmaster sends SIGQUIT to all its descendants. The core files accumulate in the data directory, which will be backed up with the database. Of course those core files are garbage. The archive_command script needs to catch SIGQUIT and exit.

3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.

I hope PostgreSQL will provide a reliable archiving facility that is ready to use.

Regards
MauMau
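The copy-then-rename pattern from point 1 can be wrapped in a small helper. This is a sketch only: the function name (archive_one) and the ARCHIVE_DIR variable are invented for illustration, and a real archive_command script would add the SIGQUIT handling from point 2:

```shell
# Hypothetical helper illustrating the copy-then-rename pattern above.
# A script built around it would be called from postgresql.conf as e.g.:
#   archive_command = 'safe_archive.sh %p %f'
archive_one() {
    src="$1"                    # %p: path to the WAL segment
    dest="$ARCHIVE_DIR/$2"      # %f: file name only

    # Refuse to overwrite an already-archived segment.
    [ -e "$dest" ] && return 1

    # Copy under a temp name, then rename, so a crash mid-copy never
    # leaves a truncated file under the final segment name.
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

The rename is what makes this safe: a partially copied `.tmp` file is never mistaken for a complete segment during archive recovery.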
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/07/2013 12:14 PM, Josh Berkus wrote:
> Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night."

This is an outright falsehood. We are telling them, "You better know what you are doing" or "You should call a consultant". This is no different than, "You better know what you are doing" or "You should take driving lessons".

> At which point most sensible users say "no thanks, I'll use something else."

Josh, I have always admired your flair for dramatics; it almost rivals mine. Users are free to use what they want; some will choose lesser databases. I am ok with that because eventually, if PostgreSQL is the right tool, they will come back to us and PgExperts or CMD or OmniTI, or they will know what they are doing and thus don't need us.

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 07:36 AM, MauMau wrote:
> 1. If the machine or postgres crashes while archive_command is copying a WAL file, later archive recovery fails. This is because cp leaves a file of less than 16MB in the archive area, and postgres refuses to start when it finds such a small archived WAL file. The solution, which IIRC Tomas-san told me here, is to do something like: cp %p /archive/dir/%f.tmp && mv /archive/dir/%f.tmp /archive/dir/%f

Well it seems to me that one of the problems here is we tell people to use copy. We should be telling people to use a command (or supply a command) that is smarter than that.

> 3. You cannot know the reason for an archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog.

Wait, what? Is this true (someone else?)

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
> I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench.

Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks, and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past, but that doesn't mean that their usefulness has ended.

The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid, but I also don't think it is necessary.

To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

Sincerely,
Joshua D. Drake
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote:
> To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.

You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid.

At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-07 12:02:57 +0300, Heikki Linnakangas wrote:
> On 07.06.2013 00:38, Andres Freund wrote:
>> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
>>> * Heikki Linnakangas wrote:
>>>> The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can [...] So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in [...]
>>>
>>> There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency.
>>
>> That's not a bad technique. I wonder how reliable it would be in postgres.
>
> That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section.

Well, if you have 16 or 32MB of reserved WAL space available you don't need to judge all that precisely how much space is available. So we can just sprinkle some EnsureXLogHasSpace() on XLogInsert() callsites like heap_insert(), but we can do that outside of the critical sections, and we can do it without locks since there needs to happen quite some write activity to overrun the reserved space. Anything that desperately needs to write stuff, like the end of recovery checkpoint, can just not call EnsureXLogHasSpace() and rely on the reserved space.

Seems like 90% of the solution for 30% of the complexity or so.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
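The shape being proposed can be sketched in a few lines of C. All names here are illustrative - EnsureXLogHasSpace() is not an actual PostgreSQL function, and a static atomic stands in for what would really be a flag in shared memory updated by the WAL writer:

```c
#include <stdbool.h>
#include <stdatomic.h>

/* Stand-in for a shared-memory flag; set by the WAL writer when the
 * amount of preallocated WAL space drops below the reserve threshold. */
static atomic_bool xlog_space_low;

/* Called by backends at XLogInsert() callsites, outside any critical
 * section: a single lock-free shared-memory read, no filesystem access. */
static bool
EnsureXLogHasSpace(void)
{
    return !atomic_load_explicit(&xlog_space_low, memory_order_relaxed);
}

/* Called periodically by the WAL writer after checking free space
 * (hypothetical threshold logic). */
static void
UpdateXLogSpaceFlag(long free_bytes, long reserve_bytes)
{
    atomic_store_explicit(&xlog_space_low,
                          free_bytes < reserve_bytes,
                          memory_order_relaxed);
}
```

The point of the design is the division of labor: the expensive space check runs in one background process, while the per-insert cost for every backend is a relaxed atomic load, which is why it can sit on hot paths like heap_insert() without measurable overhead.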
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote:
>> The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people.
>
> You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of.

I think the database engine is about the last thing which would have full knowledge of the best way to contact the DBA, especially during events which by definition mean things are already going badly. I certainly don't see having core code which knows how to talk to every PBX, SMS, email system, or twitter feed that anyone might wish to use for logging.

> PostgreSQL already supports two formats of text logs, plus syslog and eventlog. Is there some additional logging management tool that we could support which is widely used, doesn't require an expensive consultant to set up and configure correctly (or even to decide what "correctly" means for the given situation), and which solves 80% of the problems?

It would be nice to have the ability to specify multiple log destinations with different log_min_messages for each one. I'm sure syslog already must implement some kind of method for doing that, but I've been happy enough with the text logs that I've never bothered to look into it much.

> While that's the only solution we have for now, it's hardly a worthy design goal. Right now, what we're telling users is "You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night." At which point most sensible users say "no thanks, I'll use something else."

What does the something else do? Hopefully it is not to silently invalidate your backups.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:15 AM, Joshua D. Drake j...@commandprompt.com wrote:
> On 06/06/2013 07:52 AM, Heikki Linnakangas wrote:
>> I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench.
>
> Once upon a time in a land far, far away, we expected users to manage their own systems. We had things like soft and hard quotas on disks and lastlog to find out who was logging into the system. Alas, as far as I know soft and hard quotas are kind of a thing of the past but that doesn't mean that their usefulness has ended. The idea that we PANIC is not just awful, it is stupid. I don't think anyone is going to disagree with that. However, there is a question of what to do instead. I think the idea of sprinkling checks into the higher level code before specific operations is not invalid but I also don't think it is necessary.

Given that the system is going to become unusable, I don't see why PANIC is an awful, stupid way of doing it. And if it can only be used for things that don't generate WAL, that is pretty much unusable, as even read-only transactions often need to do clean-up tasks that generate WAL.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 11:27 AM, Andres Freund wrote:
> On 2013-06-08 11:15:40 -0700, Joshua D. Drake wrote:
>> To me, a more pragmatic approach makes sense. Obviously having some kind of code that checks the space makes sense, but I don't know that it needs to be around any operation other than "we are creating a segment". What do we care why the segment is being created? If we don't have enough room to create the segment, the transaction rolls back with some OBVIOUS not OBTUSE error. Obviously this could cause a ton of transactions to roll back, but I think keeping the database consistent and rolling back a transaction in case of error is exactly what we are supposed to do.
>
> You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid.

Yes, I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICs. I apologize if that came out wrong.

> At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

Does this preclude (sorry, I don't know this part of the code very well) my suggestion of "on log create"?

JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 7 June 2013 10:02, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 07.06.2013 00:38, Andres Freund wrote:
>> On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
>>> * Heikki Linnakangas wrote:
>>>> The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can [...] So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in [...]
>>>
>>> There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency.
>>
>> That's not a bad technique. I wonder how reliable it would be in postgres.
>
> That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section.

I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now.

Christian's idea seems good to me. Looks like you could be dismissing this too early, especially since no better idea has emerged. I doubt that we're going to think of something others didn't already face. It's pretty clear that most other DBMSs do this with their logs.

For your fast WAL insert patch to work well, we need a simple and fast technique to detect out-of-space.

-- Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Sat, Jun 8, 2013 at 11:27 AM, Andres Freund and...@2ndquadrant.com wrote:
> You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid. At the points where the XLogInsert()s happen we're in critical sections out of which we *cannot* ERROR out because we already may have made modifications that cannot be allowed to be performed partially/unlogged. That's why we're throwing a PANIC which will force a cluster wide restart including *NOT* writing any further buffers from s_b out.

If archiving is on and failure is due to no space, could we just keep trying XLogFileInit again for a couple of minutes to give archiving a chance to do its thing? Doing that while holding onto locks and a critical section would be unfortunate, but if the alternative is a PANIC, it might be acceptable.

The problem is that even if the file is only being kept so it can be archived, once archiving succeeds I think the file is not removed immediately, but rather not until the next checkpoint, which will never happen while the locks are still held.

Cheers, Jeff
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Joshua D. Drake j...@commandprompt.com On 06/08/2013 07:36 AM, MauMau wrote: 3. You cannot know the reason for archive_command failure (e.g. archive area full) if you don't use PostgreSQL's server logging. This is because archive_command failure is not logged in syslog/eventlog. Wait, what? Is this true (someone else?) I'm sorry, please let me correct myself. What I meant is: the exact reason for archive_command failure (e.g. cp's output) is not logged in syslog/eventlog. The fact that archive_command failed is itself recorded. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Joshua D. Drake j...@commandprompt.com On 06/08/2013 11:27 AM, Andres Freund wrote: You know, the PANIC isn't there just because we like to piss off users. There are actual technical reasons that don't just go away by judging the PANIC as stupid. Yes, I know we aren't trying to piss off users. What I am saying is that it is stupid to the user that it PANICs. I apologize if that came out wrong. I've experienced PANIC shutdowns several times due to a full WAL disk during development. As a user, one problem with PANIC is that it dumps core. I think core files should be dumped only when a can't-happen event has occurred, i.e. a PostgreSQL bug. I didn't expect postgres to dump core simply because the disk is full. I want postgres to shut down with a FATAL message in that exact case. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
From: Josh Berkus j...@agliodbs.com There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I think there is one more case. Is this correct? - Failure of a disk containing the data directory or a tablespace If a checkpoint can't write buffers to disk because of disk failure, the checkpoint cannot complete, and thus WAL files accumulate in pg_xlog/. This means that one disk failure will lead to a postgres shutdown. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. This sounds very scary. Is it possible to complete recovery and start up the postmaster with either or both of the following modifications? [Idea 1] During recovery, force archiving of a WAL file and delete/recycle it in pg_xlog/ as soon as its contents are applied. [Idea 2] During recovery, when disk full is encountered at the end-of-recovery checkpoint, force archiving of all unarchived WAL files and delete/recycle them at that time. Regards MauMau
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 10:00 PM, Heikki Linnakangas wrote: I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything. There were a couple of dba.stackexchange.com reports along the same lines recently, too. Both involved an antivirus vendor's VM appliance with a canned (stupid) configuration that set wal_keep_segments too high for the disk space allocated and stored WAL on a separate partition. People are having issues with WAL space management in the real world, and I think it and autovacuum are the two hardest things for most people to configure and understand in Pg right now. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/08/2013 10:57 AM, Daniel Farina wrote: At which point most sensible users say no thanks, I'll use something else. [snip] I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. It sounds like between you both you've come up with a pretty solid argument by exclusion for throttling WAL writing as space grows critical. Dropping archive segments seems pretty unsafe from the solid arguments Daniel presents, and Josh makes a good case for why shutting the DB down when archiving can't keep up isn't any better. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 07.06.2013 00:38, Andres Freund wrote: On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. That's not a bad technique. I wonder how reliable it would be in postgres. That's no different from just having a bit more WAL space in the first place. We need a mechanism to stop backends from writing WAL, before you run out of it completely. It doesn't matter if the reservation is done by stashing away a WAL segment for emergency use, or by a variable in shared memory. Either way, backends need to stop using it up, by blocking or throwing an error before they enter the critical section. I guess you could use the stashed away segment to ensure that you can recover after PANIC. At recovery, there are no other backends that could use up the emergency segment. But that's not much better than what we have now. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
--On 6. Juni 2013 16:25:29 -0700 Josh Berkus j...@agliodbs.com wrote: Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. Slightly OT, but I always wondered whether we could create a function, say pg_last_xlog_removed() for example, returning a value suitable for calculating the distance to the current position. An increasing value could be used to instruct monitoring to throw a warning if a certain threshold is exceeded. I've also seen people creating monitoring scripts by looking into archive_status and doing simple counts of the .ready files, giving a warning if that exceeds an expected maximum value. I haven't looked at the code very deeply, but I think we already store the position of the last removed xlog in shared memory; maybe this can be used somehow. Afaik, we do cleanup only during checkpoints, so this all has too much delay... -- Thanks Bernd
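The .ready-counting approach Bernd mentions can be scripted entirely outside the server; a minimal sketch of such a monitoring probe (the warning threshold is arbitrary, and the directory layout assumed is pg_xlog/archive_status as of this era):

```python
import os

def archiver_backlog(archive_status_dir):
    """Count .ready files in pg_xlog/archive_status; each one marks a
    segment still waiting to be archived. A growing count means the
    archiver is falling behind."""
    return sum(1 for name in os.listdir(archive_status_dir)
               if name.endswith(".ready"))

def check_backlog(archive_status_dir, warn_at=10):
    """Return a (status, count) pair in the style of a Nagios-like probe."""
    n = archiver_backlog(archive_status_dir)
    return ("WARNING" if n >= warn_at else "OK", n)
```

A cron job running this against `$PGDATA/pg_xlog/archive_status` and alerting on WARNING gives exactly the kind of early falling-behind signal the thread is asking for.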
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. That leaves a few miscellaneous XLogInsert calls that need to be guarded, but it leaves a lot less room for bugs of omission, and keeps the code cleaner. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Heikki Linnakangas hlinnakan...@vmware.com writes: On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. regards, tom lane
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 07.06.2013 19:33, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 06.06.2013 17:00, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. Actually, there's one place that catches most of these: LockBuffer(..., BUFFER_LOCK_EXCLUSIVE). In all heap and index operations, you always grab an exclusive lock on a page first, before entering the critical section where you call XLogInsert. Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Heikki Linnakangas hlinnakan...@vmware.com writes: On 07.06.2013 19:33, Tom Lane wrote: Not only is that a horrible layering/modularity violation, but surely LockBuffer can have no idea how much WAL space will be needed. It can be just a conservative guess, like, 32KB. That should be enough for almost all WAL-logged operations. The only exception that comes to mind is a commit record, which can be arbitrarily large, when you have a lot of subtransactions or dropped/created relations. What happens when several updates are occurring concurrently? regards, tom lane
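Tom's concurrency question is the crux: with many backends in flight, each reservation must be taken atomically from a shared counter, so concurrent reservations simply add up. A toy, lock-based model of the idea being debated (hypothetical names; the real thing would be a counter in PostgreSQL shared memory in C, updated under a spinlock or atomically):

```python
import threading

RESERVE_BYTES = 32 * 1024  # the conservative per-operation guess above

class WalReservation:
    """Toy model of a shared-memory WAL space reservation counter.
    A threading.Lock stands in for a shared-memory spinlock."""

    def __init__(self, free_bytes):
        self._free = free_bytes
        self._lock = threading.Lock()

    def reserve(self, nbytes=RESERVE_BYTES):
        """Called before entering the critical section. Returning False
        lets the caller raise a plain ERROR instead of later hitting a
        PANIC inside the critical section."""
        with self._lock:
            if self._free < nbytes:
                return False
            self._free -= nbytes
            return True

    def release(self, reserved, used):
        """After XLogInsert: give back the unused part of the guess."""
        with self._lock:
            self._free += reserved - used
```

Because every concurrent backend holds at most one in-flight reservation, the 32KB guess only needs to cover a single record per backend, not the sum of all concurrent activity.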
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Agreed, and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. People have different configurations. Most of my clients use archiving for backup or replication; they would rather have archiving cease (and send a CRITICAL alert) than have the master go offline. That's pretty common, probably more common than the if I don't have redundancy shut down case. Certainly anyone making the decision that their master database should shut down rather than cease archiving should make it *consciously*, instead of finding out the hard way. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. You're talking about using external tools -- frequently hackish, workaround ones -- to handle something which PostgreSQL should be doing itself, and which only the database engine has full knowledge of. While that's the only solution we have for now, it's hardly a worthy design goal. Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire an expensive consultant to set it up for you, or use this external tool of dubious provenance for which there are no packages, or you might accidentally cause your database to shut down in the middle of the night. At which point most sensible users say no thanks, I'll use something else. -- Josh Berkus PostgreSQL Experts Inc. 
http://pgexperts.com
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Fri, Jun 7, 2013 at 12:14 PM, Josh Berkus j...@agliodbs.com wrote: Right now, what we're telling users is You can have continuous backup with Postgres, but you'd better hire and expensive consultant to set it up for you, or use this external tool of dubious provenance which there's no packages for, or you might accidentally cause your database to shut down in the middle of the night. Inverted and just as well supported: if you want to not accidentally lose data, you better hire an expensive consultant to check your systems for all sorts of default 'safety = off' features. This being but the hypothetical first one. Furthermore, I see no reason why high quality external archiving software cannot exist. Maybe some even exists already, and no doubt they can be improved and the contract with Postgres enriched to that purpose. Contrast: JSON, where the stable OID in the core distribution helps pragmatically punt on a particularly sticky problem (extension dependencies and non-system OIDs), I can't think of a reason an external archiver couldn't do its job well right now. At which point most sensible users say no thanks, I'll use something else. Oh, I lost some disks, well, no big deal, I'll use the archives. Surprise! forensic analysis ensues So, as it turns out, it has been dropping segments at times because of systemic backlog for months/years. Alternative ending: Hey, I restored the database. later Why is the state so old? Why are customers getting warnings that their (thought paid) invoices are overdue? Oh crap, the restore was cut short by this stupid option and this database lives in the past! Fin. I have a clear bias in experience here, but I can't relate to someone who sets up archives but is totally okay losing a segment unceremoniously, because it only takes one of those once in a while to make a really, really bad day. Who is this person that lackadaisically archives, and are they just fooling themselves? 
And where are these archivers that enjoy even a modicum of long-term success that are not reliable? If one wants to casually drop archives, how is someone going to find out and freak out a bit? Per experience, logs are pretty clearly hazardous to the purpose. Basically, I think the default that opts one into danger is not good, especially since the system is starting from a position of do too much stuff and you'll crash. Finally, it's not that hard to teach any archiver how to no-op at user-peril, or perhaps Postgres can learn a way to do this expressly to standardize the procedure a bit to ease publicly shared recipes, perhaps.
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. I am rather doubtful that this won't end up as a bunch of complex code that won't prevent the situation in all circumstances but which will produce bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I am much more excited about getting the soft limit case right and then seeing how many problems remain in reality. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06.06.2013 17:17, Andres Freund wrote: On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote: A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. I am rather doubtful that this won't end up as a bunch of complex code that won't prevent the situation in all circumstances but which will produce bugs/performance problems for some time. Obviously that's just gut feeling since I haven't seen the code... I also have a feeling that we'll likely miss some corner cases in the first cut, so that you can still run out of disk space if you try hard enough / are unlucky. But I think it would still be a big improvement if it only catches, say, 90% of the cases. I think it can be made fairly robust otherwise, and the performance impact should be pretty easy to measure with e.g. pgbench. - Heikki
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
* Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. -- Christian
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. In other words, the idea is not to check over and over again that there is enough already-reserved WAL space, but to make sure there always is by having a preallocated segment that is never used outside a disk space emergency. That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
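Christian's ballast technique hinges on exactly the rename() question Andres raises. A sketch of the mechanics (file names are hypothetical; whether the final rename is ENOSPC-proof on copy-on-write filesystems such as btrfs, where even metadata updates need new blocks, is the open question):

```python
import os

SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size

def create_ballast(wal_dir, name="ballast.seg"):
    """At startup: preallocate an emergency segment with real data
    blocks (zero-filled and fsynced), never used in normal operation."""
    path = os.path.join(wal_dir, name)
    with open(path, "wb") as f:
        f.write(b"\0" * SEG_SIZE)
        f.flush()
        os.fsync(f.fileno())
    return path

def claim_ballast(wal_dir, next_segment_name, name="ballast.seg"):
    """On ENOSPC while creating the next segment: rename the ballast
    into place. rename() within one filesystem allocates no new data
    blocks, though possibly new metadata, which is the btrfs worry."""
    dst = os.path.join(wal_dir, next_segment_name)
    os.rename(os.path.join(wal_dir, name), dst)
    return dst
```

The rename is needed precisely because, as Andres notes, the file must carry a name derived from the LSN and timeline, which is unknown at preallocation time.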
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund and...@2ndquadrant.com wrote: That's not a bad technique. I wonder how reliable it would be in postgres. Do all filesystems allow a rename() to succeed if there isn't actually any space left? E.g. on btrfs I wouldn't be sure. We need to rename because WAL files need to be named after the LSN and timeline ID... I suppose we could just always do the rename at the same time as setting up the current log file. That is, when we start wal log x also set up wal file x+1 at that time. This isn't actually guaranteed to be enough btw. It's possible that the record we're actively about to write will require all of both those files... But that should be very unlikely. -- greg
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
Let's talk failure cases. There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three. Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records. One Volume -- This is the most complicated case, because we wouldn't necessarily run out of space because of WAL using it up. Anything could cause us to run out of disk space, including activity logs, swapping, pgsql_tmp files, database growth, or some other process which writes files. This means that the DBA getting out of disk-full manually is in some ways easier; there's usually stuff she can delete. However, it's much harder -- maybe impossible -- for PostgreSQL to prevent this kind of space outage. 
There should be things we can do to make it easier for the DBA to troubleshoot this, but I'm not sure what. We could use a hard limit for WAL to prevent WAL from contributing to out-of-space, but that'll only prevent a minority of cases. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
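The 90%-full trigger Josh proposes needs nothing more than a statvfs on the WAL directory; a minimal sketch (the threshold and function names are illustrative, not existing GUCs or server hooks):

```python
import os

def wal_volume_usage(wal_dir):
    """Fraction of the filesystem holding wal_dir that is in use."""
    st = os.statvfs(wal_dir)
    return 1.0 - (st.f_bavail / st.f_blocks)

def should_force_checkpoint(wal_dir, threshold=0.90):
    """True when the WAL volume crosses the threshold, i.e. when the
    server would force an immediate (non-spread) checkpoint in the
    scheme described above."""
    return wal_volume_usage(wal_dir) >= threshold
```

Note that using `f_bavail` (space available to unprivileged processes) rather than `f_bfree` matches what the DBA actually has to work with once the root reserve is excluded.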
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich ch...@chrullrich.net wrote: * Heikki Linnakangas wrote: The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can So we need to somehow stop new WAL insertions from happening, before it's too late. A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in There is a database engine, Microsoft's Jet Blue aka the Extensible Storage Engine, that just keeps some preallocated log files around, specifically so it can get consistent and halt cleanly if it runs out of disk space. fwiw, informix (at least until IDS 2000, not sure after that) had the same thing. only this was a parameter to set, and bad things happened if you forgot about it :D -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thursday, June 6, 2013, Josh Berkus wrote: Let's talk failure cases. There are actually three potential failure cases here: - One Volume: WAL is on the same volume as PGDATA, and that volume is completely out of space. - XLog Partition: WAL is on its own partition/volume, and fills it up. - Archiving: archiving is failing or too slow, causing the disk to fill up with waiting log segments. I'll argue that these three cases need to be dealt with in three different ways, and no single solution is going to work for all three. Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Of course maybe whatever causes the archive to fail will also cause the delivery of the message to fail, but I don't see a real solution to this that doesn't start down an infinite regress. If it is not failing outright, but merely falling behind, then I don't really know how to go about detecting that, either in archive_command, or through tailing the PostgreSQL log. 
I guess archive_command, each time it is invoked, could count the files in the pg_xlog directory and warn if it thinks the number is unreasonable. XLog Partition -- As Heikki pointed out, a full dedicated WAL drive is hard to fix once it gets full, since there's nothing you can safely delete to clear space, even enough for a checkpoint record. Although the DBA probably wouldn't know it from reading the manual, it is almost always safe to delete the oldest WAL file (after copying it to a different partition just in case something goes wrong--it should be possible to do that, since if WAL is on its own partition, it is hard to imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps two complete checkpoints' worth of WAL around. I think the only reason you would not be able to recover after removing the oldest file is if the controldata file is damaged such that the most recent checkpoint record cannot be found and so it has to fall back to the previous one. Or at least, this is my understanding. On the other hand, it should be easy to prevent full status; we could simply force a non-spread checkpoint whenever the available WAL space gets 90% full. We'd also probably want to be prepared to switch to a read-only mode if we get full enough that there's only room for the checkpoint records. I think that last sentence could also be applied without modification to the one volume case as well. So what would that look like? Before accepting a (non-checkpoint) WAL insert that fills up the current segment to a high enough level that a checkpoint record will no longer fit, it must first verify that a recycled file exists, or if not it must successfully init a new file. If that init fails, then it must do what? Signal for a checkpoint, release its locks, and then ERROR out? That would be better than a PANIC, but can it do better? 
Enter a retry loop so that once the checkpoint has finished, and assuming it has freed up enough WAL files for recycling/removal, it can try the original WAL insert again? Cheers, Jeff
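Jeff's point that archive_command can carry its own alerting is easy to demonstrate; a sketch of such a wrapper (the alert hook is a placeholder: mail, pager, monitoring push, whatever fits the site):

```python
import shutil
import sys

def archive_segment(src, dst, alert=lambda msg: print(msg, file=sys.stderr)):
    """Copy one WAL segment to the archive, as a local-copy
    archive_command would. Returns 0 on success; any nonzero status
    tells PostgreSQL the segment was NOT archived, so the file stays
    in pg_xlog and the command is retried later."""
    try:
        shutil.copy2(src, dst)
        return 0
    except OSError as exc:
        alert("archive of %s failed: %s" % (src, exc))
        return 1
```

The exit-status contract is the important part: the wrapper must only return 0 after the segment is durably archived, because that is PostgreSQL's sole signal that the file may eventually be recycled.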
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On 06/06/2013 09:30 PM, Jeff Janes wrote: Archiving - In some ways, this is the simplest case. Really, we just need a way to know when the available WAL space has become 90% full, and abort archiving at that stage. Once we stop attempting to archive, we can clean up the unneeded log segments. I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Agreed, and I would oppose it even as configurable. We set up the archiving for a reason. I do think it might be useful to be able to store archiving logs as well as wal_keep_segments logs in a different location than pg_xlog. What we need is a better way for the DBA to find out that archiving is falling behind when it first starts to fall behind. Tailing the log and examining the rather cryptic error messages we give out isn't very effective. The archive command can be made a shell script (or for that matter a compiled program) which can do anything it wants upon failure, including emailing people. Yep, that is what PITRTools does. You can make it do whatever you want. JD
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes jeff.ja...@gmail.com wrote: I would oppose that as the solution, either an unconditional one, or configurable with it as the default. Those segments are not unneeded. I need them. That is why I set up archiving in the first place. If you need to shut down the database rather than violate my established retention policy, then shut down the database. Same boat. My archives are the real storage. The disks are write-back caching. That's because the storage of my archives is probably three to five orders of magnitude more reliable.