Re: [HACKERS] Hot Standby b-tree delete records review

2010-11-09 Thread Simon Riggs
On Tue, 2010-11-09 at 13:34 +0200, Heikki Linnakangas wrote:
> (cleaning up my inbox, and bumped into this..)
> 
> On 22.04.2010 12:31, Simon Riggs wrote:
> > On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> >>>
>  If none of the removed heap tuples were present anymore, we currently
>  return InvalidTransactionId, which kills/waits out all read-only
>  queries. But if none of the tuples were present anymore, the read-only
>  queries wouldn't have seen them anyway, so ISTM that we should treat
>  InvalidTransactionId return value as "we don't need to kill anyone".
> >>> That's not the point. The tuples were not themselves the sole focus,
> >> Yes, they were. We're replaying a b-tree deletion record, which removes
> >> pointers to some heap tuples, making them unreachable to any read-only
> >> queries. If any of them still need to be visible to read-only queries,
> >> we have a conflict. But if all of the heap tuples are gone already,
> >> removing the index pointers to them can't change the situation for any
> >> query. If any of them should've been visible to a query, the damage was
> >> done already by whoever pruned the heap tuples leaving just the
> >> tombstone LP_DEAD item pointers (in the heap) behind.
> > You're missing my point. Those tuples are indicators of what may lie
> > elsewhere in the database, completely unreferenced by this WAL record.
> > Just because these referenced tuples are gone doesn't imply that all
> > tuple versions written by the as yet-unknown-xids are also gone. We
> > can't infer anything about the whole database just from one small group
> > of records.
>  Have you got an example of that?
> >>>
> >>> I don't need one, I have suggested the safe route. In order to infer
> >>> anything, and thereby further optimise things, we would need proof that
> >>> no cases can exist, which I don't have. Perhaps we can add "yet", not
> >>> sure about that either.
> >>
> >> It's good to be safe rather than sorry, but I'd still like to know
> >> because I'm quite surprised by that, and got me worried that I don't
> >> understand how hot standby works as well as I thought I did. I thought
> >> the point of stopping replay/killing queries at a b-tree deletion record
> >> is precisely that it makes some heap tuples invisible to running
> >> read-only queries. If it doesn't make any tuples invisible, why do any
> >> queries need to be killed? And why was it OK for them to be running just
> >> before replaying the b-tree deletion record?
> >
> > I'm sorry but I'm too busy to talk further on this today. Since we are
> > discussing a further optimisation rather than a bug, I hope it is OK to
> > come back to this again later.
> 
> Would now be a good time to revisit this? I still don't see why a b-tree 
> deletion record should conflict with anything, if all the removed index 
> tuples point to just LP_DEAD tombstones in the heap.

I want what you say to be true. The question is: is it? We just need to
explain why that will never be a problem.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-09-27 Thread Robert Haas
On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs  wrote:
> Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
> just never implemented. XXX comments removed.
>
> Allows us to avoid reading in blocks during VACUUM replay that are only
> required for correctness of index scans.

Review:

1. The block comment in XLogConfirmBufferIsUnpinned appears to be
copied from somewhere else, and doesn't really seem appropriate for a
new function since it refers to "the original coding of this routine".
 I think you could just delete the parenthesized portion of the
comment.

2. This bit from ConfirmBufferIsUnpinned looks odd to me.

+   /*
+    * Found it.  Now, pin/unpin the buffer to prove it's unpinned.
+    */
+   if (PinBuffer(buf, NULL))
+       UnpinBuffer(buf, false);

I don't think pinning and unpinning the buffer is sufficient to
prove that it isn't otherwise pinned.  If the buffer isn't in shared
buffers at all, it seems clear that no one can have it pinned.  But if
it's present in shared buffers, it seems like you still need
LockBufferForCleanup().
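
A rough sketch of the stronger check being described, assuming "buffer"
is a valid, already-pinned Buffer for the block in question (an
illustration of the existing bufmgr calls, not the patch code):

    if (BufferIsValid(buffer))
    {
        /*
         * Pinning and unpinning only shows that we could obtain a pin
         * ourselves.  LockBufferForCleanup() is what proves no one else
         * holds one, since it returns only once ours is the sole pin.
         */
        LockBufferForCleanup(buffer);   /* waits out any other pins */
        UnlockReleaseBuffer(buffer);    /* drop the lock and our own pin */
    }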

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-05-17 Thread Simon Riggs
On Mon, 2010-05-17 at 16:10 -0400, Tom Lane wrote:
> Jim Nasby  writes:
> > On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
> >> This is not the time to be hacking stuff like this.  You haven't even
> >> demonstrated that there's a significant performance issue here.
> 
> > I tend to agree that this point of the cycle isn't a good one to be making 
> > changes, but your performance statement confuses me. If a fairly small 
> > patch means we can avoid un-necessary reads why shouldn't we avoid them?
> 
> Well, by "time of the cycle" I meant "the day before beta1".  I'm not
> necessarily averse to making such a change at some point when it would
> get more than no testing before hitting our long-suffering beta testers.
> But I'd still want to see some evidence that there's a significant
> performance improvement to be had.

That patch only applies to one record type. However, since we've used
Greg's design of spidering out to each heap record, that can typically
mean 150-200 random I/Os per btree delete record. That could take some
time, perhaps 1s per WAL record of this type on a very large, I/O-bound
table. That's enough to give me cause for concern even without
performance measurements.

To derive such measurements we'd need to instrument each record type,
which we don't do right now either.

It might be easier to have a look at the patch and see if you think it's
worth the fuss of measuring it.

I don't think this is the patch that will correct the potential (and
partially observed) context-switching issue, but we have yet to recreate
that under lab conditions.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-05-17 Thread Tom Lane
Jim Nasby  writes:
> On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
>> This is not the time to be hacking stuff like this.  You haven't even
>> demonstrated that there's a significant performance issue here.

> I tend to agree that this point of the cycle isn't a good one to be making 
> changes, but your performance statement confuses me. If a fairly small patch 
> means we can avoid un-necessary reads why shouldn't we avoid them?

Well, by "time of the cycle" I meant "the day before beta1".  I'm not
necessarily averse to making such a change at some point when it would
get more than no testing before hitting our long-suffering beta testers.
But I'd still want to see some evidence that there's a significant
performance improvement to be had.

regards, tom lane



Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-05-17 Thread Jim Nasby
On Apr 29, 2010, at 3:20 PM, Tom Lane wrote:
> Simon Riggs  writes:
>> Objections to commit?
> 
> This is not the time to be hacking stuff like this.  You haven't even
> demonstrated that there's a significant performance issue here.

I tend to agree that this point in the cycle isn't a good one to be making
changes, but your performance statement confuses me. If a fairly small patch
means we can avoid unnecessary reads, why shouldn't we avoid them?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-04-29 Thread Tom Lane
Simon Riggs  writes:
> Objections to commit?

This is not the time to be hacking stuff like this.  You haven't even
demonstrated that there's a significant performance issue here.

regards, tom lane



Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Simon Riggs
On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> > 
> >> If none of the removed heap tuples were present anymore, we currently
> >> return InvalidTransactionId, which kills/waits out all read-only
> >> queries. But if none of the tuples were present anymore, the read-only
> >> queries wouldn't have seen them anyway, so ISTM that we should treat
> >> InvalidTransactionId return value as "we don't need to kill anyone".
> > That's not the point. The tuples were not themselves the sole focus,
>  Yes, they were. We're replaying a b-tree deletion record, which removes
>  pointers to some heap tuples, making them unreachable to any read-only
>  queries. If any of them still need to be visible to read-only queries,
>  we have a conflict. But if all of the heap tuples are gone already,
> >  removing the index pointers to them can't change the situation for any
>  query. If any of them should've been visible to a query, the damage was
>  done already by whoever pruned the heap tuples leaving just the
>  tombstone LP_DEAD item pointers (in the heap) behind.
> >>> You're missing my point. Those tuples are indicators of what may lie
> >>> elsewhere in the database, completely unreferenced by this WAL record.
> >>> Just because these referenced tuples are gone doesn't imply that all
> >>> tuple versions written by the as yet-unknown-xids are also gone. We
> >>> can't infer anything about the whole database just from one small group
> >>> of records.
> >> Have you got an example of that?
> > 
> > I don't need one, I have suggested the safe route. In order to infer
> > anything, and thereby further optimise things, we would need proof that
> > no cases can exist, which I don't have. Perhaps we can add "yet", not
> > sure about that either.
> 
> It's good to be safe rather than sorry, but I'd still like to know
> because I'm quite surprised by that, and got me worried that I don't
> understand how hot standby works as well as I thought I did. I thought
> the point of stopping replay/killing queries at a b-tree deletion record
> is precisely that it makes some heap tuples invisible to running
> read-only queries. If it doesn't make any tuples invisible, why do any
> queries need to be killed? And why was it OK for them to be running just
> before replaying the b-tree deletion record?

I'm sorry but I'm too busy to talk further on this today. Since we are
discussing a further optimisation rather than a bug, I hope it is OK to
come back to this again later.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
> 
>>>>>> If none of the removed heap tuples were present anymore, we currently
>>>>>> return InvalidTransactionId, which kills/waits out all read-only
>>>>>> queries. But if none of the tuples were present anymore, the read-only
>>>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
>>>>>> InvalidTransactionId return value as "we don't need to kill anyone".
>>>>> That's not the point. The tuples were not themselves the sole focus,
>>>> Yes, they were. We're replaying a b-tree deletion record, which removes
>>>> pointers to some heap tuples, making them unreachable to any read-only
>>>> queries. If any of them still need to be visible to read-only queries,
>>>> we have a conflict. But if all of the heap tuples are gone already,
>>>> removing the index pointers to them can't change the situation for any
>>>> query. If any of them should've been visible to a query, the damage was
>>>> done already by whoever pruned the heap tuples leaving just the
>>>> tombstone LP_DEAD item pointers (in the heap) behind.
>>> You're missing my point. Those tuples are indicators of what may lie
>>> elsewhere in the database, completely unreferenced by this WAL record.
>>> Just because these referenced tuples are gone doesn't imply that all
>>> tuple versions written by the as yet-unknown-xids are also gone. We
>>> can't infer anything about the whole database just from one small group
>>> of records.
>> Have you got an example of that?
> 
> I don't need one, I have suggested the safe route. In order to infer
> anything, and thereby further optimise things, we would need proof that
> no cases can exist, which I don't have. Perhaps we can add "yet", not
> sure about that either.

It's good to be safe rather than sorry, but I'd still like to know
because I'm quite surprised by that, and it got me worried that I don't
understand how hot standby works as well as I thought I did. I thought
the point of stopping replay/killing queries at a b-tree deletion record
is precisely that it makes some heap tuples invisible to running
read-only queries. If it doesn't make any tuples invisible, why do any
queries need to be killed? And why was it OK for them to be running just
before replaying the b-tree deletion record?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Simon Riggs
On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:

>  If none of the removed heap tuples were present anymore, we currently
>  return InvalidTransactionId, which kills/waits out all read-only
>  queries. But if none of the tuples were present anymore, the read-only
>  queries wouldn't have seen them anyway, so ISTM that we should treat
>  InvalidTransactionId return value as "we don't need to kill anyone".
> >>> That's not the point. The tuples were not themselves the sole focus,
> >> Yes, they were. We're replaying a b-tree deletion record, which removes
> >> pointers to some heap tuples, making them unreachable to any read-only
> >> queries. If any of them still need to be visible to read-only queries,
> >> we have a conflict. But if all of the heap tuples are gone already,
> >> removing the index pointers to them can't change the situation for any
> >> query. If any of them should've been visible to a query, the damage was
> >> done already by whoever pruned the heap tuples leaving just the
> >> tombstone LP_DEAD item pointers (in the heap) behind.
> > 
> > You're missing my point. Those tuples are indicators of what may lie
> > elsewhere in the database, completely unreferenced by this WAL record.
> > Just because these referenced tuples are gone doesn't imply that all
> > tuple versions written by the as yet-unknown-xids are also gone. We
> > can't infer anything about the whole database just from one small group
> > of records.
> 
> Have you got an example of that?

I don't need one, I have suggested the safe route. In order to infer
anything, and thereby further optimise things, we would need proof that
no cases can exist, which I don't have. Perhaps we can add "yet", not
sure about that either.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
>>>> btree_redo:
>>>>>   /*
>>>>>    * Note that if all heap tuples were LP_DEAD then we will be
>>>>>    * returning InvalidTransactionId here. This seems very unlikely
>>>>>    * in practice.
>>>>>    */
>>>> If none of the removed heap tuples were present anymore, we currently
>>>> return InvalidTransactionId, which kills/waits out all read-only
>>>> queries. But if none of the tuples were present anymore, the read-only
>>>> queries wouldn't have seen them anyway, so ISTM that we should treat
>>>> InvalidTransactionId return value as "we don't need to kill anyone".
>>> That's not the point. The tuples were not themselves the sole focus,
>> Yes, they were. We're replaying a b-tree deletion record, which removes
>> pointers to some heap tuples, making them unreachable to any read-only
>> queries. If any of them still need to be visible to read-only queries,
>> we have a conflict. But if all of the heap tuples are gone already,
>> removing the index pointers to them can't change the situation for any
>> query. If any of them should've been visible to a query, the damage was
>> done already by whoever pruned the heap tuples leaving just the
>> tombstone LP_DEAD item pointers (in the heap) behind.
> 
> You're missing my point. Those tuples are indicators of what may lie
> elsewhere in the database, completely unreferenced by this WAL record.
> Just because these referenced tuples are gone doesn't imply that all
> tuple versions written by the as yet-unknown-xids are also gone. We
> can't infer anything about the whole database just from one small group
> of records.

Have you got an example of that?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Simon Riggs
On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
> >> btree_redo:
> >>>   /*
> >>>* Note that if all heap tuples were LP_DEAD then we will be
> >>>* returning InvalidTransactionId here. This seems very unlikely
> >>>* in practice.
> >>>*/
> >> If none of the removed heap tuples were present anymore, we currently
> >> return InvalidTransactionId, which kills/waits out all read-only
> >> queries. But if none of the tuples were present anymore, the read-only
> >> queries wouldn't have seen them anyway, so ISTM that we should treat
> >> InvalidTransactionId return value as "we don't need to kill anyone".
> > 
> > That's not the point. The tuples were not themselves the sole focus,
> 
> Yes, they were. We're replaying a b-tree deletion record, which removes
> pointers to some heap tuples, making them unreachable to any read-only
> queries. If any of them still need to be visible to read-only queries,
> we have a conflict. But if all of the heap tuples are gone already,
> removing the index pointers to them can't change the situation for any
> query. If any of them should've been visible to a query, the damage was
> done already by whoever pruned the heap tuples leaving just the
> tombstone LP_DEAD item pointers (in the heap) behind.

You're missing my point. Those tuples are indicators of what may lie
elsewhere in the database, completely unreferenced by this WAL record.
Just because these referenced tuples are gone doesn't imply that all
> tuple versions written by the as-yet-unknown xids are also gone. We
can't infer anything about the whole database just from one small group
of records.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
>> btree_redo:
>>> /*
>>>  * Note that if all heap tuples were LP_DEAD then we will be
>>>  * returning InvalidTransactionId here. This seems very unlikely
>>>  * in practice.
>>>  */
>> If none of the removed heap tuples were present anymore, we currently
>> return InvalidTransactionId, which kills/waits out all read-only
>> queries. But if none of the tuples were present anymore, the read-only
>> queries wouldn't have seen them anyway, so ISTM that we should treat
>> InvalidTransactionId return value as "we don't need to kill anyone".
> 
> That's not the point. The tuples were not themselves the sole focus,

Yes, they were. We're replaying a b-tree deletion record, which removes
pointers to some heap tuples, making them unreachable to any read-only
queries. If any of them still need to be visible to read-only queries,
we have a conflict. But if all of the heap tuples are gone already,
removing the index pointers to them can't change the situation for any
query. If any of them should've been visible to a query, the damage was
done already by whoever pruned the heap tuples leaving just the
tombstone LP_DEAD item pointers (in the heap) behind.

Or do we use the latestRemovedXid value for something else as well?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby b-tree delete records review

2010-04-22 Thread Simon Riggs
On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote:
> btree_redo:
> > case XLOG_BTREE_DELETE:
> > 
> > /*
> >  * Btree delete records can conflict with standby queries. You
> >  * might think that vacuum records would conflict as well, but
> >  * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
> >  * provide the highest xid cleaned by the vacuum of the heap
> >  * and so we can resolve any conflicts just once when that
> >  * arrives. After that we know that no conflicts exist
> >  * from individual btree vacuum records on that index.
> >  */
> > {
> > 	TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
> > 	xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
> > 
> > 	/*
> > 	 * XXX Currently we put everybody on death row, because
> > 	 * currently _bt_delitems() supplies InvalidTransactionId.
> > 	 * This can be fairly painful, so providing a better value
> > 	 * here is worth some thought and possibly some effort to
> > 	 * improve.
> > 	 */
> > 	ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
> > }
> > break;
> 
> The XXX comment is out-of-date, the latestRemovedXid value is calculated
> by btree_xlog_delete_get_latestRemovedXid() nowadays.

Removed, thanks.

> If we're re-replaying the WAL record, for example after restarting the
> standby server, btree_xlog_delete_get_latestRemovedXid() won't find the
> deleted records and will return InvalidTransactionId. That's OK, because
> until we reach again the point in WAL where we were before the restart,
> we don't accept read-only connections so there's no-one to kill anyway,
> but you do get a useless "Invalid latestRemovedXid reported, using
> latestCompletedXid instead" message in the log (that shouldn't be
> capitalized, BTW).

> It would be nice to check if there's any potentially conflicting
> read-only queries before calling
> btree_xlog_delete_get_latestRemovedXid(), which can be quite expensive.

Good idea. You're welcome to add such tuning yourself, if you like.
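
A minimal sketch of that sort of fast path, placed at the top of
btree_xlog_delete_get_latestRemovedXid(), might look like the following
(an illustration using the existing CountDBBackends() call, not
necessarily the change as committed):

    TransactionId latestRemovedXid = InvalidTransactionId;

    /*
     * If no read-only backends are connected there is nobody to conflict
     * with, so skip the expensive visits to the heap entirely and let the
     * caller handle the (now trivial) conflict resolution.
     */
    if (CountDBBackends(InvalidOid) == 0)
        return latestRemovedXid;

    /* ... otherwise read the index and heap pages as before ... */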

> If the "Invalid latestRemovedXid reported, using latestCompletedXid
> instead" message is going to happen commonly, I think it should be
> downgraded to DEBUG1. If it's an unexpected scenario, it should be
> upgraded to WARNING.

Set to DEBUG because the above optimisation makes it return invalid much
more frequently, which we don't want reported.
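
Something along these lines, assuming the message goes through the usual
ereport() path (the exact call site isn't shown in this thread):

    ereport(DEBUG1,
            (errmsg("invalid latestRemovedXid reported, using latestCompletedXid instead")));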

> In btree_xlog_delete_get_latestRemovedXid:
> > Assert(num_unused == 0);
> 
> Can't that happen as well in a re-replay scenario, if a heap item was
> vacuumed away later on?

OK, will remove. The re-replay gets me every time.

> > /*
> >  * Note that if all heap tuples were LP_DEAD then we will be
> >  * returning InvalidTransactionId here. This seems very unlikely
> >  * in practice.
> >  */
> 
> If none of the removed heap tuples were present anymore, we currently
> return InvalidTransactionId, which kills/waits out all read-only
> queries. But if none of the tuples were present anymore, the read-only
> queries wouldn't have seen them anyway, so ISTM that we should treat
> InvalidTransactionId return value as "we don't need to kill anyone".

That's not the point. The tuples were not themselves the sole focus,
they indicated the latestRemovedXid of the backend on the primary that
had performed the deletion. So even if those tuples are no longer
present there may be others with similar xids that would conflict, so we
cannot ignore. Comment updated.

> Why does btree_xlog_delete_get_latestRemovedXid() keep the
> num_unused/num_dead/num_redirect counts? It doesn't actually do anything
> with them.

Probably a debug tool. Removed.

Changes committed.

Thanks for the review.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-14 Thread Simon Riggs
On Tue, 2010-04-13 at 17:18 +0300, Heikki Linnakangas wrote:
> > I've reviewed your changes and they look correct to me; the main chunk
> > of code is mine and that was tested by me.
> 
> Ok, committed after fixing an obsoleted comment & other small
> editorialization.

Looks good, thanks.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-13 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Thu, 2010-04-08 at 19:02 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> OK, that seems better. I'm happy with that instead.
>>>
>>> Have you tested this? Is it ready to commit?
>> Only very briefly. I think the code is ready, but please review and test
>> to see I didn't miss anything.
> 
> I'm going to need you to commit this. I'm on holiday now until 14 April,
> so its not going to get a retest before then otherwise; its not smart to
> commit and then go on holiday, IIRC. 

:-)

> I've reviewed your changes and they look correct to me; the main chunk
> of code is mine and that was tested by me.

Ok, committed after fixing an obsoleted comment & other small
editorialization.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Simon Riggs
On Thu, 2010-04-08 at 19:02 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > OK, that seems better. I'm happy with that instead.
> > 
> > Have you tested this? Is it ready to commit?
> 
> Only very briefly. I think the code is ready, but please review and test
> to see I didn't miss anything.

I'm going to need you to commit this. I'm on holiday now until 14 April,
so it's not going to get a retest before then otherwise; it's not smart to
commit and then go on holiday, IIRC.

I've reviewed your changes and they look correct to me; the main chunk
of code is mine and that was tested by me.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Heikki Linnakangas
Simon Riggs wrote:
> OK, that seems better. I'm happy with that instead.
> 
> Have you tested this? Is it ready to commit?

Only very briefly. I think the code is ready, but please review and test
to see I didn't miss anything.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Simon Riggs
On Thu, 2010-04-08 at 18:35 +0300, Heikki Linnakangas wrote:
> 
> > So I have introduced the new mode ("snapshot mode") to enter hot standby
> > anyway. That avoids us having to screw around with the loop logic for
> > redo. I don't see any need to support the case of where we have no WAL
> > source defined, yet we want Hot Standby but we also want to allow
> > somebody to drop a WAL file into pg_xlog at some future point. That has
> > no use case of value AFAICS and is too complex to add at this stage of
> > the release cycle.
> 
> You don't need a new mode for that. Just do the same "are we
> consistent now?" check you do in the loop once before calling
> ReadRecord to fetch the record that follows the checkpoint pointer.
> Attached is a patch to show what I mean. We just need to let
> postmaster know that recovery has started a bit earlier, right after
> processing the checkpoint record, not delaying it until we've read the
> first record after it.

OK, that seems better. I'm happy with that instead.

Have you tested this? Is it ready to commit?

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Heikki Linnakangas
Simon Riggs wrote:
> In StartupXlog() when we get to the point where we "Find the first
> record that logically follows the checkpoint", in the current code
> ReadRecord() loops forever, spitting out
> LOG: record with zero length at 0/C88
> ...
> 
> That prevents us from going further down StartupXLog() to the point
> where we start the InRedo loop and hence start hot standby. As long as
> we retry we cannot progress further: this is the main problem.
> 
> So in the patch, I have modified the retry test in ReadRecord() so it no
> longer retries iff there is no WAL source defined. Now, when
> ReadRecord() exits, record == NULL at that point and so we do not (and
> cannot) enter the redo loop.

Oh, I see.

> So I have introduced the new mode ("snapshot mode") to enter hot standby
> anyway. That avoids us having to screw around with the loop logic for
> redo. I don't see any need to support the case of where we have no WAL
> source defined, yet we want Hot Standby but we also want to allow
> somebody to drop a WAL file into pg_xlog at some future point. That has
> no use case of value AFAICS and is too complex to add at this stage of
> the release cycle.

You don't need a new mode for that. Just do the same "are we consistent
now?" check you do in the loop once before calling ReadRecord to fetch
the record that follows the checkpoint pointer. Attached is a patch to
show what I mean. We just need to let postmaster know that recovery has
started a bit earlier, right after processing the checkpoint record, not
delaying it until we've read the first record after it.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***
*** 1719,1724  PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
--- 1719,1806 
  }
  
  /*
+  * StandbyRecoverPreparedTransactions
+  *
+  * Scan the pg_twophase directory and setup all the required information to
+  * allow standby queries to treat prepared transactions as still active.
+  * This is never called at the end of recovery - we use
+  * RecoverPreparedTransactions() at that point.
+  *
+  * Currently we simply call SubTransSetParent() for any subxids of prepared
+  * transactions.
+  */
+ void
+ StandbyRecoverPreparedTransactions(bool can_overwrite)
+ {
+ 	DIR		   *cldir;
+ 	struct dirent *clde;
+ 
+ 	cldir = AllocateDir(TWOPHASE_DIR);
+ 	while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
+ 	{
+ 		if (strlen(clde->d_name) == 8 &&
+ 			strspn(clde->d_name, "0123456789ABCDEF") == 8)
+ 		{
+ 			TransactionId xid;
+ 			char	   *buf;
+ 			TwoPhaseFileHeader *hdr;
+ 			TransactionId *subxids;
+ 			int			i;
+ 
+ 			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
+ 
+ 			/* Already processed? */
+ 			if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+ 			{
+ ereport(WARNING,
+ 		(errmsg("removing stale two-phase state file \"%s\"",
+ clde->d_name)));
+ RemoveTwoPhaseFile(xid, true);
+ continue;
+ 			}
+ 
+ 			/* Read and validate file */
+ 			buf = ReadTwoPhaseFile(xid, true);
+ 			if (buf == NULL)
+ 			{
+ ereport(WARNING,
+ 	  (errmsg("removing corrupt two-phase state file \"%s\"",
+ 			  clde->d_name)));
+ RemoveTwoPhaseFile(xid, true);
+ continue;
+ 			}
+ 
+ 			/* Deconstruct header */
+ 			hdr = (TwoPhaseFileHeader *) buf;
+ 			if (!TransactionIdEquals(hdr->xid, xid))
+ 			{
+ ereport(WARNING,
+ 	  (errmsg("removing corrupt two-phase state file \"%s\"",
+ 			  clde->d_name)));
+ RemoveTwoPhaseFile(xid, true);
+ pfree(buf);
+ continue;
+ 			}
+ 
+ 			/*
+ 			 * Examine subtransaction XIDs ... they should all follow main
+ 			 * XID, and they may force us to advance nextXid.
+ 			 */
+ 			subxids = (TransactionId *)
+ (buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
+ 			for (i = 0; i < hdr->nsubxacts; i++)
+ 			{
+ TransactionId subxid = subxids[i];
+ 
+ Assert(TransactionIdFollows(subxid, xid));
+ SubTransSetParent(xid, subxid, can_overwrite);
+ 			}
+ 		}
+ 	}
+ 	FreeDir(cldir);
+ }
+ 
+ /*
   * RecoverPreparedTransactions
   *
   * Scan the pg_twophase directory and reload shared-memory state for each
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 494,499  static XLogRecPtr minRecoveryPoint;		/* local copy of
--- 494,501 
  		 * ControlFile->minRecoveryPoint */
  static bool updateMinRecoveryPoint = true;
  
+ static bool reachedMinRecoveryPoint = false;
+ 
  static bool InRedo = false;
  
  /*
***
*** 547,552  static void ValidateXLOGDirectoryStructure(void);
--- 549,555 
  static void CleanupBackupHistory(void);
  static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
  static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
+ static void CheckRecoveryConsistency(void);
  static bool ValidXLOGHeader(XLogPageHead

Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Robert Haas
On Thu, Apr 8, 2010 at 6:16 AM, Simon Riggs  wrote:
> If standby_mode is enabled and there is no source of WAL, then we get a
> stream of messages saying
>
> LOG:  record with zero length at 0/C88
> ...
>
> but most importantly we never get to the main recovery loop, so Hot
> Standby never gets to start at all. We can't keep retrying the request
> for WAL and at the same time enter the retry loop, executing lots of
> things that expect non-NULL pointers using a NULL xlog pointer.

This is pretty much a corner case, so I don't think it's a good idea
to add a new mode to handle it.  It also seems like it would be pretty
inconsistent if we allow WAL to be dropped in pg_xlog, but only if we
are also doing archive recovery or streaming replication.  If we can't
support this case with the same code path we use otherwise, I think we
should revert to disallowing it.

Having said that, I guess I don't understand how having a source of
WAL solves the problem described above.  Do we always have to read at
least 1 byte of WAL from either SR or the archive before starting up?
If not, why do we need to do so here?

...Robert



Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Simon Riggs
On Thu, 2010-04-08 at 13:33 +0300, Heikki Linnakangas wrote:

> > If standby_mode is enabled and there is no source of WAL, then we get a
> > stream of messages saying
> > 
> > LOG:  record with zero length at 0/C88
> > ...
> > 
> > but most importantly we never get to the main recovery loop, so Hot
> > Standby never gets to start at all. We can't keep retrying the request
> > for WAL and at the same time enter the retry loop, executing lots of
> > things that expect non-NULL pointers using a NULL xlog pointer.
> 
> You mean it can't find even the checkpoint record to start replaying? 

Clearly I don't mean that. Otherwise it wouldn't be "start from a
shutdown checkpoint". I think you are misunderstanding me.

Let me explain in more detail, though please also read the patch before
replying, if you do.

The patch I submitted at the top of this thread works for allowing Hot
Standby during recovery. Yes, of course that occurs when the database is
consistent. The trick is to get recovery to the point where it can be
enabled. The second patch on this thread presents a way to get the
database to that point; it touches some of the other recovery code that
you and Masao have worked on. We *must* touch that code if we are to
enable Hot Standby in the way you desire.

In StartupXlog() when we get to the point where we "Find the first
record that logically follows the checkpoint", in the current code
ReadRecord() loops forever, spitting out
LOG: record with zero length at 0/C88
...

That prevents us from going further down StartupXLog() to the point
where we start the InRedo loop and hence start hot standby. As long as
we retry we cannot progress further: this is the main problem.

So in the patch, I have modified the retry test in ReadRecord() so it no
longer retries iff there is no WAL source defined. Now, when
ReadRecord() exits, record == NULL at that point and so we do not (and
cannot) enter the redo loop.
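
In other words, the retry test ends up looking something like this sketch
(illustrative only; HaveWalSource() is a hypothetical stand-in for "some
WAL source is configured", not an existing function):

    /* In standby mode, keep trying only while a WAL source is configured */
    if (StandbyMode && !HaveWalSource())
        return NULL;    /* caller sees record == NULL and skips the
                         * redo loop instead of waiting forever */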

So I have introduced the new mode ("snapshot mode") to enter hot standby
anyway. That avoids us having to screw around with the loop logic for
redo. I don't see any need to support the case of where we have no WAL
source defined, yet we want Hot Standby but we also want to allow
somebody to drop a WAL file into pg_xlog at some future point. That has
no use case of value AFAICS and is too complex to add at this stage of
the release cycle.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Tue, 2010-04-06 at 10:22 +0100, Simon Riggs wrote:
> 
>> Initial patch. I will be testing over next day. No commit before at
>> least midday on Wed 7 Apr.
> 
> Various previous discussions sidelined a very important point: what
> exactly does it mean to "start recovery from a shutdown checkpoint"?

Hot standby should be possible as soon as we know that the database is
consistent. That is, as soon as we've replayed WAL past the
minRecoveryPoint/backupStartPoint point indicated in pg_control.
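
That test can be sketched roughly as follows, using the existing xlog.c
variables and 9.0-era XLogRecPtr macros (an illustration of the idea, not
the exact patch code):

    if (!reachedMinRecoveryPoint &&
        XLByteLE(minRecoveryPoint, EndRecPtr))
    {
        reachedMinRecoveryPoint = true;
        ereport(LOG,
                (errmsg("consistent recovery state reached at %X/%X",
                        EndRecPtr.xlogid, EndRecPtr.xrecoff)));
        /* ... and tell the postmaster it may now accept read-only
         * connections ... */
    }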

> If standby_mode is enabled and there is no source of WAL, then we get a
> stream of messages saying
> 
> LOG:  record with zero length at 0/C88
> ...
> 
> but most importantly we never get to the main recovery loop, so Hot
> Standby never gets to start at all. We can't keep retrying the request
> for WAL and at the same time enter the retry loop, executing lots of
> things that expect non-NULL pointers using a NULL xlog pointer.

You mean it can't find even the checkpoint record to start replaying? I
think the behavior in that scenario is fine as it is. The database isn't
consistent (or at least we can't know if it is, because we don't know
the redo pointer) until you read and replay the first checkpoint record.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint

2010-04-08 Thread Simon Riggs
On Tue, 2010-04-06 at 10:22 +0100, Simon Riggs wrote:

> Initial patch. I will be testing over next day. No commit before at
> least midday on Wed 7 Apr.

Various previous discussions sidelined a very important point: what
exactly does it mean to "start recovery from a shutdown checkpoint"?

If standby_mode is enabled and there is no source of WAL, then we get a
stream of messages saying

LOG:  record with zero length at 0/C88
...

but most importantly we never get to the main recovery loop, so Hot
Standby never gets to start at all. We can't keep retrying the request
for WAL and at the same time enter the retry loop, executing lots of
things that expect non-NULL pointers using a NULL xlog pointer.

What we are asking for here is a completely new state: the database is
not "in recovery" - by definition there is nothing at all to recover. 

The following patch adds "Snapshot Mode", a very simple variation on the
existing code - emphasis on the "simple":

LOG:  entering snapshot mode
LOG:  record with zero length at 0/C88
LOG:  consistent recovery state reached at 0/C88
LOG:  database system is ready to accept read only connections

this mode does *not* continually check to see if new WAL files have been
added. Startup just sits and waits, backends allowed. If a trigger file
is specified, then we can leave recovery. Otherwise the Startup process just
sits doing nothing.
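
Roughly, the Startup process's behaviour in this mode amounts to the
following sketch (an illustration, not the patch itself):

    /* consistency reported, read-only backends already allowed in */
    while (!CheckForStandbyTrigger())
        pg_usleep(1000000L);    /* no WAL source to poll; just idle,
                                 * checking for a trigger file that
                                 * ends recovery */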

There's possibly an argument for inventing some more special modes where
we do allow read only connections but don't start the bgwriter. I don't
personally wish to do this at this stage of the release cycle. The
attached patch is non-invasive and safe and I want to leave it at that.

I will be committing later today, unless major objections, but I ask you
to read the patch before you sharpen your pen. It's simple.

-- 
 Simon Riggs   www.2ndQuadrant.com
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e2566a4..365cd17 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1719,6 +1719,88 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 }
 
 /*
+ * StandbyRecoverPreparedTransactions
+ *
+ * Scan the pg_twophase directory and setup all the required information to
+ * allow standby queries to treat prepared transactions as still active.
+ * This is never called at the end of recovery - we use
+ * RecoverPreparedTransactions() at that point.
+ *
+ * Currently we simply call SubTransSetParent() for any subxids of prepared
+ * transactions.
+ */
+void
+StandbyRecoverPreparedTransactions(bool can_overwrite)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(TWOPHASE_DIR);
+	while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == 8 &&
+			strspn(clde->d_name, "0123456789ABCDEF") == 8)
+		{
+			TransactionId xid;
+			char	   *buf;
+			TwoPhaseFileHeader *hdr;
+			TransactionId *subxids;
+			int			i;
+
+			xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
+
+			/* Already processed? */
+			if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+			{
+ereport(WARNING,
+		(errmsg("removing stale two-phase state file \"%s\"",
+clde->d_name)));
+RemoveTwoPhaseFile(xid, true);
+continue;
+			}
+
+			/* Read and validate file */
+			buf = ReadTwoPhaseFile(xid, true);
+			if (buf == NULL)
+			{
+ereport(WARNING,
+	  (errmsg("removing corrupt two-phase state file \"%s\"",
+			  clde->d_name)));
+RemoveTwoPhaseFile(xid, true);
+continue;
+			}
+
+			/* Deconstruct header */
+			hdr = (TwoPhaseFileHeader *) buf;
+			if (!TransactionIdEquals(hdr->xid, xid))
+			{
+ereport(WARNING,
+	  (errmsg("removing corrupt two-phase state file \"%s\"",
+			  clde->d_name)));
+RemoveTwoPhaseFile(xid, true);
+pfree(buf);
+continue;
+			}
+
+			/*
+			 * Examine subtransaction XIDs ... they should all follow main
+			 * XID, and they may force us to advance nextXid.
+			 */
+			subxids = (TransactionId *)
+(buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
+			for (i = 0; i < hdr->nsubxacts; i++)
+			{
+TransactionId subxid = subxids[i];
+
+Assert(TransactionIdFollows(subxid, xid));
+SubTransSetParent(xid, subxid, can_overwrite);
+			}
+		}
+	}
+	FreeDir(cldir);
+}
+
+/*
  * RecoverPreparedTransactions
  *
  * Scan the pg_twophase directory and reload shared-memory state for each
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3000ab7..d26b369 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -180,6 +180,7 @@ static TimestampTz recoveryLastXTime = 0;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyMode = false;
+static bool SnapshotMode = false;
 static char *PrimaryConnInfo = NULL;
 char	   *TriggerFile = NULL;
 
@@ -3845,7 +3846,7 @@ next_record_is_invalid:
 	}
 
 	/* In standby-mode, keep trying *

Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-03-02 Thread Simon Riggs
On Tue, 2010-03-02 at 12:47 -0800, Marc Munro wrote:

> IIUC this is only a problem for WAL from HOT updates and vacuums.  If no
> vacuums or HOT updates have been performed, there is no risk of
> returning bad data.  So WAL that does not contain HOT updates or vacuums
> could be applied on the standby without risk, even if there are
> long-running queries in play.  This is not a complete solution but may
> reduce the likelihood of queries having to be cancelled.  I guess the
> approach here would be to check WAL before applying it, and only cancel
> queries if the WAL contains HOT updates or vacuums.

That's what we do.

> Taking the idea further, if WAL records contained the tid of the latest
> tuples that were overwritten, even more WAL could be applied without
> having to cancel queries.
> 
> To take it further still, if vacuum on the master could be prevented
> from touching records that are less than max_standby_delay seconds old,
> it would be safe to apply WAL from the very latest vacuum.  I guess HOT
> could be handled similarly though that may eliminate much of the
> advantage of HOT updates.

Thanks for your ideas.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-03-02 Thread Josh Berkus
On 3/2/10 12:47 PM, Marc Munro wrote:
> To take it further still, if vacuum on the master could be prevented
> from touching records that are less than max_standby_delay seconds old,
> it would be safe to apply WAL from the very latest vacuum.  I guess HOT
> could be handled similarly though that may eliminate much of the
> advantage of HOT updates.

Aside from the inability to convert between transaction count and time,
isn't this what vacuum_defer_cleanup_age is supposed to do?  Or does it
not help with HOT updates?


--Josh Berkus



Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-28 Thread Simon Riggs
On Fri, 2010-02-26 at 03:33 -0500, Greg Smith wrote:

> I really hope this discussion can stay focused on if and how it's
> possible to improve this area, with the goal being to deliver a product 
> everyone can be proud of with the full feature set that makes this next 
> release a killer one. The features that have managed to all get into 
> this release already are fantastic, everyone who contributed should be 
> proud of that progress, and it's encouraging that the alpha4 date was 
> nailed.  It would be easy to descend into finger-pointing for why 
> exactly this particular problem is only getting more visibility now, or 
> into schedule-oriented commentary suggesting it must be ignored because 
> it's too late to do anything about it.  I hope everyone appreciates 
> wandering that way will not help make PostgreSQL 9.0 a better release.  
> This issue is so easy to encounter, and looks so bad when it happens, 
> that I feel it could easily lead to an embarrassing situation for the 
> community if something isn't done about it before release.

Thanks Greg. It's a great relief for me to hear someone else say this
and to watch a discussion about this important issue unfold.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-28 Thread Simon Riggs
On Fri, 2010-02-26 at 16:44 -0500, Tom Lane wrote:
> Greg Stark  writes:
> > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane  wrote:
> >> There's *definitely* not going to be enough information in the WAL
> >> stream coming from a master that doesn't think it has HS slaves.
> >> We can't afford to record all that extra stuff in installations for
> >> which it's just useless overhead. BTW, has anyone made any attempt
> >> to measure the performance hit that the patch in its current form is
> >> creating via added WAL entries?
> 
> > What extra entries?
> 
> Locks, just for starters.  I haven't read enough of the code yet to know
> what else Simon added.  In the past it's not been necessary to record
> any transient information in WAL, but now we'll have to.

There is room for technical confusion here, so I'll just add some info.

There was/is potential for performance hit because of the volume of
additional WAL *and* the processing overhead from that additional WAL.
As Heikki points out these turn out to be minimal, though this has been
by careful design.

There is also potential for a performance hit because incoming cleanup
records may conflict with currently executing queries. If we knew for
certain that the master was not sending any cleanup records that would
affect current standby queries, we could avoid that overhead altogether.
That's a good reason for integrating a solution.

AccessExclusiveLock lock records are nothing at all to do with that.
They exist simply to prevent obvious correctness issues such as somebody
reading a file while it is being deleted.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Greg Smith

Greg Stark wrote:
> On Sun, Feb 28, 2010 at 5:28 AM, Greg Smith  wrote:
>> The idea of the workaround is that if you have a single long-running query
>> to execute, and you want to make sure it doesn't get canceled because of a
>> vacuum cleanup, you just have it connect back to the master to keep an open
>> snapshot the whole time.
>
> Also, I'm not sure this actually works. When your client makes this
> additional connection to the master it's connecting at some
> transaction in the future from the slave's point of view. The master
> could have already vacuumed away some record which the snapshot the
> client gets on the slave will have in view.

Right, and there was an additional comment in the docs alluding to some
sleep time on the master that intends to try and improve things.  If you
knew how long archive_timeout was you could try to sleep longer than it
to try and increase your odds of avoiding an ugly spot.  But there are
race conditions galore possible here, particularly if your archiver or
standby catchup is backlogged.

> Still it's a handy practical trick even if it isn't 100% guaranteed to
> work. But I don't think it provides the basis for something we can
> bake in.

Agreed on both counts, which is why it's in the current docs as a
workaround people can consider, but not what I've been advocating as the
right way to proceed.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Greg Smith

Bruce Momjian wrote:
>> "The first option is to connect to the primary server and keep a query
>> active for as long as needed to run queries on the standby. This
>> guarantees that a WAL cleanup record is never generated and query
>> conflicts do not occur, as described above. This could be done using
>> contrib/dblink  and pg_sleep(), or via other mechanisms."
>
> I am unclear how you would easily advance the snapshot as each query
> completes on the slave.

The idea of the workaround is that if you have a single long-running
query to execute, and you want to make sure it doesn't get canceled
because of a vacuum cleanup, you just have it connect back to the master
to keep an open snapshot the whole time.  That's basically the same idea
that vacuum_defer_cleanup_age implements, except you don't have to
calculate a value--you just hold open the snapshot to do it.


When that query ended, its snapshot would be removed, and then the 
master would advance to whatever the next latest one is.  Nothing 
fancier than that.  The only similarity is that if you made every query 
that happened on the standby do that, it would effectively be the same 
behavior I'm suggesting could be available via the standby->master xmin 
publication.
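
For illustration, the same trick expressed directly with libpq (a hedged
sketch; the connection string and sleep duration are placeholders):

    #include <libpq-fe.h>

    PGconn *master = PQconnectdb("host=primary dbname=postgres");

    if (PQstatus(master) == CONNECTION_OK)
    {
        /* keep one long statement running on the primary, which holds
         * that backend's snapshot (and hence its xmin) for the duration */
        PQsendQuery(master, "SELECT pg_sleep(3600)");

        /* ... run the long read-only query on the standby meanwhile ... */

        PQfinish(master);   /* done: the primary's xmin can advance again */
    }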


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Bruce Momjian
Greg Smith wrote:
> Josh Berkus wrote:
> 
> > Now that I think about it, the xmin thing really doesn't seem
> > conceptually difficult.  If the slave just opens a 2nd, special query
> > connection back to the master and publishes its oldest xmin there, as
> > far as the master is concerned, it's just another query backend.
> > Could it be that easy?
> >   
> 
> Something just like that is in fact already suggested as a workaround in 
> the Hot Standby manual:
> 
> "The first option is to connect to the primary server and keep a query 
> active for as long as needed to run queries on the standby. This 
> guarantees that a WAL cleanup record is never generated and query 
> conflicts do not occur, as described above. This could be done using 
> contrib/dblink  and pg_sleep(), or via other mechanisms."

I am unclear how you would easily advance the snapshot as each query
completes on the slave.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Greg Smith

Josh Berkus wrote:

Hey, take it easy!  I read Stark's post as tongue-in-cheek, which I
think it was.
  


Yeah, I didn't get that.  We've already exchanged mutual off-list 
apologies for the misunderstanding in both directions, I stopped just 
short of sending flowers.


I did kick off this discussion by noting a clear preference that this not 
wander into any personal finger-pointing.  And I am far too displeased 
with the technical situation here to have much of a sense of humor left 
about it either.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Josh Berkus

> Thank you for combining a small personal attack with a selfish
> commentary about how yours is the only valid viewpoint.  Saves me a lot
> of trouble replying to your messages, can just ignore them instead if
> this is how you're going to act.

Hey, take it easy!  I read Stark's post as tongue-in-cheek, which I
think it was.

Though, Stark, if you're going to be flip, I'd suggest using a smiley
next time.

--Josh


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Josh Berkus

> The part I still don't have good visibility on is how much of the
> necessary SR infrastructure needed to support this communications
> channel is already available in some form.  I had though the walsender
> on the master was already receiving messages sometimes from the
> walreceiver on the standby, but I'm getting the impression from Heikki's
> comments that this not the case at all yet.

I don't think asking for a 2nd connection back from the standby to the
master would be horrible for 9.0.  I think it would be quite reasonable,
actually; even with 2 connections per slave, you could still have quite
a few slaves before the # of connections bogged down the master.

--Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Greg Smith

Josh Berkus wrote:
> Now that I think about it, the xmin thing really doesn't seem
> conceptually difficult.  If the slave just opens a 2nd, special query
> connection back to the master and publishes its oldest xmin there, as
> far as the master is concerned, it's just another query backend.
> Could it be that easy?


Something just like that is in fact already suggested as a workaround in 
the Hot Standby manual:


"The first option is to connect to the primary server and keep a query 
active for as long as needed to run queries on the standby. This 
guarantees that a WAL cleanup record is never generated and query 
conflicts do not occur, as described above. This could be done using 
contrib/dblink  and pg_sleep(), or via other mechanisms."


And the idea of doing it mainly in client land has its attractions. 

The main reason I wandered toward asking about it in the context of SR 
is that there's already this open "Standby delay on idle system" issue 
with Hot Standby, and the suggested resolution for that problem involves 
publishing keep-alive data with timestamps over SR.  While all these 
problems and potential solutions have been floating around for a long 
time, as you pointed out, the little flash of insight I had here was 
that it's possible to bundle these two problems together with a combined 
keep-alive timestamp+xmin message that goes in both directions.  That 
removes one serious Hot Standby issue altogether, and adds an additional 
conflict avoidance mechanism for people who want to enable it, all with 
something that needs to get done sooner or later anyway for sync rep.


The part I still don't have good visibility on is how much of the 
necessary SR infrastructure needed to support this communications 
channel is already available in some form.  I had thought the walsender 
on the master was already receiving messages sometimes from the 
walreceiver on the standby, but I'm getting the impression from Heikki's 
comments that this is not the case at all yet.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Josh Berkus
Greg,

> If you think of it in those terms, the idea that "you need to run PITR
> backup/archive recovery" to not get that behavior isn't an important
> distinction anymore.  If you run SR with the option enabled you could
> get it, any other setup and you won't.

+1.

I always expected that we'd get this kind of behavior with "synch" in
9.1.  I can see that there are two desired modes of behaviour depending
on what the replication config is:

1) Master full-speed, degraded operation on slaves: this is the current
max_standby_delay approach.  It has the advantage of supporting possibly
hundreds of slaves, and certainly dozens.

2) Master burdened, full operation on slaves:  this is the
publish-xmin-back-to-master approach, which IIRC the core team first
discussed at pgCon 2008 before Simon started work, and which you and Tom
seem to think can be done soon.

I can see people wanting to use either mode depending on their use-case.
 Or, for that matter, using both modes to different slaves.

Now that I think about it, the xmin thing really doesn't seem
conceptually difficult.  If the slave just opens a 2nd, special query
connection back to the master and publishes its oldest xmin there, as
far as the master is concerned, it's just another query backend.
Could it be that easy?

Also, I'm serious about what I suggested earlier for "delay" mode.  We
should have an option for cancelled queries to be immediately retried,
if that's feasible.  It would turn something which is now a major
application design issue (lots of cancelled queries) into just degraded
performance.

Overall, though, I'd say that getting 9.0 out the door relatively
on-time is more important than getting it perfect.  "Release early,
release often" isn't just a mantra; it's a very good idea if you want
your project to keep improving and not bog down and fall apart.

--Josh Berkus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Bruce Momjian
Greg Smith wrote:
> Joshua D. Drake wrote:
> > On Sat, 27 Feb 2010 00:43:48 +, Greg Stark  wrote:
> >   
> >> I want my ability to run large batch queries without any performance
> >> or reliability impact on the primary server.
> >> 
> >
> > +1
> >
> > I can use any number of other technologies for high availability.
> >   
> 
> Remove "must be an instant-on failover at the same time" from the 
> requirements and you don't even need 9.0 to handle that, this has been a 
> straightforward to solve problem since 8.2.  It's the combination of HA 
> and queries that make things hard to do.
> 
> If you just want batch queries on another system without being concerned 
> about HA at the same time, the first option is to just fork the base 
> backup and WAL segment delivery to another server and run queries there.

That is a lot of administrative overhead. It is hard to say it is
equivalent to HS.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
> Dimitri Fontaine wrote:
>> Bruce Momjian  writes:
>>> Doesn't the system already adjust the delay based on the length of slave
>>> transactions, e.g. max_standby_delay.  It seems there is no need for a
>>> user switch --- just max_standby_delay really high.
>> Well that GUC looks like it allows to set a compromise between HA and
>> reporting, not to say "do not ever give the priority to the replay while
>> I'm running my reports". At least that's how I understand it.
> 
> max_standby_delay=-1 does that. The documentation needs to be updated to
> reflect that, it currently says:
> 
>> There is no wait-forever setting because of the potential for deadlock which 
>> that setting would introduce. This parameter can only be set in the 
>> postgresql.conf  file or on the server command line. 
> 
> but that is false, -1 means wait forever. Simon removed that option at
> one point, but it was later put back and apparently the documentation
> was never updated.

I've put back the mention of max_standby_delay=-1 option in the docs.
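
For reference, the setting in question is just this in postgresql.conf on
the standby:

  # -1 = wait forever: conflicting standby queries are never cancelled,
  # but WAL replay can fall arbitrarily far behind while they run
  max_standby_delay = -1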

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-27 Thread Heikki Linnakangas
Dimitri Fontaine wrote:
> Bruce Momjian  writes:
>> Doesn't the system already adjust the delay based on the length of slave
>> transactions, e.g. max_standby_delay.  It seems there is no need for a
>> user switch --- just max_standby_delay really high.
> 
> Well that GUC looks like it allows to set a compromise between HA and
> reporting, not to say "do not ever give the priority to the replay while
> I'm running my reports". At least that's how I understand it.

max_standby_delay=-1 does that. The documentation needs to be updated to
reflect that, it currently says:

> There is no wait-forever setting because of the potential for deadlock which 
> that setting would introduce. This parameter can only be set in the 
> postgresql.conf  file or on the server command line. 

but that is false, -1 means wait forever. Simon removed that option at
one point, but it was later put back and apparently the documentation
was never updated.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Aidan Van Dyk wrote:
> Would we (ya, the royal we) be willing to say that if you want the
> benefit of removing the MVCC overhead of long-running queries you need
> to run PITR backup/archive recovery, and if you want SR, you get a
> closed-loop master-follows-slave-xmin behaviour?


To turn that question around a little, I think it's reasonable to say 
that closed-loop master-follows-slave-xmin behavior is only practical to 
consider implementing with SR--and even there, it should be optional 
rather than required until there's more field experience on the whole 
thing.  Whether it's the default or not could take a bit of debate to 
sort out too.


If you think of it in those terms, the idea that "you need to run PITR 
backup/archive recovery" to not get that behavior isn't an important 
distinction anymore.  If you run SR with the option enabled you could 
get it, any other setup and you won't.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Aidan Van Dyk
* Greg Smith  [100226 23:39]:

> Just not having the actual query running on the master is such a  
> reduction in damage that I think it's delivering the essence of what  
> people are looking for regardless.  That it might be possible in some  
> cases to additionally avoid the overhead that comes along with any  
> long-running query is a nice bonus, and it's great the design allows for  
> that possibility.  But if that's only possible with risk, heavy  
> tweaking, and possibly some hacks, I'm not sure that's making the right  
> trade-offs for everyone.

Would we (ya, the royal we) be willing to say that if you want the
benefit of removing the MVCC overhead of long-running queries you need
to run PITR backup/archive recovery, and if you want SR, you get a
closed-loop master-follows-slave-xmin behaviour?

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Greg Stark wrote:
> But if they move from having a plain old PITR warm standby to having
> one they can run queries on they might well assume that the big
> advantage of having the standby to play with is precisely that they
> can do things there that they have never been able to do on the master
> previously without causing damage.


Just not having the actual query running on the master is such a 
reduction in damage that I think it's delivering the essence of what 
people are looking for regardless.  That it might be possible in some 
cases to additionally avoid the overhead that comes along with any 
long-running query is a nice bonus, and it's great the design allows for 
that possibility.  But if that's only possible with risk, heavy 
tweaking, and possibly some hacks, I'm not sure that's making the right 
trade-offs for everyone.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Joshua D. Drake wrote:
> On Sat, 27 Feb 2010 00:43:48 +, Greg Stark  wrote:
>> I want my ability to run large batch queries without any performance
>> or reliability impact on the primary server.
>
> +1
>
> I can use any number of other technologies for high availability.


Remove "must be an instant-on failover at the same time" from the 
requirements and you don't even need 9.0 to handle that, this has been a 
straightforward to solve problem since 8.2.  It's the combination of HA 
and queries that make things hard to do.


If you just want batch queries on another system without being concerned 
about HA at the same time, the first option is to just fork the base 
backup and WAL segment delivery to another server and run queries there.


Some simple filesystem snapshot techniques will also suffice to handle 
it all on the same standby.  Stop warm standby recovery, snapshot, 
trigger the server, run your batch job; once finished, rollback to the 
snapshot, grab the latest segment files, and resume standby catchup.  
Even the lame Linux LVM snapshot features can handle that job--one of my 
coworkers has even scripted the whole thing, it is so common.


And if you have to go live because there's a failover, you're back to 
the same "cold standby" situation a large max_standby_delay puts you at, 
so it's not even very different from what you're going to get in 9.0 if 
this is your priority mix.  The new version is just lowering the 
operational complexity involved.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Joshua D. Drake
On Sat, 27 Feb 2010 00:43:48 +, Greg Stark  wrote:
> On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith 
wrote:
>> This is also the reason why the whole "pause recovery" idea is a
>> fruitless
>> path to wander down.  The whole point of this feature is that people
>> have a
>> secondary server available for high-availability, *first and foremost*,
>> but
>> they'd like it to do something more interesting that leave it idle all
>> the
>> time.  The idea that you can hold off on applying standby updates for
>> long
>> enough to run seriously long reports is completely at odds with the
idea
>> of
>> high-availability.

> I want my ability to run large batch queries without any performance
> or reliability impact on the primary server.

+1

I can use any number of other technologies for high availability.

Joshua D. Drake

 
-- 
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Stark
On Sat, Feb 27, 2010 at 2:43 AM, Greg Smith  wrote:
>
> But if you're running the 8 hour report on the master right now, aren't you
> already exposed to a similar pile of bloat issues while it's going?  If I
> have the choice between "sometimes queries will get canceled" vs. "sometimes
> the master will experience the same long-running transaction bloat issues as
> in earlier versions even if the query runs on the standby", I feel like
> leaning toward the latter at least leads to a problem people are used to.

If they move from running these batch queries on the master to running
them on the slave then sure, the situation will be no worse than
before.

But if they move from having a plain old PITR warm standby to having
one they can run queries on they might well assume that the big
advantage of having the standby to play with is precisely that they
can do things there that they have never been able to do on the master
previously without causing damage.

I agree that having queries randomly and unpredictably canceled is
pretty awful. My argument was that max_standby_delay should default to
infinity on the basis that any other value has to be picked based on
actual workloads and SLAs.

My feeling is that there are probably only two types of configurations
that make sense, a HA replica with a low max_standby_delay or a
reporting replica with a high max_standby_delay. Any attempt to
combine the two on the same system will only work if you know your
application well and can make compromises with both.

I would also like to be able to handle load balancing read-only
queries but that will be really tricky. You want up-to-date data and
you want to be able to run moderately complex queries. That kind of
workload may well require synchronous replication to really work
properly.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Greg Stark wrote:
> Eh? That's not what I meant at all. Actually it's kind of the exact
> opposite of what I meant.


Sorry about that--I think we just hit one of those language usage drift 
bits of confusion.  "Sit in the corner" has a very negative tone to it 
in US English and I interpreted your message badly as a result.  A 
Google search for images using that phrase will quickly show you what I 
mean.



> What I meant was that your description of the "High Availability first
> and foremost" is only one possible use case. Simon in the past
> expressed the same single-minded focus on that use case. It's a
> perfectly valid use case and I would probably agree if we had to
> choose just one it would be the most important.


Sure, there are certainly others, and as much as possible more 
flexibility here is a good thing.  What I was suggesting is that if the 
only good way to handle long-running queries has no choice but to 
sacrifice high-availability, which is the situation if 
max_standby_delay is the approach you use, then the most obvious users 
for this feature are not being well served by that situation.  I would 
guess a large portion of the users looking forward to Hot Standby are in 
the "have an underutilized high-availability standby I'd like to use for 
offloading long running reports", and if there is no way to serve them 
well this feature is missing the mark a bit. 

You really can't do any better without better master/standby integration 
though, and as pointed out a couple of times here that was considered 
and just not followed through on yet.  I'm increasingly concerned that 
nothing else will really do though.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Greg Smith wrote:
> Heikki Linnakangas wrote:
> > One such landmine is that the keepalives need to flow from client to
> > server while the WAL records are flowing from server to client. We'll
> > have to crack that problem for synchronous replication too, but I think
> > that alone is a big enough problem to make this 9.1 material.
> >   
> 
> This seems to be the real sticking point then, given that the 
> xmin/PGPROC side on the master seems logically straightforward.  For 
> some reason I thought the sync rep feature had the reverse message flow 
> already going, and that some other sort of limitation just made it 
> impractical to merge into the main codebase this early.  My hope was 
> that just this particular part could get cherry-picked out of there, and 
> that it might even have been thought about already in that context given 
> the known HS keepalive "serious issue".  If there was a solution or 
> partial solution in progress to that floating around, my thought was 
> that just piggybacking this extra xid info on top of it would be easy 
> enough.
> 
> If there's not already a standby to primary communications backchannel 
> implementation available that can be harvested from that work, your 
> suggestion that this may not be feasible at all for 9.0 seems like a 
> more serious concern than I had thought it was going to be.

I suspect the master could connect to the slave to pull an xid.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Heikki Linnakangas wrote:
> One such landmine is that the keepalives need to flow from client to
> server while the WAL records are flowing from server to client. We'll
> have to crack that problem for synchronous replication too, but I think
> that alone is a big enough problem to make this 9.1 material.


This seems to be the real sticking point then, given that the 
xmin/PGPROC side on the master seems logically straightforward.  For 
some reason I thought the sync rep feature had the reverse message flow 
already going, and that some other sort of limitation just made it 
impractical to merge into the main codebase this early.  My hope was 
that just this particular part could get cherry-picked out of there, and 
that it might even have been thought about already in that context given 
the known HS keepalive "serious issue".  If there was a solution or 
partial solution in progress to that floating around, my thought was 
that just piggybacking this extra xid info on top of it would be easy 
enough.


If there's not already a standby to primary communications backchannel 
implementation available that can be harvested from that work, your 
suggestion that this may not be feasible at all for 9.0 seems like a 
more serious concern than I had thought it was going to be.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Bruce Momjian wrote:
> Well, I think the choice is either you delay vacuum on the master for 8
> hours or pile up 8 hours of WAL files on the slave, and delay
> application, and make recovery much slower.  It is not clear to me which
> option a user would prefer because the bloat on the master might be
> permanent.


But if you're running the 8 hour report on the master right now, aren't 
you already exposed to a similar pile of bloat issues while it's going?  
If I have the choice between "sometimes queries will get canceled" vs. 
"sometimes the master will experience the same long-running transaction 
bloat issues as in earlier versions even if the query runs on the 
standby", I feel like leaning toward the latter at least leads to a 
problem people are used to. 

This falls into the principle of least astonishment category to me.  
Testing the final design for how transactions get canceled here led me 
to some really unexpected situations, and the downside for a mistake is 
"your query is lost".  Had I instead discovered that sometimes 
long-running transactions on the standby can ripple back to cause a 
maintenance slowdown on the master, that's not great.  But it would not 
have been so surprising, and it won't result in lost query results. 

I think people will expect that their queries cancel because of things 
like DDL changes.  And the existing knobs allow inserting some slack for 
things like locks taking a little bit of time to acquire sometimes.  
What I don't think people will see coming is that a routine update on an 
unrelated table is going to kill a query they might have been waiting 
hours for the result of, just because that update crossed an autovacuum 
threshold for the other table and introduced a dead row cleanup.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Stark
On Sat, Feb 27, 2010 at 1:53 AM, Greg Smith  wrote:
> Greg Stark wrote:
>>
>> Well you can go sit in the same corner as Simon with your high
>> availability servers.
>>
>> I want my ability to run large batch queries without any performance
>> or reliability impact on the primary server.
>>
>
> Thank you for combining a small personal attack with a selfish commentary
> about how yours is the only valid viewpoint.  Saves me a lot of trouble
> replying to your messages, can just ignore them instead if this is how
> you're going to act.

Eh? That's not what I meant at all. Actually it's kind of the exact
opposite of what I meant.

What I meant was that your description of the "High Availability first
and foremost" is only one possible use case. Simon in the past
expressed the same single-minded focus on that use case. It's a
perfectly valid use case and I would probably agree if we had to
choose just one it would be the most important.

But we don't have to choose just one. There are other valid use cases
such as load balancing and isolating your large batch queries from
your production systems. I don't want us to throw out all these other
use cases because we only consider high availability as the only use
case we're interested in.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Greg Stark wrote:
> Well you can go sit in the same corner as Simon with your high
> availability servers.
>
> I want my ability to run large batch queries without any performance
> or reliability impact on the primary server.


Thank you for combining a small personal attack with a selfish 
commentary about how yours is the only valid viewpoint.  Saves me a lot 
of trouble replying to your messages, can just ignore them instead if 
this is how you're going to act.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Greg Smith wrote:
> You can think of the idea of passing an xmin back from the standby as 
> being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no 
> standby queries are running, but grows in size to match longer ones. And 
> you don't have to have to know anything to set it correctly; just toggle 
> on the proposed "feedback xid from the standby" feature and you're safe.

Yes, there is no question there is value in passing the xid back to the
master.  My point is that it is like your blog entry:


http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html

you can't have all the options:

1.  aggressive vacuum on master
2.  fast recovery of slave
3.  no query cancel on slave

Again, pick two.  Passing the xid back to the master gives you #2 and
#3, and that might be good for some people, but not others.  Do we have
any idea which options most administrators would want?  If we get xid
passback to the master, do we keep the other configuration options?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Stark
On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith  wrote:
> This is also the reason why the whole "pause recovery" idea is a fruitless
> path to wander down.  The whole point of this feature is that people have a
> secondary server available for high-availability, *first and foremost*, but
> they'd like it to do something more interesting that leave it idle all the
> time.  The idea that you can hold off on applying standby updates for long
> enough to run seriously long reports is completely at odds with the idea of
> high-availability.

Well you can go sit in the same corner as Simon with your high
availability servers.

I want my ability to run large batch queries without any performance
or reliability impact on the primary server.

You can have one or the other but you can't get both. If you set
max_standby_delay low then you get your high availability server, if
you set it high you get a useful report server.

If you build sync replication, which we don't have today and which will
open another huge can of usability worms when we haven't even finished
bottling the two we've already opened, then you lose the lack of impact
on the primary. Suddenly the queries you run on the slaves cause your
production database to bloat. Plus you have extra network connections
which take resources on your master and have to be kept up at all
times or you lose your slaves.

I think the design constraint of not allowing any upstream data flow
is actually very valuable. Eventually we'll have it for sync
replication but it's much better that we've built things incrementally
and can be sure that nothing really depends on it for basic
functionality. This is what allows us to know that the slave imposes
no reliability impact on the master. It's what allows us to know that
everything will work identically regardless of whether you have a
walreceiver running or are running off archived log files.

Remember I wanted to entirely abstract away the walreceiver and allow
multiple wal communication methods. I think it would make more sense
to use something like Spread to distribute the logs so the master only
has to send them once and as many slaves as you want can pick them up.
The current architecture doesn't scale very well if you want to have
hundreds of slaves for one master.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Greg Smith wrote:
> Bruce Momjian wrote:
> > Doesn't the system already adjust the delay based on the length of slave
> > transactions, e.g. max_standby_delay.  It seems there is no need for a
> > user switch --- just max_standby_delay really high.
> >   
> 
> The first issue is that you're basically saying "I don't care about high 
> availability anymore" when you increase max_standby_delay to a high 
> value.  Want to offload an 8 hour long batch report every day to the 
> standby?  You can do it with max_standby_delay=8 hours.  But the day 
> your master crashes 7 hours into that, you're in for a long wait before 
> your standby is available while it replays all the queued up segments.  
> Your 'hot standby' has actually turned into the old form of 'cold 
> standby' just when you need it to be responsive.

Well, I think the choice is either you delay vacuum on the master for 8
hours or pile up 8 hours of WAL files on the slave, and delay
application, and make recovery much slower.  It is not clear to me which
option a user would prefer because the bloat on the master might be
permanent.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Bruce Momjian wrote:
> 5 Early cleanup of data still visible to the current query's
>   snapshot
>
> #5 could be handled by using vacuum_defer_cleanup_age on the master.
> Why is vacuum_defer_cleanup_age not listed in postgresql.conf?


I noticed that myself and fired off a corrective patch to Simon 
yesterday; he said it was intentional, but I'm not sure why that is yet. 
We'll sort that out.


You are correct that my suggestion is targeting primarily #5 on this 
list. There are two problems with the possible solutions using that 
parameter though:


-vacuum_defer_cleanup_age is set in a unit that people cannot be 
expected to work in--transaction IDs. The UI is essentially useless, 
and there's no obvious way to make a better one. The best you can do 
will still be really fragile.


-If you increase vacuum_defer_cleanup_age, it's active all the time. 
You're basically making every single transaction that could be cleaned 
up pay for the fact that a query *might* be running on the standby it 
needs to avoid.


You can think of the idea of passing an xmin back from the standby as 
being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no 
standby queries are running, but grows in size to match longer ones. And 
you don't have to know anything to set it correctly; just toggle 
on the proposed "feedback xid from the standby" feature and you're safe.


Expecting that anyone will ever set vacuum_defer_cleanup_age correctly 
in the field in its current form is pretty unreasonable I think. Since 
there's no timestamp-based memory of past xid activity, it's difficult 
to convert it to that form instead, and I think something in terms of 
time is what people would like to set this in.
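
To make the unit problem concrete, this is all you can do today (the
number is purely illustrative):

  # postgresql.conf on the master
  # The unit is transaction IDs, not time: 100000 xids could be minutes
  # or days of deferred cleanup depending on the write rate, which is
  # exactly why it's so hard to pick a good value
  vacuum_defer_cleanup_age = 100000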


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Greg Smith

Bruce Momjian wrote:
> Doesn't the system already adjust the delay based on the length of slave
> transactions, e.g. max_standby_delay.  It seems there is no need for a
> user switch --- just max_standby_delay really high.


The first issue is that you're basically saying "I don't care about high 
availability anymore" when you increase max_standby_delay to a high 
value.  Want to offload an 8 hour long batch report every day to the 
standby?  You can do it with max_standby_delay=8 hours.  But the day 
your master crashes 7 hours into that, you're in for a long wait before 
your standby is available while it replays all the queued up segments.  
Your 'hot standby' has actually turned into the old form of 'cold 
standby' just when you need it to be responsive.


This is also the reason why the whole "pause recovery" idea is a 
fruitless path to wander down.  The whole point of this feature is that 
people have a secondary server available for high-availability, *first 
and foremost*, but they'd like it to do something more interesting than 
leave it idle all the time.  The idea that you can hold off on applying 
standby updates for long enough to run seriously long reports is 
completely at odds with the idea of high-availability.


The second major problem is that the day the report actually takes 8.1 
hours instead, because somebody else ran another report that slowed you 
down a little, you're screwed if that's something you depend on being 
available--it just got canceled only *after* wasting 8 hours of 
reporting resource time.


max_standby_delay is IMHO only useful for allowing non-real-time web-app 
style uses of HS (think "Facebook status updates"), where you say "I'm 
OK giving people slightly out of date info sometimes if it lets me split 
the query load over two systems".  Set max_standby_delay to a few 
seconds or maybe a minute, enough time to service a typical user query, 
make your app tolerate the occasional failed query and only send big 
ones to the master, and you've just scaled up all the small ones.  
Distributed queries with "eventual consistency" on all nodes is where 
many of the web app designs are going, and this feature is a reasonable 
match for that use case.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Dimitri Fontaine wrote:
> Bruce Momjian  writes:
> > Doesn't the system already adjust the delay based on the length of slave
> > transactions, e.g. max_standby_delay.  It seems there is no need for a
> > user switch --- just max_standby_delay really high.
> 
> Well that GUC looks like it allows to set a compromise between HA and
> reporting, not to say "do not ever give the priority to the replay while
> I'm running my reports". At least that's how I understand it.

Well, if you set it high, it effectively is that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Dimitri Fontaine
Bruce Momjian  writes:
> Doesn't the system already adjust the delay based on the length of slave
> transactions, e.g. max_standby_delay.  It seems there is no need for a
> user switch --- just max_standby_delay really high.

Well, that GUC looks like it allows you to set a compromise between HA and
reporting, not to say "do not ever give priority to the replay while
I'm running my reports". At least that's how I understand it.

The feedback loop might get expensive on the master server when running
reporting queries on the slave, unless you can "pause" it explicitly I
think. I don't see how the system will guess that you're running a
reporting server rather than a HA node, and max_standby_delay is just a
way to tell the standby to please be nice in case of abuse.

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Dimitri Fontaine wrote:
> Tom Lane  writes:
> > Well, as Heikki said, a stop-and-go WAL management approach could deal
> > with that use-case.  What I'm concerned about here is the complexity,
> > reliability, maintainability of trying to interlock WAL application with
> > slave queries in any sort of fine-grained fashion.
> 
> Some admin functions for Hot Standby were removed from the path to ease
> its integration, there was a pause() and resume() feature.
> 
> I think that offering this explicit control to the user would allow them
> to choose between HA setup and reporting setup easily enough: just pause
> the replay when running the reporting, resume it to get fresh data
> again. If you don't pause, any query can get killed, replay is the
> priority.

Doesn't the system already adjust the delay based on the length of slave
transactions, e.g. max_standby_delay.  It seems there is no need for a
user switch --- just max_standby_delay really high.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Tom Lane
Greg Stark  writes:
> On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane  wrote:
>> There's *definitely* not going to be enough information in the WAL
>> stream coming from a master that doesn't think it has HS slaves.
>> We can't afford to record all that extra stuff in installations for
>> which it's just useless overhead.  BTW, has anyone made any attempt
>> to measure the performance hit that the patch in its current form is
>> creating via added WAL entries?

> What extra entries?

Locks, just for starters.  I haven't read enough of the code yet to know
what else Simon added.  In the past it's not been necessary to record
any transient information in WAL, but now we'll have to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Dimitri Fontaine
Tom Lane  writes:
> Well, as Heikki said, a stop-and-go WAL management approach could deal
> with that use-case.  What I'm concerned about here is the complexity,
> reliability, maintainability of trying to interlock WAL application with
> slave queries in any sort of fine-grained fashion.

Some admin functions for Hot Standby were removed from the patch to ease
its integration; there was a pause() and resume() feature.

I think that offering this explicit control to the user would allow them
to choose between an HA setup and a reporting setup easily enough: just pause
the replay when running the reports, and resume it to get fresh data
again. If you don't pause, any query can get killed; replay is the
priority.

Now as far as the feedback loop is concerned, I guess the pause()
function would cause the slave to stop publishing any xmin in the
master's procarray so that it's free to vacuum and archive whatever it
wants to.

Should the slave accumulate too much lag, it will resume from the
archive rather than live from the SR link.

How much does that help?

Regards,
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
bruce wrote:
>   4 The standby waiting longer than max_standby_delay to acquire a
...
> #4 can be controlled by max_standby_delay, where a large value only
> delays playback during crash recovery --- again, a rare occurance.

One interesting feature is that max_standby_delay will _only_ delay if
it encounters a conflict when applying WAL, meaning only if there are
long-running transactions.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Tom Lane
Greg Stark  writes:
> Why shouldn't it have any queries at walreceiver startup? It has any
> xlog segments that were copied from the master and any it can find in
> the archive, it could easily reach a consistent point long before it
> needs to connect to the master. If you really want to protect your
> master from any additional overhead you don't currently need to
> configure a streaming connection at all, you can just use the file
> shipping interface.

There's *definitely* not going to be enough information in the WAL
stream coming from a master that doesn't think it has HS slaves.
We can't afford to record all that extra stuff in installations for
which it's just useless overhead.  BTW, has anyone made any attempt
to measure the performance hit that the patch in its current form is
creating via added WAL entries?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Josh Berkus

> Well, as Heikki said, a stop-and-go WAL management approach could deal
> with that use-case.  What I'm concerned about here is the complexity,
> reliability, maintainability of trying to interlock WAL application with
> slave queries in any sort of fine-grained fashion.

This sounds a bit brute-force, but what about simply having some form of
automatic query retry on the slave?

--Josh Berkus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Tom Lane
Greg Stark  writes:
> On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane  wrote:
>> I don't see a "substantial additional burden" there.  What I would
>> imagine is needed is that the slave transmits a single number back
>> --- its current oldest xmin --- and the walsender process publishes
>> that number as its transaction xmin in its PGPROC entry on the master.

> And when we want to support cascading slaves?

So?  Fits right in.  The walsender on the first-level slave is
advertising an xmin from the second-level one, which will be included in
what's passed back up to the master.

> Or when you want to bring up a new slave and it suddenly starts
> advertising a new xmin that's older than the current oldestxmin?

How's it going to do that, when it has no queries at the instant
of startup?

> But in any case if I were running a reporting database I would want it
> to just stop replaying logs for a few hours while my big batch report
> runs, not cause the master to be unable to vacuum any dead records for
> hours. That defeats much of the purpose of running the queries on the
> slave.

Well, as Heikki said, a stop-and-go WAL management approach could deal
with that use-case.  What I'm concerned about here is the complexity,
reliability, maintainability of trying to interlock WAL application with
slave queries in any sort of fine-grained fashion.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Josh Berkus
On 2/26/10 6:57 AM, Richard Huxton wrote:
> 
> Can we not wait to cancel the transaction until *any* new lock is
> attempted though? That should protect all the single-statement
> long-running transactions that are already underway. Aggregates etc.

I like this approach.  Is it fragile in some non-obvious way?

--Josh Berkus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Bruce Momjian
Heikki Linnakangas wrote:
> > How to handle situations where the standby goes away for a while,
> > such as a network outage, so that it doesn't block the master from ever
> > cleaning up dead tuples is a concern.
> 
> Yeah, that's another issue that needs to be dealt with. You'd probably
> need some kind of a configurable escape valve in the master, to let it
> ignore a standby's snapshot once it gets too old.
> 
> > But I do know that the current Hot Standby implementation is going to be
> > frustrating to configure correctly for people.
> 
> Perhaps others who are not as deep into the code as I am will have a
> better view on this, but I seriously don't think that's such a big
> issue. I think the max_standby_delay setting is quite intuitive and easy
> to explain. Sure, it would better if there was no tradeoff between
> killing queries and stalling recovery, but I don't think it'll be that
> hard to understand the tradeoff.

Let's look at the five documented cases of query conflict (from our manual):

1 Access Exclusive Locks from primary node, including both explicit
  LOCK commands and various DDL actions 

2 Dropping tablespaces on the primary while standby queries are
  using those tablespaces for temporary work files (work_mem
  overflow) 

3 Dropping databases on the primary while users are connected to
  that database on the standby.  

4 The standby waiting longer than max_standby_delay to acquire a
  buffer cleanup lock.  

5 Early cleanup of data still visible to the current query's
  snapshot

We might have a solution to #1 by only cancelling queries that try to
take locks.

#2 and #3 seem like rare occurrences.

#4 can be controlled by max_standby_delay, where a large value only
delays playback during crash recovery --- again, a rare occurrence.

#5 could be handled by using vacuum_defer_cleanup_age on the master.

Why is vacuum_defer_cleanup_age not listed in postgresql.conf?

In summary, I think passing snapshots to the master is not something
possible for 9.0, and ideally we will never need to add that feature.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Robert Haas
On Fri, Feb 26, 2010 at 10:21 AM, Heikki Linnakangas
 wrote:
> Richard Huxton wrote:
>> Can we not wait to cancel the transaction until *any* new lock is
>> attempted though? That should protect all the single-statement
>> long-running transactions that are already underway. Aggregates etc.
>
> Hmm, that's an interesting thought. You'll still need to somehow tell
> the victim backend "you have to fail if you try to acquire any more
> locks", but a single per-backend flag in the procarray would suffice.
>
> You could also clear the flag whenever you free the last snapshot in the
> transaction (ie. between each query in read committed mode).

Wow, that seems like it would help a lot.  Although I'm not 100% sure
I follow all the details of how this works.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Richard Huxton

Replying to my own post - first sign of madness...

Let's see if I've got the concepts clear here, and hopefully my thinking 
it through will help others reading the archives.


There are two queues:
1. Cleanup on the master
2. Replay on the slave

Running write queries on the master adds to both queues.
Running (read-only) queries on the slave prevents you from removing from both 
queues.



There are two interesting measurements of "age"/"size":
1. Oldest item in / length of queue (knowable)
2. How long will it take to clear the queue (estimable at best)

You'd like to know #2 to keep up with your workload. Unfortunately, you 
can't for certain unless you have control over new incoming queries (on 
both master and slave).


You might want four separate GUCs for the two measurements on the two 
queues. We currently have two that (sort of) match #1 "Oldest item" 
(vacuum_defer_cleanup_age, max_standby_delay).



Delaying replay on a slave has no effect on the master. If a slave falls 
too far behind it's responsible for catch-up (via normal WAL archives).


There is no point in delaying cleanup on the master unless it's going to 
help one or more slaves. In fact, you don't want to start delaying 
cleanup until you have to, otherwise you're wasting your delay time. 
This seems to be the case with vacuum_defer_cleanup_age. If I have a 
heavily-updated table and I defer vacuuming then when any given query 
starts on the slave it's going to be half used up already.


There's also no point in deferring cleanup on the master if the standby 
is already waiting on a conflict that will cause its queries to be 
cancelled anyway. Not only won't it help, but it might make things worse 
since transactions will be cancelled, the conflict will be replayed and 
(presumably) queries will be re-submitted only to be cancelled again.


This is what Greg Smith's discussion of the keep-alives was about. 
Giving the master enough information to be smarter about cleanup (and 
making the conflicts more fine-grained).


The situation with deferring on one or both ends of the process just gets 
more complicated with multiple slaves. There are all sorts of unpleasant 
feedback loops I can envisage there.


For the case of a single slave being used to run long reporting queries 
the ideal scenario would be the following. Master starts deferring 
vacuum activity just before the query starts. When that times out, the 
slave will receive the cleanup info, refuse to replay it and start its 
delay. This gives you a total available query time of:
 natural time between vacuums + vacuum delay + WAL transfer time + 
standby delay



I can think of five useful things we should be doing (and might be 
already - don't know).


1. On the master, deduce whether the slave is already waiting on a 
query. If so, don't bother delaying cleanup. Clearly you don't want to 
be signalling hundreds of times a second though. Does the slave pause 
fetching via streaming replication if replay is blocked on a query? 
Could we signal "half-way to max-age" or some such?


2. Perhaps simpler than trying to make the master smarter, just allow 
SET this_transaction_is_probably_a_long_one=true on the slave. That (a) 
clears the queue on the slave and (b) sends the signal to the master 
which then starts deferring vacuum.


3. Do a burst of cleanup activity on the master after blocking. This 
should concentrate conflicts together when they reach the slave. Perhaps 
vacuum_defer_cleanup_age should be vacuum_deferred_queue_size and 
measure the amount of work to do, rather than the max age of the oldest 
cleanup (if I've understood correctly).


4. Do a burst of replay on the slave after blocking. Perhaps every time 
it cancels a transaction it should replay at least half the queued WAL 
before letting new transactions start. Or perhaps it replays any vacuum 
activity it comes across and then stops. That should sync with #2 
assuming the slave doesn't lag the master too much.


5. I've been mixing "defer" and "delay", as do the docs. We should 
probably settle on one or the other. I think defer conveys the meaning 
more precisely, but what about non-native English speakers?


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Heikki Linnakangas
Richard Huxton wrote:
> Can we not wait to cancel the transaction until *any* new lock is
> attempted though? That should protect all the single-statement
> long-running transactions that are already underway. Aggregates etc.

Hmm, that's an interesting thought. You'll still need to somehow tell
the victim backend "you have to fail if you try to acquire any more
locks", but a single per-backend flag in the procarray would suffice.

You could also clear the flag whenever you free the last snapshot in the
transaction (ie. between each query in read committed mode).
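
To make that concrete, here is a rough standalone sketch of the idea (all
names are invented, this is not the actual procarray code; it only models
the flag's lifetime):

/* Toy model of the "fail on the next lock attempt" flag; invented names. */
#include <stdbool.h>
#include <stdio.h>

typedef struct BackendSlot
{
    bool conflict_pending;   /* set by recovery when this backend must fail */
    int  active_snapshots;   /* snapshots currently held by the backend     */
} BackendSlot;

/* Recovery marks the backend instead of cancelling it outright. */
static void mark_conflict(BackendSlot *be) { be->conflict_pending = true; }

/* Called by the backend just before it would acquire any new lock. */
static bool acquire_lock_checked(BackendSlot *be, const char *lockname)
{
    if (be->conflict_pending)
    {
        printf("ERROR: canceling statement due to conflict with recovery (%s)\n",
               lockname);
        return false;
    }
    printf("lock on %s acquired\n", lockname);
    return true;
}

/* Called whenever a snapshot is released; dropping the last one is the
 * "between queries in read committed mode" point where the flag clears. */
static void release_snapshot(BackendSlot *be)
{
    if (--be->active_snapshots == 0)
        be->conflict_pending = false;
}

int main(void)
{
    BackendSlot be = { false, 1 };

    acquire_lock_checked(&be, "pgbench_accounts"); /* fine                   */
    mark_conflict(&be);                            /* recovery wants to move */
    acquire_lock_checked(&be, "pgbench_history");  /* now fails              */
    release_snapshot(&be);                         /* query ends, flag gone  */
    be.active_snapshots = 1;                       /* next query starts      */
    acquire_lock_checked(&be, "pgbench_branches"); /* fine again             */
    return 0;
}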

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Richard Huxton

On 26/02/10 14:45, Heikki Linnakangas wrote:

Richard Huxton wrote:

On 26/02/10 08:33, Greg Smith wrote:

I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:


Hmm - this I'd already figured out for myself. It's just occurred to me
that this could well be the case between databases too. Database A gets
vacuumed, B gets its queries kicked off on the standby.


No, it's per-database already. Only queries in the same database are
canceled.


That's a relief.


Dumb non-hacker question: why do we cancel all transactions rather than
just those with "ACCESS SHARE" on the vacuumed table in question? Is it
the simple fact that we don't know what table this particular section of
WAL affects, or is it the complexity of tracking all this info?


The problem is that even if transaction X doesn't have an (access share)
lock on the vacuumed table at the moment, it might take one in the
future. Simon proposed mechanisms for storing the information about
vacuumed tables in shared memory, so that if X takes the lock later on
it will get canceled at that point, but that's 9.1 material.


I see - we'd need to age the list of vacuumed tables too, so when the 
oldest transactions complete the correct flags get cleared.


Can we not wait to cancel the transaction until *any* new lock is 
attempted though? That should protect all the single-statement 
long-running transactions that are already underway. Aggregates etc.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Heikki Linnakangas
Richard Huxton wrote:
> On 26/02/10 08:33, Greg Smith wrote:
>> I'm not sure what you might be expecting from the above combination, but
>> what actually happens is that many of the SELECT statements on the table
>> *that isn't even being updated* are canceled. You see this in the logs:
> 
> Hmm - this I'd already figured out for myself. It's just occurred to me
> that this could well be the case between databases too. Database A gets
> vacuumed, B gets its queries kicked off on the standby.

No, it's per-database already. Only queries in the same database are
canceled.

> Dumb non-hacker question: why do we cancel all transactions rather than
> just those with "ACCESS SHARE" on the vacuumed table in question? Is it
> the simple fact that we don't know what table this particular section of
> WAL affects, or is it the complexity of tracking all this info?

The problem is that even if transaction X doesn't have an (access share)
lock on the vacuumed table at the moment, it might take one in the
future. Simon proposed mechanisms for storing the information about
vacuumed tables in shared memory, so that if X takes the lock later on
it will get canceled at that point, but that's 9.1 material.
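
A toy of what that could look like (names invented, and the real thing
would live in shared memory and carry rather more state than this):

/* Toy model of deferring conflict resolution to lock-acquisition time.
 * Invented names; not the actual shared-memory mechanism. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_VACUUMED 8

typedef unsigned int Oid;
typedef unsigned int Xid;

typedef struct
{
    Oid relid;                 /* relation that was cleaned up            */
    Xid latest_removed_xid;    /* newest xid whose row versions are gone  */
} VacuumedRel;

static VacuumedRel recently_vacuumed[MAX_VACUUMED];
static int nvacuumed = 0;

/* Replay of a cleanup record remembers what was cleaned instead of
 * cancelling every query straight away. */
static void remember_vacuum(Oid relid, Xid latest_removed)
{
    recently_vacuumed[nvacuumed % MAX_VACUUMED] =
        (VacuumedRel){ relid, latest_removed };
    nvacuumed++;
}

/* Checked when a transaction takes a new lock on a relation: only conflict
 * if this snapshot could still need rows removed from this relation. */
static bool lock_relation(Oid relid, Xid my_snapshot_xmin)
{
    int n = (nvacuumed < MAX_VACUUMED) ? nvacuumed : MAX_VACUUMED;

    for (int i = 0; i < n; i++)
    {
        if (recently_vacuumed[i].relid == relid &&
            recently_vacuumed[i].latest_removed_xid >= my_snapshot_xmin)
        {
            printf("ERROR: snapshot too old for relation %u\n", relid);
            return false;
        }
    }
    printf("relation %u locked, no conflict\n", relid);
    return true;
}

int main(void)
{
    remember_vacuum(16384, 900);   /* one table was cleaned on the master */

    lock_relation(16390, 850);     /* different table: unaffected         */
    lock_relation(16384, 950);     /* new enough snapshot: fine           */
    lock_relation(16384, 850);     /* old snapshot, same table: canceled  */
    return 0;
}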

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Richard Huxton

On 26/02/10 14:10, Heikki Linnakangas wrote:


Ideally the standby would stash away the old pages or tuples somewhere
so that it can still access them even after replaying the WAL records
that remove them from the main storage. I realize that's not going to
happen any time soon because it's hard to do, but that would really be
the most robust fix possible.


Something like snapshotting a filesystem, so updates continue while 
you're still looking at a static version.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Richard Huxton

On 26/02/10 08:33, Greg Smith wrote:

 There are a number of HS
tunables that interact with one another, and depending on your priorities a
few ways you can try to optimize the configuration for what I expect to
be common use cases for this feature.


> I've written a blog entry at

http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html
that tries to explain all that background clearly,


It did too. Thanks for the nice summary people can be pointed at.


I'm not sure what you might be expecting from the above combination, but
what actually happens is that many of the SELECT statements on the table
*that isn't even being updated* are canceled. You see this in the logs:


Hmm - this I'd already figured out for myself. It's just occurred to me 
that this could well be the case between databases too. Database A gets 
vacuumed, B gets its queries kicked off on the standby. Granted lots of 
people just have the one main DB, but even so...



LOG: restored log file "000100A5" from archive
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: SELECT sum(abalance) FROM pgbench_accounts;

Basically, every time a WAL segment appears that wipes out a tuple that
SELECT expects should still be visible, because the dead row left behind
by the update has been vacuumed away, the query is canceled. This
happens all the time the way I've set this up, and I don't feel like
this is a contrived demo. Having a long-running query on the standby
while things get updated and then periodically autovacuumed on the
primary is going to be extremely common in the sorts of production
systems I expect want HS the most.
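
As far as I understand it, the standby's decision when a cleanup record
arrives boils down to something like the following (a hand-waved sketch
with invented names and numbers, not the real recovery code). Note that the
test looks only at xids, never at which tables the query actually touches:

/* Hand-waved sketch of the standby-side decision; invented names/numbers. */
#include <stdio.h>

typedef unsigned int Xid;

typedef struct
{
    const char *name;
    Xid snapshot_xmin;   /* oldest xid this query might still need to see */
} StandbyQuery;

/* Conflict if the cleanup removed row versions the query could still see. */
static int conflicts(Xid latest_removed_xid, const StandbyQuery *q)
{
    return latest_removed_xid >= q->snapshot_xmin;
}

int main(void)
{
    int max_standby_delay = 30;       /* seconds, assumed setting          */
    Xid latest_removed_xid = 41230;   /* carried by the cleanup WAL record */
    StandbyQuery queries[] = {
        { "report on pgbench_accounts", 41000 },   /* old snapshot          */
        { "quick lookup",               42000 },   /* started after cleanup */
    };

    for (int i = 0; i < 2; i++)
    {
        if (!conflicts(latest_removed_xid, &queries[i]))
        {
            printf("'%s' does not conflict, left alone\n", queries[i].name);
            continue;
        }
        /* Replay stalls here; once max_standby_delay has been used up
         * without the query finishing, the query is cancelled instead. */
        printf("'%s' conflicts: replay waits up to %d s, then cancels it\n",
               queries[i].name, max_standby_delay);
    }
    return 0;
}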


I can see pretty much everyone wanting HS+SR. Thousands of small DBs running 
on VMs for a start. Free mostly-live backup? Got to be a winner.


Dumb non-hacker question: why do we cancel all transactions rather than 
just those with "ACCESS SHARE" on the vacuumed table in question? Is it 
the simple fact that we don't know what table this particular section of 
WAL affects, or is it the complexity of tracking all this info?



If you're running a system that also is using Streaming Replication,
there is a much better approach possible.



"Requires keep-alives with timestamps to be added to sync rep feature"

If those keep-alives flowed in both directions, and included both
timestamps *and* xid visibility information, the master could easily be
configured to hold open xid snapshots needed for long running queries on
the standby when that was necessary.
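
To make the idea concrete, the feedback message and its use on the master
might look roughly like this (an entirely invented sketch, not a real
protocol definition):

/* Invented sketch of a standby->master feedback message. */
#include <stdint.h>
#include <stdio.h>

typedef struct StandbyFeedback
{
    int64_t  sent_at;       /* standby clock, e.g. microseconds since epoch */
    uint32_t oldest_xmin;   /* oldest xid any standby query still needs     */
} StandbyFeedback;

/* On the master: never let the cleanup horizon advance past what the
 * standby says it still needs. */
static uint32_t effective_cleanup_horizon(uint32_t local_oldest_xmin,
                                          const StandbyFeedback *fb)
{
    return (fb->oldest_xmin < local_oldest_xmin) ? fb->oldest_xmin
                                                 : local_oldest_xmin;
}

int main(void)
{
    StandbyFeedback fb = { 0, 41000 };   /* a long report holds xmin back */
    uint32_t local_oldest = 42500;       /* master's own oldest xmin      */

    printf("vacuum may clean up behind xid %u\n",
           effective_cleanup_horizon(local_oldest, &fb));
    return 0;
}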


Presumably meaning we need *another* config setting to prevent excessive 
bloat on a heavily updated table on the master.


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration

2010-02-26 Thread Heikki Linnakangas
Greg Smith wrote:
> Attached is a tar file with some test case demo scripts that demonstrate
> the worst of the problems here IMHO.

Thanks for that! We've been discussing this for ages, so it's nice to
have a concrete example.

> I don't want to belittle that work because it's been important to make
> HS a useful standalone feature, but I feel like that's all triage rather
> than looking for the most robust fix possible.

Ideally the standby would stash away the old pages or tuples somewhere
so that it can still access them even after replaying the WAL records
that remove them from the main storage. I realize that's not going to
happen any time soon because it's hard to do, but that would really be
the most robust fix possible.

> I don't know how difficult the keepalive feature was expected to be, and
> there's certainly plenty of potential landmines in this whole xid export
> idea.

One such landmine is that the keepalives need to flow from client to
server while the WAL records are flowing from server to client. We'll
have to crack that problem for synchronous replication too, but I think
that alone is a big enough problem to make this 9.1 material.

> How to handle situations where the standby goes away for a while,
> such as a network outage, so that it doesn't block the master from ever
> cleaning up dead tuples is a concern.

Yeah, that's another issue that needs to be dealt with. You'd probably
need some kind of a configurable escape valve in the master, to let it
ignore a standby's snapshot once it gets too old.
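
Something along these lines, perhaps (a tiny sketch with made-up names,
ignoring xid wraparound for brevity):

/* Toy escape valve: stop honouring a standby's reported snapshot once it
 * lags the master by more than a configured number of xids. */
#include <stdio.h>

typedef unsigned int Xid;

static Xid horizon_with_escape_valve(Xid master_oldest_xmin,
                                     Xid standby_oldest_xmin,
                                     Xid max_standby_lag_xids)
{
    if (master_oldest_xmin - standby_oldest_xmin > max_standby_lag_xids)
        return master_oldest_xmin;   /* standby too far behind: ignore it */
    return standby_oldest_xmin;      /* otherwise hold cleanup back       */
}

int main(void)
{
    printf("horizon = %u\n", horizon_with_escape_valve(50000, 49000, 5000));
    printf("horizon = %u\n", horizon_with_escape_valve(50000, 20000, 5000));
    return 0;
}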

> But I do know that the current Hot Standby implementation is going to be
> frustrating to configure correctly for people.

Perhaps others who are not as deep into the code as I am will have a
better view on this, but I seriously don't think that's such a big
issue. I think the max_standby_delay setting is quite intuitive and easy
to explain. Sure, it would be better if there was no tradeoff between
killing queries and stalling recovery, but I don't think it'll be that
hard to understand the tradeoff.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-08 Thread Bruce Momjian
Fujii Masao wrote:
> On Mon, Feb 8, 2010 at 10:34 PM, Bruce Momjian  wrote:
> > Ahh, good point.  I had not considered the table would change.  What I
> > did was to mark "Slaves accept read-only queries" as "Hot only".
> 
> Can the "warm standby" still reside in v9.0? If not, the mark of
> "Hot only" seems odd for me.

Yes, both hot and warm standby are supported in 9.0.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-08 Thread Fujii Masao
On Mon, Feb 8, 2010 at 10:34 PM, Bruce Momjian  wrote:
> Ahh, good point.  I had not considered the table would change.  What I
> did was to mark "Slaves accept read-only queries" as "Hot only".

Can the "warm standby" still reside in v9.0? If not, the mark of
"Hot only" seems odd for me.

> I did not change "Master failure will never lose data" because the 9.0
> streaming implementation is not sychronous (see wal_sender_delay in
> postgresql.conf), and I don't think even setting that to zero makes the
> operation synchronous.  I think we will have to wait for PG 9.1 for
> _synchronous_ streaming replication.

You are right.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-08 Thread Bruce Momjian
Markus Wanner wrote:
> Bruce,
> 
> Bruce Momjian wrote:
> > Ah, I now realize it only mentions "warm" standby, not "hot", so I just
> > updated the documentation to reflect that;  you can see it here:
> 
> Maybe the table below also needs an update, because unlike "Warm Standby 
> using PITR", a hot standby accepts read-only queries and can be 
> configured to not lose data on master failure.

Ahh, good point.  I had not considered the table would change.  What I
did was to mark "Slaves accept read-only queries" as "Hot only".  You
can see the result here:

http://momjian.us/tmp/pgsql/high-availability.html

I did not change "Master failure will never lose data" because the 9.0
streaming implementation is not synchronous (see wal_sender_delay in
postgresql.conf), and I don't think even setting that to zero makes the
operation synchronous.  I think we will have to wait for PG 9.1 for
_synchronous_ streaming replication.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-07 Thread David E. Wheeler
On Feb 7, 2010, at 12:35 PM, Josh Berkus wrote:

>> I've always thought this feature was misnamed and nothing has happened
>> to change my mind, but it's not clear whether I'm in the majority.
> 
> I'm afraid force of habit is more powerful than correctness on this one.
> It's going to be HS/SR whether that's perfectly correct or not.

What would be correct? I thought HS/SR were pretty correct (as long as no one 
confuses SR with synchronous replication!).

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-07 Thread Josh Berkus

> I've always thought this feature was misnamed and nothing has happened
> to change my mind, but it's not clear whether I'm in the majority.

I'm afraid force of habit is more powerful than correctness on this one.
 It's going to be HS/SR whether that's perfectly correct or not.

--Josh Berkus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and DROP DATABASE

2010-02-07 Thread Simon Riggs
On Sat, 2010-02-06 at 17:32 +0100, Andres Freund wrote:
> > 
> > So it seems at least the behavior is quite different from what the
> > docs stats. Am I missing something here?
> Its a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase
> 
> The line:
>   CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, 
> true);
> 
> has to be
>   CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_DATABASE, 
> true);

Well spotted, thanks for the report and the analysis.

The code for drop database worked when committed but it looks like the
re-factoring of the code broke it. Will fix.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-07 Thread Robert Haas
On Sun, Feb 7, 2010 at 4:41 AM, Markus Wanner  wrote:
> Bruce Momjian wrote:
>> Do we want to call the feature "hot standby"?  Is a read-only standby a
>> "standby" or a "slave"?
>
> I think hot standby is pretty much the term, now.

See here for the previous iteration of this discussion:

http://archives.postgresql.org/pgsql-hackers/2009-08/msg00870.php

I've always thought this feature was misnamed and nothing has happened
to change my mind, but it's not clear whether I'm in the majority.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-07 Thread Markus Wanner

Bruce,

Bruce Momjian wrote:

Ah, I now realize it only mentions "warm" standby, not "hot", so I just
updated the documentation to reflect that;  you can see it here:


Maybe the table below also needs an update, because unlike "Warm Standby 
using PITR", a hot standby accepts read-only queries and can be 
configured to not lose data on master failure.



Do we want to call the feature "hot standby"?  Is a read-only standby a
"standby" or a "slave"?


I think hot standby is pretty much the term, now.

Regards

Markus Wanner

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and DROP DATABASE

2010-02-06 Thread Andres Freund
On Saturday 06 February 2010 17:32:43 Andres Freund wrote:
> On Saturday 06 February 2010 02:25:33 Tatsuo Ishii wrote:
> > Hi,
> > 
> > While testing Hot Standby, I have encountered strange behavior with
> > DROP DATABASE command.
> > 
> > 1) connect to "test" database at standby via psql
> > 2) issue DROP DATABASE test command to primary
> > 3) session #1 works fine
> > 4) close session #1
> > 5) "test" database dropped on standby
> > 
> > From the manual:
> >  Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER
> >  DATABASE ... RENAME on primary will generate a log message that will
> >  cause all users connected to that database on the standby to be
> >  forcibly disconnected. This action occurs immediately, whatever the
> >  setting of max_standby_delay.
> > 
> > So it seems at least the behavior is quite different from what the
> > docs state. Am I missing something here?
> 
> Its a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase
> 
> The line:
>   CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, 
> true);
In case it's not clear: the reason that 
PROCSIG_RECOVERY_CONFLICT_TABLESPACE did not kill the session is that 
currently all tablespace conflicts are valid only inside a transaction, so when 
the backend receives the recovery conflict it checks that it is not inside a 
transaction block "anymore" and continues happily.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and DROP DATABASE

2010-02-06 Thread Andres Freund
On Saturday 06 February 2010 02:25:33 Tatsuo Ishii wrote:
> Hi,
> 
> While testing Hot Standby, I have encountered strange behavior with
> DROP DATABASE command.
> 
> 1) connect to "test" database at standby via psql
> 2) issue DROP DATABASE test command to primary
> 3) session #1 works fine
> 4) close session #1
> 5) "test" database dropped on standby
> 
> From the manual:
> 
>  Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER
>  DATABASE ... RENAME on primary will generate a log message that will
>  cause all users connected to that database on the standby to be
>  forcibly disconnected. This action occurs immediately, whatever the
>  setting of max_standby_delay.
> 
> So it seems at least the behavior is quite different from what the
> docs state. Am I missing something here?
Its a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase

The line:
CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, 
true);

has to be
CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_DATABASE, 
true);


Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby documentation

2010-02-05 Thread Bruce Momjian
Joshua Tolley wrote:
-- Start of PGP signed section.
> Having concluded I really need to start playing with hot standby, I started
> looking for documentation on the subject. I found what I was looking for; I
> also found this page[1], which, it seems, ought to mention hot standby.
> Comments?
> 
> [1] http://developer.postgresql.org/pgdocs/postgres/high-availability.html

Ah, I now realize it only mentions "warm" standby, not "hot", so I just
updated the documentation to reflect that;  you can see it here:

http://momjian.us/tmp/pgsql/high-availability.html

Warm and Hot Standby Using Point-In-Time Recovery (PITR)

Do we want to call the feature "hot standby"?  Is a read-only standby a
"standby" or a "slave"?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Tom Lane
Simon Riggs  writes:
> On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote:
>> I've concluded that that's too large a change to undertake for 9.0

> The purpose of this was to make the big changes in 9.0. If we aren't
> going to do that it seems like we shouldn't bother at all.

No, the purpose of this was to get rid of VACUUM FULL INPLACE in 9.0.
I'm not interested in destabilizing the code (even more) just to avoid
one small internal kluge.  The proposed magic field in struct Relation
is the only part of this that I'd foresee reverting later.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Simon Riggs
On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote:

> I've concluded that that's too large a change to undertake for 9.0

The purpose of this was to make the big changes in 9.0. If we aren't
going to do that it seems like we shouldn't bother at all.

So why not flip back to the easier approach of make something work for
HS only and then do everything you want to do in the next release? The
blast radius of the half-way changes you are proposing seems high in
comparison.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Bruce Momjian
Tom Lane wrote:
> I've been playing around with different alternatives for solving the
> problem of toast-pointer OIDs, but I keep coming back to the above as
> being the least invasive and most robust answer.  There are two basic
> ways that we could do it: pass the OID to use to the toast logic, which
> would require adding a parameter to heap_insert and a number of other
> places; or add a field to struct Relation that says "when inserting a
> TOAST pointer in this relation, use this OID as the toast-table OID
> value in the pointer, even if that's different from what the table's OID
> appears to be".  The latter seems like less of a notational change, so
> I'm leaning to that, but wanted to see if anyone prefers the other.
> 
> We could avoid this hackery if there were a way for Relation structs to
> point at either the old or the new physical relation (relfilenode);
> then we'd not need the transient "new heap" relation during CLUSTER/VF,
> which would be good for reducing catalog churn.  I've concluded that
> that's too large a change to undertake for 9.0, but it might be
> interesting to try in the future.  So I'd prefer that what we do for
> now touch as little code as possible so as to be easy to revert; hence
> I'm not wanting to change heap_insert's signature.

I don't think any of this affects pg_migrator, but if it does, please
let me know.  When I hear TOAST and OID used in the same sentence, my
ears perk up.  :-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Tom Lane
I wrote:
> * We can not change the toast rel OID of a shared catalog -- there's no
> way to propagate that into the other copies of pg_class.  So we need to
> rejigger the logic for heap rewriting a little bit.  Toast rel swapping
> has to be handled by swapping their relfilenodes not their OIDs.  This
> is no big deal as far as cluster.c itself is concerned, but the tricky
> part is that when we write new toasted values into the new toast rel,
> the TOAST pointers going into the new heap have to be written with the
> original toast-table OID value not the one that the transient target
> toast rel has got.  This is doable but it would uglify the TOAST API a
> bit I think.

I've been playing around with different alternatives for solving the
problem of toast-pointer OIDs, but I keep coming back to the above as
being the least invasive and most robust answer.  There are two basic
ways that we could do it: pass the OID to use to the toast logic, which
would require adding a parameter to heap_insert and a number of other
places; or add a field to struct Relation that says "when inserting a
TOAST pointer in this relation, use this OID as the toast-table OID
value in the pointer, even if that's different from what the table's OID
appears to be".  The latter seems like less of a notational change, so
I'm leaning to that, but wanted to see if anyone prefers the other.
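
To make the second option concrete, a stripped-down sketch (the field and
function names are invented; the real struct is of course much larger):

/* Stripped-down sketch of the "extra field in the Relation" option. */
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct ToyRelation
{
    Oid rel_oid;              /* the relation's own OID                     */
    Oid toast_oid;            /* OID of its (possibly transient) toast rel  */
    Oid toast_oid_override;   /* if set, stamp this OID into TOAST pointers */
} ToyRelation;

/* When writing a toasted value, decide which toast-table OID goes into the
 * TOAST pointer stored with the heap tuple. */
static Oid toast_pointer_oid(const ToyRelation *rel)
{
    return (rel->toast_oid_override != InvalidOid) ? rel->toast_oid_override
                                                   : rel->toast_oid;
}

int main(void)
{
    /* Transient new heap built by CLUSTER/VACUUM FULL: its own toast rel has
     * a new OID, but pointers must carry the original table's toast OID. */
    ToyRelation new_heap = { 90001, 90002, /* override = */ 16500 };

    printf("TOAST pointers get written with toast-table OID %u\n",
           toast_pointer_oid(&new_heap));
    return 0;
}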

We could avoid this hackery if there were a way for Relation structs to
point at either the old or the new physical relation (relfilenode);
then we'd not need the transient "new heap" relation during CLUSTER/VF,
which would be good for reducing catalog churn.  I've concluded that
that's too large a change to undertake for 9.0, but it might be
interesting to try in the future.  So I'd prefer that what we do for
now touch as little code as possible so as to be easy to revert; hence
I'm not wanting to change heap_insert's signature.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution

2010-02-03 Thread Simon Riggs
On Tue, 2010-02-02 at 20:27 +0200, Heikki Linnakangas wrote:

> > I'd appreciate it if you could review the relation-specific conflict
> > patch, 'cos it's still important.
> 
> One fundamental gripe I have about that approach is that it's hard to
> predict when you will be saved by the cache and when your query will be
> canceled. For example, the patch stores only one "latestRemovedXid"
> value per lock partition. So if you have two tables that hash to
> different lock partitions, and are never both accessed in a single
> transaction, the cache will save your query every time. So far so good,
> but then you do a dump/restore, and the tables happen to be assigned to
> the same lock partition. Oops, a system that used to work fine starts to
> get "snapshot too old" errors.
> 
> It's often better to be consistent and predictable, even if it means
> cancelling more queries. I think we'd need to have a much more
> fine-grained system before it's worthwhile to do deferred resolution.
> There's just too much "false sharing" otherwise.

ISTM that this is exactly backwards. There are already way too many false
positives and this patch would reduce them very significantly. Plus the
cancellation is hardly predictable since it relies on whether or not a
btree delete takes place during execution, and the arrival time and rate
of those are sporadic. There is no safe, predictable behaviour in the
current code.

The gripe about the cache cannot be a fundamental one, since we can
easily change the size and mechanism by which the cache operates without
changing the patch very much at all.

I am being told this area is a must-fix issue for this release. Tom's
reaction to this issue (on another thread) illustrates that beautifully:

On Sun, 2010-01-31 at 15:41 -0500, Tom Lane wrote: 
> Simon Riggs  writes:
(snip) 
> > 2. no matter if they haven't accessed the index being cleaned (they
> > might later, is the thinking...)
> 
> That seems seriously horrid.  What is the rationale for #2 in
> particular?  I would hope that at worst this would affect sessions
> that are actively competing for the index being cleaned.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution

2010-02-02 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Fri, 2010-01-29 at 15:01 +, Simon Riggs wrote:
> 
>> Putting it back takes time and
>> given enough of that rare cloth, it will eventually be put back.
> 
> Looks like I'll have time to add the starts-at-shutdown-checkpoint item
> back in after all.

Great! Thank you, much appreciated.

> I'd appreciate it if you could review the relation-specific conflict
> patch, 'cos it's still important.

One fundamental gripe I have about that approach is that it's hard to
predict when you will be saved by the cache and when your query will be
canceled. For example, the patch stores only one "latestRemovedXid"
value per lock partition. So if you have two tables that hash to
different lock partitions, and are never both accessed in a single
transaction, the cache will save your query every time. So far so good,
but then you do a dump/restore, and the tables happen to be assigned to
the same lock partition. Oops, a system that used to work fine starts to
get "snapshot too old" errors.

It's often better to be consistent and predictable, even if it means
cancelling more queries. I think we'd need to have a much more
fine-grained system before it's worthwhile to do deferred resolution.
There's just too much "false sharing" otherwise.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
I wrote:
> The design I sketched doesn't require such an assumption anyway.  Once
> the map file is written, the relocation is effective, commit or no.
> As long as we restrict relocations to maintenance operations such as
> VACUUM FULL, which have no transactionally significant results, this
> doesn't seem like a problem.  What we do need is that after a CLUSTER
> or V.F., which is going to relocate not only the rel but its indexes,
> the relocations of the rel and its indexes have to all "commit"
> atomically.  But saving up the transaction's map changes and applying
> them in one write takes care of that.

BTW, I noticed a couple of other issues that need to be dealt with to
make that safe.  During CLUSTER/V.F. we typically try to update the
relation's relfilenode, relpages, reltuples, relfrozenxid (in
setNewRelfilenode) as well as its toastrelid (in swap_relation_files).
These are regular transactional updates to the pg_class tuple that will
fail to commit if the outer transaction rolls back.  However:

* For a mapped relation, both the old and new relfilenode will be zero,
so it doesn't matter.

* Possibly losing the updates of relpages and reltuples is not critical.

* For relfrozenxid, we can simply force the new and old values to be the
same rather than hoping to advance the value, if we're dealing with a
mapped relation.  Or just let it be; I think that losing an advance
of relfrozenxid wouldn't be critical either.

* We can not change the toast rel OID of a shared catalog -- there's no
way to propagate that into the other copies of pg_class.  So we need to
rejigger the logic for heap rewriting a little bit.  Toast rel swapping
has to be handled by swapping their relfilenodes not their OIDs.  This
is no big deal as far as cluster.c itself is concerned, but the tricky
part is that when we write new toasted values into the new toast rel,
the TOAST pointers going into the new heap have to be written with the
original toast-table OID value not the one that the transient target
toast rel has got.  This is doable but it would uglify the TOAST API a
bit I think.  Another possibility is to treat the toast rel OID of a
catalog as something that can be supplied by the map file.  Not sure
which way to jump.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution

2010-02-01 Thread Simon Riggs
On Fri, 2010-01-29 at 15:01 +, Simon Riggs wrote:

> Putting it back takes time and
> given enough of that rare cloth, it will eventually be put back.

Looks like I'll have time to add the starts-at-shutdown-checkpoint item
back in after all.

I'd appreciate it if you could review the relation-specific conflict
patch, 'cos it's still important.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and deadlock detection

2010-02-01 Thread Simon Riggs
On Mon, 2010-02-01 at 17:50 +0200, Heikki Linnakangas wrote:

> Umm, so why not run the deadlock check immediately in
> max_standby_delay=-1 case as well? Why is that case handled differently
> from max_standby_delay>0 case?

Done, tested, working.

Will commit tomorrow if no further questions or comments.

-- 
 Simon Riggs   www.2ndQuadrant.com
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***
*** 272,277  procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 272,280 
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
  
+ 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
+ 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+ 
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
  
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***
*** 127,132  WaitExceedsMaxStandbyDelay(void)
--- 127,135 
  	long	delay_secs;
  	int		delay_usecs;
  
+ 	if (MaxStandbyDelay == -1)
+ 		return false;
+ 
  	/* Are we past max_standby_delay? */
  	TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(),
  		&delay_secs, &delay_usecs);
***
*** 351,365  ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
   * they hold one of the buffer pins that is blocking Startup process. If so,
   * backends will take an appropriate error action, ERROR or FATAL.
   *
!  * A secondary purpose of this is to avoid deadlocks that might occur between
!  * the Startup process and lock waiters. Deadlocks occur because if queries
   * wait on a lock, that must be behind an AccessExclusiveLock, which can only
!  * be clared if the Startup process replays a transaction completion record.
!  * If Startup process is waiting then that is a deadlock. If we allowed a
!  * setting of max_standby_delay that meant "wait forever" we would then need
!  * special code to protect against deadlock. Such deadlocks are rare, so the
!  * code would be almost certainly buggy, so we avoid both long waits and
!  * deadlocks using the same mechanism.
   */
  void
  ResolveRecoveryConflictWithBufferPin(void)
--- 354,368 
   * they hold one of the buffer pins that is blocking Startup process. If so,
   * backends will take an appropriate error action, ERROR or FATAL.
   *
!  * We also check for deadlocks before we wait, though applications that cause
!  * these will be extremely rare.  Deadlocks occur because if queries
   * wait on a lock, that must be behind an AccessExclusiveLock, which can only
!  * be cleared if the Startup process replays a transaction completion record.
!  * If Startup process is also waiting then that is a deadlock. The deadlock
!  * can occur if the query is waiting and then the Startup sleeps, or if
!  * Startup is sleeping and the query waits on a lock. We protect against
!  * only the former sequence here, the latter sequence is checked prior to
!  * the query sleeping, in CheckRecoveryConflictDeadlock().
   */
  void
  ResolveRecoveryConflictWithBufferPin(void)
***
*** 368,378  ResolveRecoveryConflictWithBufferPin(void)
  
  	Assert(InHotStandby);
  
- 	/*
- 	 * Signal immediately or set alarm for later.
- 	 */
  	if (MaxStandbyDelay == 0)
! 		SendRecoveryConflictWithBufferPin();
  	else
  	{
  		TimestampTz now;
--- 371,393 
  
  	Assert(InHotStandby);
  
  	if (MaxStandbyDelay == 0)
! 	{
! 		/*
! 		 * We don't want to wait, so just tell everybody holding the pin to 
! 		 * get out of town.
! 		 */
! 		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
! 	}
! 	else if (MaxStandbyDelay == -1)
! 	{
! 		/*
! 		 * Send out a request to check for buffer pin deadlocks before we wait.
! 		 * This is fairly cheap, so no need to wait for deadlock timeout before
! 		 * trying to send it out.
! 		 */
! 		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
! 	}
  	else
  	{
  		TimestampTz now;
***
*** 386,392  ResolveRecoveryConflictWithBufferPin(void)
  			&standby_delay_secs, &standby_delay_usecs);
  
  		if (standby_delay_secs >= MaxStandbyDelay)
! 			SendRecoveryConflictWithBufferPin();
  		else
  		{
  			TimestampTz fin_time;			/* Expected wake-up time by timer */
--- 401,412 
  			&standby_delay_secs, &standby_delay_usecs);
  
  		if (standby_delay_secs >= MaxStandbyDelay)
! 		{
! 			/*
! 			 * We're already behind, so clear a path as quickly as possible.
! 			 */
! 			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
! 		}
  		else
  		{
  			TimestampTz fin_time;			/* Expected wake-up time by timer */
***
*** 394,399  ResolveRecoveryConflictWithBufferPin(void)
--- 414,426 
  			int		timer_delay_usecs = 0;
  
  			/*
+ 			 * Send out a request to check for buffer pin deadlocks b

Re: [HACKERS] Hot Standby and deadlock detection

2010-02-01 Thread Simon Riggs
On Mon, 2010-02-01 at 17:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2010-02-01 at 09:40 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> The way this would work is if Startup waits on a buffer pin we
> >>> immediately send out a request to all backends to cancel themselves if
> >>> they are holding the buffer pin required && waiting on a lock. We then
> >>> sleep until max_standby_delay. When max_standby_delay = -1 we only sleep
> >>> until deadlock timeout and then check (on the Startup process).
> >> Should wake up to check for deadlocks after deadlock_timeout also when
> >> max_standby_delay > deadlock_timeout. max_standby_delay could be hours -
> >> we want to detect a deadlock sooner than that.
> > 
> > The patch does detect deadlocks sooner that that - "immediately", as
> > described above.
> 
> Umm, so why not run the deadlock check immediately in
> max_standby_delay=-1 case as well? Why is that case handled differently
> from max_standby_delay>0 case?

Cos the code to do that is easy.

I'll do the deadlock check immediately and make it even easier.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Simon Riggs
On Mon, 2010-02-01 at 10:27 -0500, Tom Lane wrote:
> Simon Riggs  writes:
> > On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:
> >> the assumption that the file is less than one disk block,
> >> it should be just as atomic as pg_control updates are.
> 
> > IIRC there were 173 relations affected by this. At 4 bytes each we would
> > have more than 512 bytes.
> 
> Where in the world did you get that number?
> 
> There are currently 29 shared relations (counting indexes), and 13
> nailed local relations, which would go into a different map file.
> I'm not sure if the set of local catalogs requiring the map treatment
> is exactly the same as what's presently nailed, but that's probably
> a good approximation.

I was suggesting that we only do shared and nailed relations. Sounds
like you agree.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and deadlock detection

2010-02-01 Thread Heikki Linnakangas
Simon Riggs wrote:
> On Mon, 2010-02-01 at 09:40 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> The way this would work is if Startup waits on a buffer pin we
>>> immediately send out a request to all backends to cancel themselves if
>>> they are holding the buffer pin required && waiting on a lock. We then
>>> sleep until max_standby_delay. When max_standby_delay = -1 we only sleep
>>> until deadlock timeout and then check (on the Startup process).
>> Should wake up to check for deadlocks after deadlock_timeout also when
>> max_standby_delay > deadlock_timeout. max_standby_delay could be hours -
>> we want to detect a deadlock sooner than that.
> 
> The patch does detect deadlocks sooner that that - "immediately", as
> described above.

Umm, so why not run the deadlock check immediately in
max_standby_delay=-1 case as well? Why is that case handled differently
from max_standby_delay>0 case?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
Heikki Linnakangas  writes:
> Tom Lane wrote:
>> The other problem with what you sketch is that it'd require holding the
>> mapfile write lock across commit, because we still have to have strict
>> serialization of updates.

> Why is the strict serialization of updates needed? To avoid overwriting
> the file with stale contents in a race condition?

Exactly.

> I was thinking that we only store the modified part in the WAL record.
> Right after writing commit record, take the lock, read() the map file,
> modify it in memory, write() it back, and release lock.

> That means that there's no full images of the file in WAL records, which
> makes me slightly uncomfortable from a disaster recovery point-of-view,

Yeah, me too, which is another advantage of using a separate WAL entry.

> That doesn't solve the problem I was trying to solve, which is that if
> the map file is rewritten, but the transaction later aborts, the map
> file update has hit the disk already. That's why I wanted to stash it
> into the commit record.

The design I sketched doesn't require such an assumption anyway.  Once
the map file is written, the relocation is effective, commit or no.
As long as we restrict relocations to maintenance operations such as
VACUUM FULL, which have no transactionally significant results, this
doesn't seem like a problem.  What we do need is that after a CLUSTER
or V.F., which is going to relocate not only the rel but its indexes,
the relocations of the rel and its indexes have to all "commit"
atomically.  But saving up the transaction's map changes and applying
them in one write takes care of that.
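
A toy of the "save them up and apply in one write" part (invented names, an
in-memory array standing in for the map file, and the OIDs are only
illustrative):

/* Toy of batching relation-map changes so a rel and its indexes move
 * atomically. */
#include <stdio.h>

#define MAP_SIZE 4

typedef unsigned int Oid;

typedef struct { Oid oid; Oid relfilenode; } MapEntry;

static MapEntry map[MAP_SIZE] = {
    { 1259, 11000 },   /* pg_class            */
    { 2662, 11001 },   /* pg_class OID index  */
    { 2663, 11002 },   /* pg_class name index */
    { 1262, 11003 },   /* pg_database         */
};

/* Changes collected during CLUSTER / VACUUM FULL, not yet visible. */
static MapEntry pending[MAP_SIZE];
static int npending = 0;

static void map_update_pending(Oid oid, Oid newnode)
{
    pending[npending++] = (MapEntry){ oid, newnode };
}

/* "Commit": every accumulated change hits the map in one go, so readers
 * never see the table moved but its indexes left behind. */
static void map_apply_pending(void)
{
    for (int i = 0; i < npending; i++)
        for (int j = 0; j < MAP_SIZE; j++)
            if (map[j].oid == pending[i].oid)
                map[j].relfilenode = pending[i].relfilenode;
    npending = 0;
}

int main(void)
{
    map_update_pending(1259, 12000);   /* rewrite pg_class ...          */
    map_update_pending(2662, 12001);   /* ... and its indexes with it   */
    map_update_pending(2663, 12002);

    map_apply_pending();               /* one atomic switch at "commit" */

    for (int i = 0; i < MAP_SIZE; i++)
        printf("oid %u -> relfilenode %u\n", map[i].oid, map[i].relfilenode);
    return 0;
}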

(What I believe this means is that pg_class and its indexes have to all
be mapped, but I'm thinking right now that no other non-shared catalogs
need the treatment.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Heikki Linnakangas
Tom Lane wrote:
> That seems too fragile to me, as I don't find it a stretch at all to
> think that writing the map file might fail --- just think Windows
> antivirus code :-(.  Now, once we have written the WAL record for
> the mapfile change, we can't really afford a failure in my approach
> either.  But I think a rename() after successfully creating/writing/
> fsync'ing a temp file is a whole lot safer than writing from a standing
> start.

My gut feeling is exactly opposite. Creating and renaming a file
involves operations (and permissions) on the directory, while
overwriting a small file is just a simple write(). Especially if you
open() the file before doing anything irreversible.

> The other problem with what you sketch is that it'd require holding the
> mapfile write lock across commit, because we still have to have strict
> serialization of updates.

Why is the strict serialization of updates needed? To avoid overwriting
the file with stale contents in a race condition?

I was thinking that we only store the modified part in the WAL record.
Right after writing commit record, take the lock, read() the map file,
modify it in memory, write() it back, and release lock.
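
Roughly this sequence, as a standalone toy (invented file name and layout,
error handling mostly trimmed; the lock and the preceding WAL write appear
only as comments):

/* Toy of the read-modify-write sequence on a tiny map file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAP_FILE   "toy_relmap"
#define MAP_BYTES  512                /* keep the file within one sector */

static void update_map_entry(size_t offset, unsigned int new_relfilenode)
{
    char buf[MAP_BYTES] = { 0 };
    int  fd;

    /* ...the WAL record describing this change is written and synced first,
     * and the map lock is taken here... */

    fd = open(MAP_FILE, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return; }

    (void) pread(fd, buf, sizeof(buf), 0);     /* read current contents */
    memcpy(buf + offset, &new_relfilenode,     /* modify just our entry */
           sizeof(new_relfilenode));
    (void) pwrite(fd, buf, sizeof(buf), 0);    /* write the whole block */
    (void) fsync(fd);                          /* make it durable       */
    close(fd);

    /* ...and the lock is released here. */
}

int main(void)
{
    update_map_entry(16, 12345);
    printf("map entry updated in place\n");
    return 0;
}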

That means that there's no full images of the file in WAL records, which
makes me slightly uncomfortable from a disaster recovery point-of-view,
but we could keep a backup copy of the file in the data directory or
something if that's too scary otherwise.

> Maybe we should forget the
> rename() trick and overwrite the map file in place.  I still think it
> needs to be a separate WAL record though.  I'm thinking
> 
>   * obtain lock
>   * open file for read/write
>   * read current contents
>   * construct modified contents
>   * write and sync WAL record
>   * write back file through already-opened descriptor
>   * fsync
>   * release lock
> 
> Not totally clear if this is more or less safe than the rename method;
> but given the assumption that the file is less than one disk block,
> it should be just as atomic as pg_control updates are.

That doesn't solve the problem I was trying to solve, which is that if
the map file is rewritten, but the transaction later aborts, the map
file update has hit the disk already. That's why I wanted to stash it
into the commit record.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
Simon Riggs  writes:
> On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:
>> the assumption that the file is less than one disk block,
>> it should be just as atomic as pg_control updates are.

> IIRC there were 173 relations affected by this. 4 bytes each we would
> have more than 512 bytes.

Where in the world did you get that number?

There are currently 29 shared relations (counting indexes), and 13
nailed local relations, which would go into a different map file.
I'm not sure if the set of local catalogs requiring the map treatment
is exactly the same as what's presently nailed, but that's probably
a good approximation.

At 8 bytes each (OID + relfilenode), we could fit 64 map entries in
a standard disk sector --- let's say 63 so there's room for a header.
So, barring more-than-doubling of the number of shared catalogs,
there's not going to be a problem.
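
A back-of-the-envelope check of that sizing (the layout here is invented,
it just shows the arithmetic):

/* 8-byte entries, one 512-byte sector, a little room left for a header. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512
#define MAX_ENTRIES 63                /* 63 * 8 = 504, leaving 8 header bytes */

typedef struct { uint32_t oid; uint32_t relfilenode; } MapEntry;

typedef struct
{
    uint64_t magic;                   /* stand-in for a small header */
    MapEntry entries[MAX_ENTRIES];
} MapFile;

int main(void)
{
    static_assert(sizeof(MapEntry) == 8, "entry must stay 8 bytes");
    static_assert(sizeof(MapFile) <= SECTOR_SIZE, "map must fit in one sector");

    printf("%d entries + header = %zu bytes (sector is %d)\n",
           MAX_ENTRIES, sizeof(MapFile), SECTOR_SIZE);
    return 0;
}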

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

