Re: [HACKERS] Hot Standby b-tree delete records review
On Tue, 2010-11-09 at 13:34 +0200, Heikki Linnakangas wrote: > (cleaning up my inbox, and bumped into this..) > > On 22.04.2010 12:31, Simon Riggs wrote: > > On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote: > >> Simon Riggs wrote: > >>> On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote: > >>> > If none of the removed heap tuples were present anymore, we currently > return InvalidTransactionId, which kills/waits out all read-only > queries. But if none of the tuples were present anymore, the > read-only > queries wouldn't have seen them anyway, so ISTM that we should treat > InvalidTransactionId return value as "we don't need to kill anyone". > >>> That's not the point. The tuples were not themselves the sole focus, > >> Yes, they were. We're replaying a b-tree deletion record, which removes > >> pointers to some heap tuples, making them unreachable to any read-only > >> queries. If any of them still need to be visible to read-only queries, > >> we have a conflict. But if all of the heap tuples are gone already, > >> removing the index pointers to them can'ẗ change the situation for any > >> query. If any of them should've been visible to a query, the damage was > >> done already by whoever pruned the heap tuples leaving just the > >> tombstone LP_DEAD item pointers (in the heap) behind. > > You're missing my point. Those tuples are indicators of what may lie > > elsewhere in the database, completely unreferenced by this WAL record. > > Just because these referenced tuples are gone doesn't imply that all > > tuple versions written by the as yet-unknown-xids are also gone. We > > can't infer anything about the whole database just from one small group > > of records. > Have you got an example of that? > >>> > >>> I don't need one, I have suggested the safe route. In order to infer > >>> anything, and thereby further optimise things, we would need proof that > >>> no cases can exist, which I don't have. Perhaps we can add "yet", not > >>> sure about that either. > >> > >> It's good to be safe rather than sorry, but I'd still like to know > >> because I'm quite surprised by that, and got me worried that I don't > >> understand how hot standby works as well as I thought I did. I thought > >> the point of stopping replay/killing queries at a b-tree deletion record > >> is precisely that it makes some heap tuples invisible to running > >> read-only queries. If it doesn't make any tuples invisible, why do any > >> queries need to be killed? And why was it OK for them to be running just > >> before replaying the b-tree deletion record? > > > > I'm sorry but I'm too busy to talk further on this today. Since we are > > discussing a further optimisation rather than a bug, I hope it is OK to > > come back to this again later. > > Would now be a good time to revisit this? I still don't see why a b-tree > deletion record should conflict with anything, if all the removed index > tuples point to just LP_DEAD tombstones in the heap. I want what you say to be true. The question is: is it? We just need to explain why that will never be a problem. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
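To make the disputed quantity concrete, here is a simplified sketch of how latestRemovedXid is derived from a b-tree delete record during replay. This is not the 9.0 source verbatim: field and helper names follow the 9.0-era code discussed in this thread, but locking and some error handling are trimmed. The key point is that only heap tuples that are still present can advance the value; LP_DEAD tombstones and already-recycled slots contribute nothing, which is why an all-LP_DEAD record yields InvalidTransactionId.

#include "postgres.h"
#include "access/heapam.h"      /* HeapTupleHeaderAdvanceLatestRemovedXid */
#include "access/nbtree.h"      /* xl_btree_delete, SizeOfBtreeDelete */
#include "access/xlogutils.h"   /* XLogReadBuffer */
#include "storage/bufmgr.h"

static TransactionId
get_latest_removed_xid_sketch(XLogRecord *record)
{
    xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
    OffsetNumber *deleted = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
    TransactionId latestRemovedXid = InvalidTransactionId;
    Buffer      ibuffer;
    Page        ipage;
    int         i;

    /* Read the index page named by the WAL record. */
    ibuffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
    if (!BufferIsValid(ibuffer))
        return InvalidTransactionId;
    ipage = (Page) BufferGetPage(ibuffer);

    for (i = 0; i < xlrec->nitems; i++)
    {
        /* Follow each to-be-deleted index tuple to the heap item it points at. */
        ItemId      iitemid = PageGetItemId(ipage, deleted[i]);
        IndexTuple  itup = (IndexTuple) PageGetItem(ipage, iitemid);
        BlockNumber hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
        OffsetNumber hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
        Buffer      hbuffer = XLogReadBuffer(xlrec->hnode, hblkno, false);
        Page        hpage;
        ItemId      hitemid;

        if (!BufferIsValid(hbuffer))
            continue;
        hpage = (Page) BufferGetPage(hbuffer);
        hitemid = PageGetItemId(hpage, hoffnum);

        /*
         * Only a still-present tuple (LP_NORMAL) carries xmin/xmax values
         * that can push latestRemovedXid forward.  LP_DEAD tombstones and
         * recycled slots contribute nothing.
         */
        if (ItemIdIsNormal(hitemid))
        {
            HeapTupleHeader htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);

            HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
        }
        UnlockReleaseBuffer(hbuffer);
    }

    UnlockReleaseBuffer(ibuffer);
    return latestRemovedXid;
}

Seen this way, an all-LP_DEAD delete record removed nothing that was still visible; the open question in the thread is whether the xids behind those tuples can still matter for other tuples elsewhere in the database.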
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs wrote: > Simple tuning of btree_xlog_vacuum() using an idea I had a while back, > just never implemented. XXX comments removed. > > Allows us to avoid reading in blocks during VACUUM replay that are only > required for correctness of index scans. Review: 1. The block comment in XLogConfirmBufferIsUnpinned appears to be copied from somewhere else, and doesn't really seem appropriate for a new function since it refers to "the original coding of this routine". I think you could just delete the parenthesized portion of the comment. 2. This bit from ConfirmBufferIsUnpinned looks odd to me. + /* +* Found it. Now, pin/unpin the buffer to prove it's unpinned. +*/ + if (PinBuffer(buf, NULL)) + UnpinBuffer(buf, false); I don't think pinning and unpinning the buffer is sufficient to provide that it isn't otherwise pinned. If the buffer isn't in shared buffers at all, it seems clear that no one can have it pinned. But if it's present in shared buffers, it seems like you still need LockBufferForCleanup(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
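On the second point, a minimal sketch of what proving "nobody else has this page pinned" would have to look like instead. The function name is hypothetical, and the real redo path would work from a RelFileNode and block number rather than a Relation; the point is only that a cleanup lock, not a private pin/unpin, is what demonstrates the absence of other pins.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical illustration, not the patch under review: our own pin and
 * unpin say nothing about other backends.  LockBufferForCleanup() only
 * returns once we are the sole pin holder, so surviving it is what actually
 * shows the page was unpinned by everyone else at that instant.
 */
static void
confirm_buffer_is_unpinned_sketch(Relation rel, BlockNumber blkno)
{
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, NULL);

    LockBufferForCleanup(buf);  /* blocks until no other backend holds a pin */
    UnlockReleaseBuffer(buf);   /* drop the cleanup lock and our own pin */
}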
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
On Mon, 2010-05-17 at 16:10 -0400, Tom Lane wrote: > Jim Nasby writes: > > On Apr 29, 2010, at 3:20 PM, Tom Lane wrote: > >> This is not the time to be hacking stuff like this. You haven't even > >> demonstrated that there's a significant performance issue here. > > > I tend to agree that this point of the cycle isn't a good one to be making > > changes, but your performance statement confuses me. If a fairly small > > patch means we can avoid un-necessary reads why shouldn't we avoid them? > > Well, by "time of the cycle" I meant "the day before beta1". I'm not > necessarily averse to making such a change at some point when it would > get more than no testing before hitting our long-suffering beta testers. > But I'd still want to see some evidence that there's a significant > performance improvement to be had. That patch only applies to one record type. However, since we've used Greg's design of spidering out to each heap record that can usually mean 150-200 random I/Os per btree delete. That will take some time, perhaps 1s per WAL record of this type on a very large I/O bound table. That's enough to give me cause for concern without performance measurements. To derive such measurements we'd need to instrument each record type, which we don't do right now either. It might be easier to have a look at the patch and see if you think its worth the fuss of measuring it. I don't think this is the patch that will correct the potential/ partially observed context switching issue, but we have yet to recreate that in lab conditions. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
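For rough scale (an assumption, not a measurement): at a typical 5-8 ms per random read on rotating disks, 150-200 extra reads works out to roughly 175 x 6 ms, or about 1 s per record, consistent with the estimate above.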
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
Jim Nasby writes: > On Apr 29, 2010, at 3:20 PM, Tom Lane wrote: >> This is not the time to be hacking stuff like this. You haven't even >> demonstrated that there's a significant performance issue here. > I tend to agree that this point of the cycle isn't a good one to be making > changes, but your performance statement confuses me. If a fairly small patch > means we can avoid un-necessary reads why shouldn't we avoid them? Well, by "time of the cycle" I meant "the day before beta1". I'm not necessarily averse to making such a change at some point when it would get more than no testing before hitting our long-suffering beta testers. But I'd still want to see some evidence that there's a significant performance improvement to be had. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
On Apr 29, 2010, at 3:20 PM, Tom Lane wrote: > Simon Riggs writes: >> Objections to commit? > > This is not the time to be hacking stuff like this. You haven't even > demonstrated that there's a significant performance issue here. I tend to agree that this point of the cycle isn't a good one to be making changes, but your performance statement confuses me. If a fairly small patch means we can avoid un-necessary reads why shouldn't we avoid them? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()
Simon Riggs writes: > Objections to commit? This is not the time to be hacking stuff like this. You haven't even demonstrated that there's a significant performance issue here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote: > > > >> If none of the removed heap tuples were present anymore, we currently > >> return InvalidTransactionId, which kills/waits out all read-only > >> queries. But if none of the tuples were present anymore, the read-only > >> queries wouldn't have seen them anyway, so ISTM that we should treat > >> InvalidTransactionId return value as "we don't need to kill anyone". > > That's not the point. The tuples were not themselves the sole focus, > Yes, they were. We're replaying a b-tree deletion record, which removes > pointers to some heap tuples, making them unreachable to any read-only > queries. If any of them still need to be visible to read-only queries, > we have a conflict. But if all of the heap tuples are gone already, > removing the index pointers to them can'ẗ change the situation for any > query. If any of them should've been visible to a query, the damage was > done already by whoever pruned the heap tuples leaving just the > tombstone LP_DEAD item pointers (in the heap) behind. > >>> You're missing my point. Those tuples are indicators of what may lie > >>> elsewhere in the database, completely unreferenced by this WAL record. > >>> Just because these referenced tuples are gone doesn't imply that all > >>> tuple versions written by the as yet-unknown-xids are also gone. We > >>> can't infer anything about the whole database just from one small group > >>> of records. > >> Have you got an example of that? > > > > I don't need one, I have suggested the safe route. In order to infer > > anything, and thereby further optimise things, we would need proof that > > no cases can exist, which I don't have. Perhaps we can add "yet", not > > sure about that either. > > It's good to be safe rather than sorry, but I'd still like to know > because I'm quite surprised by that, and got me worried that I don't > understand how hot standby works as well as I thought I did. I thought > the point of stopping replay/killing queries at a b-tree deletion record > is precisely that it makes some heap tuples invisible to running > read-only queries. If it doesn't make any tuples invisible, why do any > queries need to be killed? And why was it OK for them to be running just > before replaying the b-tree deletion record? I'm sorry but I'm too busy to talk further on this today. Since we are discussing a further optimisation rather than a bug, I hope it is OK to come back to this again later. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
Simon Riggs wrote: > On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote: > >> If none of the removed heap tuples were present anymore, we currently >> return InvalidTransactionId, which kills/waits out all read-only >> queries. But if none of the tuples were present anymore, the read-only >> queries wouldn't have seen them anyway, so ISTM that we should treat >> InvalidTransactionId return value as "we don't need to kill anyone". > That's not the point. The tuples were not themselves the sole focus, Yes, they were. We're replaying a b-tree deletion record, which removes pointers to some heap tuples, making them unreachable to any read-only queries. If any of them still need to be visible to read-only queries, we have a conflict. But if all of the heap tuples are gone already, removing the index pointers to them can'ẗ change the situation for any query. If any of them should've been visible to a query, the damage was done already by whoever pruned the heap tuples leaving just the tombstone LP_DEAD item pointers (in the heap) behind. >>> You're missing my point. Those tuples are indicators of what may lie >>> elsewhere in the database, completely unreferenced by this WAL record. >>> Just because these referenced tuples are gone doesn't imply that all >>> tuple versions written by the as yet-unknown-xids are also gone. We >>> can't infer anything about the whole database just from one small group >>> of records. >> Have you got an example of that? > > I don't need one, I have suggested the safe route. In order to infer > anything, and thereby further optimise things, we would need proof that > no cases can exist, which I don't have. Perhaps we can add "yet", not > sure about that either. It's good to be safe rather than sorry, but I'd still like to know because I'm quite surprised by that, and got me worried that I don't understand how hot standby works as well as I thought I did. I thought the point of stopping replay/killing queries at a b-tree deletion record is precisely that it makes some heap tuples invisible to running read-only queries. If it doesn't make any tuples invisible, why do any queries need to be killed? And why was it OK for them to be running just before replaying the b-tree deletion record? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote: > If none of the removed heap tuples were present anymore, we currently > return InvalidTransactionId, which kills/waits out all read-only > queries. But if none of the tuples were present anymore, the read-only > queries wouldn't have seen them anyway, so ISTM that we should treat > InvalidTransactionId return value as "we don't need to kill anyone". > >>> That's not the point. The tuples were not themselves the sole focus, > >> Yes, they were. We're replaying a b-tree deletion record, which removes > >> pointers to some heap tuples, making them unreachable to any read-only > >> queries. If any of them still need to be visible to read-only queries, > >> we have a conflict. But if all of the heap tuples are gone already, > >> removing the index pointers to them can'ẗ change the situation for any > >> query. If any of them should've been visible to a query, the damage was > >> done already by whoever pruned the heap tuples leaving just the > >> tombstone LP_DEAD item pointers (in the heap) behind. > > > > You're missing my point. Those tuples are indicators of what may lie > > elsewhere in the database, completely unreferenced by this WAL record. > > Just because these referenced tuples are gone doesn't imply that all > > tuple versions written by the as yet-unknown-xids are also gone. We > > can't infer anything about the whole database just from one small group > > of records. > > Have you got an example of that? I don't need one, I have suggested the safe route. In order to infer anything, and thereby further optimise things, we would need proof that no cases can exist, which I don't have. Perhaps we can add "yet", not sure about that either. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
Simon Riggs wrote: > On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote: btree_redo: > /* >* Note that if all heap tuples were LP_DEAD then we will be >* returning InvalidTransactionId here. This seems very unlikely >* in practice. >*/ If none of the removed heap tuples were present anymore, we currently return InvalidTransactionId, which kills/waits out all read-only queries. But if none of the tuples were present anymore, the read-only queries wouldn't have seen them anyway, so ISTM that we should treat InvalidTransactionId return value as "we don't need to kill anyone". >>> That's not the point. The tuples were not themselves the sole focus, >> Yes, they were. We're replaying a b-tree deletion record, which removes >> pointers to some heap tuples, making them unreachable to any read-only >> queries. If any of them still need to be visible to read-only queries, >> we have a conflict. But if all of the heap tuples are gone already, >> removing the index pointers to them can'ẗ change the situation for any >> query. If any of them should've been visible to a query, the damage was >> done already by whoever pruned the heap tuples leaving just the >> tombstone LP_DEAD item pointers (in the heap) behind. > > You're missing my point. Those tuples are indicators of what may lie > elsewhere in the database, completely unreferenced by this WAL record. > Just because these referenced tuples are gone doesn't imply that all > tuple versions written by the as yet-unknown-xids are also gone. We > can't infer anything about the whole database just from one small group > of records. Have you got an example of that? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
On Thu, 2010-04-22 at 11:28 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote: > >> btree_redo: > >>> /* > >>>* Note that if all heap tuples were LP_DEAD then we will be > >>>* returning InvalidTransactionId here. This seems very unlikely > >>>* in practice. > >>>*/ > >> If none of the removed heap tuples were present anymore, we currently > >> return InvalidTransactionId, which kills/waits out all read-only > >> queries. But if none of the tuples were present anymore, the read-only > >> queries wouldn't have seen them anyway, so ISTM that we should treat > >> InvalidTransactionId return value as "we don't need to kill anyone". > > > > That's not the point. The tuples were not themselves the sole focus, > > Yes, they were. We're replaying a b-tree deletion record, which removes > pointers to some heap tuples, making them unreachable to any read-only > queries. If any of them still need to be visible to read-only queries, > we have a conflict. But if all of the heap tuples are gone already, > removing the index pointers to them can'ẗ change the situation for any > query. If any of them should've been visible to a query, the damage was > done already by whoever pruned the heap tuples leaving just the > tombstone LP_DEAD item pointers (in the heap) behind. You're missing my point. Those tuples are indicators of what may lie elsewhere in the database, completely unreferenced by this WAL record. Just because these referenced tuples are gone doesn't imply that all tuple versions written by the as yet-unknown-xids are also gone. We can't infer anything about the whole database just from one small group of records. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
Simon Riggs wrote: > On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote: >> btree_redo: >>> /* >>> * Note that if all heap tuples were LP_DEAD then we will be >>> * returning InvalidTransactionId here. This seems very unlikely >>> * in practice. >>> */ >> If none of the removed heap tuples were present anymore, we currently >> return InvalidTransactionId, which kills/waits out all read-only >> queries. But if none of the tuples were present anymore, the read-only >> queries wouldn't have seen them anyway, so ISTM that we should treat >> InvalidTransactionId return value as "we don't need to kill anyone". > > That's not the point. The tuples were not themselves the sole focus, Yes, they were. We're replaying a b-tree deletion record, which removes pointers to some heap tuples, making them unreachable to any read-only queries. If any of them still need to be visible to read-only queries, we have a conflict. But if all of the heap tuples are gone already, removing the index pointers to them can't change the situation for any query. If any of them should've been visible to a query, the damage was done already by whoever pruned the heap tuples leaving just the tombstone LP_DEAD item pointers (in the heap) behind. Or do we use the latestRemovedXid value for something else as well? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby b-tree delete records review
On Thu, 2010-04-22 at 10:24 +0300, Heikki Linnakangas wrote: > btree_redo: > > case XLOG_BTREE_DELETE: > > > > /* > > * Btree delete records can conflict with standby queries. You > > * might think that vacuum records would conflict as well, but > > * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records > > * provide the highest xid cleaned by the vacuum of the heap > > * and so we can resolve any conflicts just once when that > > * arrives. After that any we know that no conflicts exist > > * from individual btree vacuum records on that index. > > */ > > { > > TransactionId latestRemovedXid = > > btree_xlog_delete_get_latestRemovedXid(record); > > xl_btree_delete *xlrec = (xl_btree_delete *) > > XLogRecGetData(record); > > > > /* > > * XXX Currently we put everybody on death row, because > > * currently _bt_delitems() supplies > > InvalidTransactionId. > > * This can be fairly painful, so providing a better > > value > > * here is worth some thought and possibly some effort > > to > > * improve. > > */ > > ResolveRecoveryConflictWithSnapshot(latestRemovedXid, > > xlrec->node); > > } > > break; > > The XXX comment is out-of-date, the latestRemovedXid value is calculated > by btree_xlog_delete_get_latestRemovedXid() nowadays. Removed, thanks. > If we're re-replaying the WAL record, for example after restarting the > standby server, btree_xlog_delete_get_latestRemovedXid() won't find the > deleted records and will return InvalidTransactionId. That's OK, because > until we reach again the point in WAL where we were before the restart, > we don't accept read-only connections so there's no-one to kill anyway, > but you do get a useless "Invalid latestRemovedXid reported, using > latestCompletedXid instead" message in the log (that shouldn't be > capitalized, BTW). > It would be nice to check if there's any potentially conflicting > read-only queries before calling > btree_xlog_delete_get_latestRemovedXid(), which can be quite expensive. Good idea. You're welcome to add such tuning yourself, if you like. > If the "Invalid latestRemovedXid reported, using latestCompletedXid > instead" message is going to happen commonly, I think it should be > downgraded to DEBUG1. If it's an unexpected scenario, it should be > upgraded to WARNING. Set to DEBUG because the above optimisation makes it return invalid much more frequently, which we don't want reported. > In btree_xlog_delete_get_latestRemovedXid: > > Assert(num_unused == 0); > > Can't that happen as well in a re-replay scenario, if a heap item was > vacuumed away later on? OK, will remove. The re-replay gets me every time. > > /* > > * Note that if all heap tuples were LP_DEAD then we will be > > * returning InvalidTransactionId here. This seems very unlikely > > * in practice. > > */ > > If none of the removed heap tuples were present anymore, we currently > return InvalidTransactionId, which kills/waits out all read-only > queries. But if none of the tuples were present anymore, the read-only > queries wouldn't have seen them anyway, so ISTM that we should treat > InvalidTransactionId return value as "we don't need to kill anyone". That's not the point. The tuples were not themselves the sole focus, they indicated the latestRemovedXid of the backend on the primary that had performed the deletion. So even if those tuples are no longer present there may be others with similar xids that would conflict, so we cannot ignore. Comment updated. 
> Why does btree_xlog_delete_get_latestRemovedXid() keep the > num_unused/num_dead/num_redirect counts, it doesn't actually do anything > with them. Probably a debug tool. Removed. Changes committed. Thanks for the review. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
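To make the suggested tuning concrete, this is roughly what a fast path could look like in the btree_redo() arm quoted above. CountDBBackends() is assumed here as the cheap "is anyone connected?" test; this is a sketch of the idea, not the committed change.

    case XLOG_BTREE_DELETE:
        {
            xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
            TransactionId latestRemovedXid = InvalidTransactionId;

            /*
             * Deriving latestRemovedXid may visit every heap page the
             * deleted index tuples point at, so skip it when no read-only
             * backends are connected.  InvalidTransactionId then
             * "conflicts" with every query, but that set is empty, so
             * nothing is cancelled.
             */
            if (CountDBBackends(InvalidOid) > 0)
                latestRemovedXid =
                    btree_xlog_delete_get_latestRemovedXid(record);

            ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
        }
        break;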
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Tue, 2010-04-13 at 17:18 +0300, Heikki Linnakangas wrote: > I've reviewed your changes and they look correct to me; the main chunk > > of code is mine and that was tested by me. > > Ok, committed after fixing an obsoleted comment & other small > editorialization. Looks good, thanks. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
Simon Riggs wrote: > On Thu, 2010-04-08 at 19:02 +0300, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> OK, that seems better. I'm happy with that instead. >>> >>> Have you tested this? Is it ready to commit? >> Only very briefly. I think the code is ready, but please review and test >> to see I didn't miss anything. > > I'm going to need you to commit this. I'm on holiday now until 14 April, > so its not going to get a retest before then otherwise; its not smart to > commit and then go on holiday, IIRC. :-) > I've reviewed your changes and they look correct to me; the main chunk > of code is mine and that was tested by me. Ok, committed after fixing an obsoleted comment & other small editorialization. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Thu, 2010-04-08 at 19:02 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > OK, that seems better. I'm happy with that instead. > > > > Have you tested this? Is it ready to commit? > > Only very briefly. I think the code is ready, but please review and test > to see I didn't miss anything. I'm going to need you to commit this. I'm on holiday now until 14 April, so its not going to get a retest before then otherwise; its not smart to commit and then go on holiday, IIRC. I've reviewed your changes and they look correct to me; the main chunk of code is mine and that was tested by me. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
Simon Riggs wrote: > OK, that seems better. I'm happy with that instead. > > Have you tested this? Is it ready to commit? Only very briefly. I think the code is ready, but please review and test to see I didn't miss anything. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Thu, 2010-04-08 at 18:35 +0300, Heikki Linnakangas wrote: > > > So I have introduced the new mode ("snapshot mode") to enter hot > standby > > anyway. That avoids us having to screw around with the loop logic > for > > redo. I don't see any need to support the case of where we have no > WAL > > source defined, yet we want Hot Standby but we also want to allow > > somebody to drop a WAL file into pg_xlog at some future point. That > has > > no use case of value AFAICS and is too complex to add at this stage > of > > the release cycle. > > You don't need a new mode for that. Just do the same "are we > consistent now?" check you do in the loop once before calling > ReadRecord to fetch the record that follows the checkpoint pointer. > Attached is a patch to show what I mean. We just need to let > postmaster know that recovery has started a bit earlier, right after > processing the checkpoint record, not delaying it until we've read the > first record after it. OK, that seems better. I'm happy with that instead. Have you tested this? Is it ready to commit? -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
Simon Riggs wrote: > In StartupXlog() when we get to the point where we "Find the first > record that logically follows the checkpoint", in the current code > ReadRecord() loops forever, spitting out > LOG: record with zero length at 0/C88 > ... > > That prevents us from going further down StartupXLog() to the point > where we start the InRedo loop and hence start hot standby. As long as > we retry we cannot progress further: this is the main problem. > > So in the patch, I have modified the retry test in ReadRecord() so it no > longer retries iff there is no WAL source defined. Now, when > ReadRecord() exits, record == NULL at that point and so we do not (and > cannot) enter the redo loop. Oh, I see. > So I have introduced the new mode ("snapshot mode") to enter hot standby > anyway. That avoids us having to screw around with the loop logic for > redo. I don't see any need to support the case of where we have no WAL > source defined, yet we want Hot Standby but we also want to allow > somebody to drop a WAL file into pg_xlog at some future point. That has > no use case of value AFAICS and is too complex to add at this stage of > the release cycle. You don't need a new mode for that. Just do the same "are we consistent now?" check you do in the loop once before calling ReadRecord to fetch the record that follows the checkpoint pointer. Attached is a patch to show what I mean. We just need to let postmaster know that recovery has started a bit earlier, right after processing the checkpoint record, not delaying it until we've read the first record after it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/access/transam/twophase.c --- b/src/backend/access/transam/twophase.c *** *** 1719,1724 PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p) --- 1719,1806 } /* + * StandbyRecoverPreparedTransactions + * + * Scan the pg_twophase directory and setup all the required information to + * allow standby queries to treat prepared transactions as still active. + * This is never called at the end of recovery - we use + * RecoverPreparedTransactions() at that point. + * + * Currently we simply call SubTransSetParent() for any subxids of prepared + * transactions. + */ + void + StandbyRecoverPreparedTransactions(bool can_overwrite) + { + DIR *cldir; + struct dirent *clde; + + cldir = AllocateDir(TWOPHASE_DIR); + while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL) + { + if (strlen(clde->d_name) == 8 && + strspn(clde->d_name, "0123456789ABCDEF") == 8) + { + TransactionId xid; + char *buf; + TwoPhaseFileHeader *hdr; + TransactionId *subxids; + int i; + + xid = (TransactionId) strtoul(clde->d_name, NULL, 16); + + /* Already processed? */ + if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid)) + { + ereport(WARNING, + (errmsg("removing stale two-phase state file \"%s\"", + clde->d_name))); + RemoveTwoPhaseFile(xid, true); + continue; + } + + /* Read and validate file */ + buf = ReadTwoPhaseFile(xid, true); + if (buf == NULL) + { + ereport(WARNING, + (errmsg("removing corrupt two-phase state file \"%s\"", + clde->d_name))); + RemoveTwoPhaseFile(xid, true); + continue; + } + + /* Deconstruct header */ + hdr = (TwoPhaseFileHeader *) buf; + if (!TransactionIdEquals(hdr->xid, xid)) + { + ereport(WARNING, + (errmsg("removing corrupt two-phase state file \"%s\"", + clde->d_name))); + RemoveTwoPhaseFile(xid, true); + pfree(buf); + continue; + } + + /* + * Examine subtransaction XIDs ... 
they should all follow main + * XID, and they may force us to advance nextXid. + */ + subxids = (TransactionId *) + (buf + MAXALIGN(sizeof(TwoPhaseFileHeader))); + for (i = 0; i < hdr->nsubxacts; i++) + { + TransactionId subxid = subxids[i]; + + Assert(TransactionIdFollows(subxid, xid)); + SubTransSetParent(xid, subxid, can_overwrite); + } + } + } + FreeDir(cldir); + } + + /* * RecoverPreparedTransactions * * Scan the pg_twophase directory and reload shared-memory state for each *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 494,499 static XLogRecPtr minRecoveryPoint; /* local copy of --- 494,501 * ControlFile->minRecoveryPoint */ static bool updateMinRecoveryPoint = true; + static bool reachedMinRecoveryPoint = false; + static bool InRedo = false; /* *** *** 547,552 static void ValidateXLOGDirectoryStructure(void); --- 549,555 static void CleanupBackupHistory(void); static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force); static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt); + static void CheckRecoveryConsistency(void); static bool ValidXLOGHeader(XLogPageHead
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Thu, Apr 8, 2010 at 6:16 AM, Simon Riggs wrote: > If standby_mode is enabled and there is no source of WAL, then we get a > stream of messages saying > > LOG: record with zero length at 0/C88 > ... > > but most importantly we never get to the main recovery loop, so Hot > Standby never gets to start at all. We can't keep retrying the request > for WAL and at the same time enter the retry loop, executing lots of > things that expect non-NULL pointers using a NULL xlog pointer. This is pretty much a corner case, so I don't think it's a good idea to add a new mode to handle it. It also seems like it would be pretty inconsistent if we allow WAL to be dropped in pg_xlog, but only if we are also doing archive recovery or streaming replication. If we can't support this case with the same code path we use otherwise, I think we should revert to disallowing it. Having said that, I guess I don't understand how having a source of WAL solves the problem described above. Do we always have to read at least 1 byte of WAL from either SR or the archive before starting up? If not, why do we need to do so here? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Thu, 2010-04-08 at 13:33 +0300, Heikki Linnakangas wrote: > > If standby_mode is enabled and there is no source of WAL, then we get a > > stream of messages saying > > > > LOG: record with zero length at 0/C88 > > ... > > > > but most importantly we never get to the main recovery loop, so Hot > > Standby never gets to start at all. We can't keep retrying the request > > for WAL and at the same time enter the retry loop, executing lots of > > things that expect non-NULL pointers using a NULL xlog pointer. > > You mean it can't find even the checkpoint record to start replaying? Clearly I don't mean that. Otherwise it wouldn't be "start from a shutdown checkpoint". I think you are misunderstanding me. Let me explain in more detail though please also read the patch before replying, if you do. The patch I submitted at top of this thread works for allowing Hot Standby during recovery. Yes, of course that occurs when the database is consistent. The trick is to get recovery to the point where it can be enabled. The second patch on this thread presents a way to get the database to that point; it touches some of the other recovery code that you and Masao have worked on. We *must* touch that code if we are to enable Hot Standby in the way you desire. In StartupXlog() when we get to the point where we "Find the first record that logically follows the checkpoint", in the current code ReadRecord() loops forever, spitting out LOG: record with zero length at 0/C88 ... That prevents us from going further down StartupXLog() to the point where we start the InRedo loop and hence start hot standby. As long as we retry we cannot progress further: this is the main problem. So in the patch, I have modified the retry test in ReadRecord() so it no longer retries iff there is no WAL source defined. Now, when ReadRecord() exits, record == NULL at that point and so we do not (and cannot) enter the redo loop. So I have introduced the new mode ("snapshot mode") to enter hot standby anyway. That avoids us having to screw around with the loop logic for redo. I don't see any need to support the case of where we have no WAL source defined, yet we want Hot Standby but we also want to allow somebody to drop a WAL file into pg_xlog at some future point. That has no use case of value AFAICS and is too complex to add at this stage of the release cycle. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
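The control-flow change being described reduces to a single condition. An illustrative stand-alone rendering of it follows; the helper and parameter names are invented for clarity and this is not the patch text.

#include <stdbool.h>

/*
 * Illustrative only: the retry test described above.  In standby mode we
 * normally loop retrying a failed WAL read, but if no source of new WAL is
 * configured there is nothing to wait for, so ReadRecord() should give up
 * and return NULL, letting StartupXLOG() fall through and, with the
 * proposed patch, enter "snapshot mode" and accept read-only connections.
 */
static bool
should_retry_wal_read(bool standby_mode,
                      bool have_restore_command,    /* archive recovery configured */
                      bool have_primary_conninfo)   /* streaming configured */
{
    if (!standby_mode)
        return false;           /* a bad record simply ends recovery */

    return have_restore_command || have_primary_conninfo;
}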
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
Simon Riggs wrote: > On Tue, 2010-04-06 at 10:22 +0100, Simon Riggs wrote: > >> Initial patch. I will be testing over next day. No commit before at >> least midday on Wed 7 Apr. > > Various previous discussions sidelined a very important point: what > exactly does it mean to "start recovery from a shutdown checkpoint"? Hot standby should be possible as soon we know that the database is consistent. That is, as soon as we've replayed WAL past the minRecoveryPoint/backupStartPoint point indicated in pg_control. > If standby_mode is enabled and there is no source of WAL, then we get a > stream of messages saying > > LOG: record with zero length at 0/C88 > ... > > but most importantly we never get to the main recovery loop, so Hot > Standby never gets to start at all. We can't keep retrying the request > for WAL and at the same time enter the retry loop, executing lots of > things that expect non-NULL pointers using a NULL xlog pointer. You mean it can't find even the checkpoint record to start replaying? I think the behavior in that scenario is fine as it is. The database isn't consistent (or at least we can't know if it is, because we don't know the redo pointer) until you read and replay the first checkpoint record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Startup at shutdown checkpoint
On Tue, 2010-04-06 at 10:22 +0100, Simon Riggs wrote: > Initial patch. I will be testing over next day. No commit before at > least midday on Wed 7 Apr. Various previous discussions sidelined a very important point: what exactly does it mean to "start recovery from a shutdown checkpoint"? If standby_mode is enabled and there is no source of WAL, then we get a stream of messages saying LOG: record with zero length at 0/C88 ... but most importantly we never get to the main recovery loop, so Hot Standby never gets to start at all. We can't keep retrying the request for WAL and at the same time enter the retry loop, executing lots of things that expect non-NULL pointers using a NULL xlog pointer. What we are asking for here is a completely new state: the database is not "in recovery" - by definition there is nothing at all to recover. The following patch adds "Snapshot Mode", a very simple variation on the existing code - emphasis on the "simple": LOG: entering snapshot mode LOG: record with zero length at 0/C88 LOG: consistent recovery state reached at 0/C88 LOG: database system is ready to accept read only connections this mode does *not* continually check to see if new WAL files have been added. Startup just sits and waits, backends allowed. If a trigger file is specified, then we can leave recovery. Otherwise Startup process just sits doing nothing. There's possibly an argument for inventing some more special modes where we do allow read only connections but don't start the bgwriter. I don't personally wish to do this at this stage of the release cycle. The attached patch is non-invasive and safe and I want to leave it at that. I will be committing later today, unless major objections, but I ask you to read the patch before you sharpen your pen. It's simple. -- Simon Riggs www.2ndQuadrant.com diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index e2566a4..365cd17 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1719,6 +1719,88 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p) } /* + * StandbyRecoverPreparedTransactions + * + * Scan the pg_twophase directory and setup all the required information to + * allow standby queries to treat prepared transactions as still active. + * This is never called at the end of recovery - we use + * RecoverPreparedTransactions() at that point. + * + * Currently we simply call SubTransSetParent() for any subxids of prepared + * transactions. + */ +void +StandbyRecoverPreparedTransactions(bool can_overwrite) +{ + DIR *cldir; + struct dirent *clde; + + cldir = AllocateDir(TWOPHASE_DIR); + while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL) + { + if (strlen(clde->d_name) == 8 && + strspn(clde->d_name, "0123456789ABCDEF") == 8) + { + TransactionId xid; + char *buf; + TwoPhaseFileHeader *hdr; + TransactionId *subxids; + int i; + + xid = (TransactionId) strtoul(clde->d_name, NULL, 16); + + /* Already processed? 
*/ + if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid)) + { +ereport(WARNING, + (errmsg("removing stale two-phase state file \"%s\"", +clde->d_name))); +RemoveTwoPhaseFile(xid, true); +continue; + } + + /* Read and validate file */ + buf = ReadTwoPhaseFile(xid, true); + if (buf == NULL) + { +ereport(WARNING, + (errmsg("removing corrupt two-phase state file \"%s\"", + clde->d_name))); +RemoveTwoPhaseFile(xid, true); +continue; + } + + /* Deconstruct header */ + hdr = (TwoPhaseFileHeader *) buf; + if (!TransactionIdEquals(hdr->xid, xid)) + { +ereport(WARNING, + (errmsg("removing corrupt two-phase state file \"%s\"", + clde->d_name))); +RemoveTwoPhaseFile(xid, true); +pfree(buf); +continue; + } + + /* + * Examine subtransaction XIDs ... they should all follow main + * XID, and they may force us to advance nextXid. + */ + subxids = (TransactionId *) +(buf + MAXALIGN(sizeof(TwoPhaseFileHeader))); + for (i = 0; i < hdr->nsubxacts; i++) + { +TransactionId subxid = subxids[i]; + +Assert(TransactionIdFollows(subxid, xid)); +SubTransSetParent(xid, subxid, can_overwrite); + } + } + } + FreeDir(cldir); +} + +/* * RecoverPreparedTransactions * * Scan the pg_twophase directory and reload shared-memory state for each diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3000ab7..d26b369 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -180,6 +180,7 @@ static TimestampTz recoveryLastXTime = 0; /* options taken from recovery.conf for XLOG streaming */ static bool StandbyMode = false; +static bool SnapshotMode = false; static char *PrimaryConnInfo = NULL; char *TriggerFile = NULL; @@ -3845,7 +3846,7 @@ next_record_is_invalid: } /* In standby-mode, keep trying *
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Tue, 2010-03-02 at 12:47 -0800, Marc Munro wrote: > IIUC this is only a problem for WAL from HOT updates and vacuums. If no > vacuums or HOT updates have been performed, there is no risk of > returning bad data. So WAL that does not contain HOT updates or vacuums > could be applied on the standby without risk, even if there are > long-running queries in play. This is not a complete solution but may > reduce the likelihood of queries having to be cancelled. I guess the > approach here would be to check WAL before applying it, and only cancel > queries if the WAL contains HOT updates or vacuums. That's what we do. > Taking the idea further, if WAL records contained the tid of the latest > tuples that were overwritten, even more WAL could be applied without > having to cancel queries. > > To take it further still, if vacuum on the master could be prevented > from touching records that are less than max_standby_delay seconds old, > it would be safe to apply WAL from the very latest vacuum. I guess HOT > could be handled similarly though that may eliminate much of the > advantage of HOT updates. Thanks for your ideas. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On 3/2/10 12:47 PM, Marc Munro wrote: > To take it further still, if vacuum on the master could be prevented > from touching records that are less than max_standby_delay seconds old, > it would be safe to apply WAL from the very latest vacuum. I guess HOT > could be handled similarly though that may eliminate much of the > advantage of HOT updates. Aside from the inability to convert between transaction count and time, isn't this what vacuum_defer_cleanup_age is supposed to do? Or does it not help with HOT updates? --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Fri, 2010-02-26 at 03:33 -0500, Greg Smith wrote: > I really hope this discussion can say focused on if and how it's > possible to improve this area, with the goal being to deliver a product > everyone can be proud of with the full feature set that makes this next > release a killer one. The features that have managed to all get into > this release already are fantastic, everyone who contributed should be > proud of that progress, and it's encouraging that the alpha4 date was > nailed. It would be easy to descend into finger-pointing for why > exactly this particular problem is only getting more visibility now, or > into schedule-oriented commentary suggesting it must be ignored because > it's too late to do anything about it. I hope everyone appreciates > wandering that way will not help make PostgreSQL 9.0 a better release. > This issue is so easy to encounter, and looks so bad when it happens, > that I feel it could easily lead to an embarrassing situation for the > community if something isn't done about it before release. Thanks Greg. It's a great relief for me to hear someone else say this and to watch a discussion about this important issue unfold. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Fri, 2010-02-26 at 16:44 -0500, Tom Lane wrote: > Greg Stark writes: > > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane wrote: > >> There's *definitely* not going to be enough information in the WAL > >> stream coming from a master that doesn't think it has HS slaves. > >> We can't afford to record all that extra stuff in installations for > >> which it's just useless overhead. BTW, has anyone made any attempt > >> to measure the performance hit that the patch in its current form is > >> creating via added WAL entries? > > > What extra entries? > > Locks, just for starters. I haven't read enough of the code yet to know > what else Simon added. In the past it's not been necessary to record > any transient information in WAL, but now we'll have to. There is room for technical confusion here, so I'll just add some info. There was/is potential for performance hit because of the volume of additional WAL *and* the processing overhead from that additional WAL. As Heikki points out these turn out to be minimal, though this has been by careful design. There is also potential for a performance hit because incoming cleanup records may conflict with currently executing queries. If we knew for certain that the master was not sending any cleanup records that would effect current standby queries we could avoid that overhead altogether. That's a good reason for integrating a solution. AccessExclusiveLock lock records are nothing at all to do with that. They exist simply to prevent obvious correctness issues such as somebody reading a file while it is being deleted. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark wrote: On Sun, Feb 28, 2010 at 5:28 AM, Greg Smith wrote: The idea of the workaround is that if you have a single long-running query to execute, and you want to make sure it doesn't get canceled because of a vacuum cleanup, you just have it connect back to the master to keep an open snapshot the whole time. Also, I'm not sure this actually works. When your client makes this additional connection to the master it's connecting at some transaction in the future from the slave's point of view. The master could have already vacuumed away some record which the snapshot the client gets on the slave will have in view. Right, and there was an additional comment in the docs alluding to some sleep time on the master that intends to try and improve things. If you knew how long archive_timeout was you could try to sleep longer than it to try and increase your odds of avoiding an ugly spot. But there are race conditions galore possible here, particularly if your archiver or standby catchup is backlogged. Still it's a handy practical trick even if it isn't 100% guaranteed to work. But I don't think it provides the basis for something we can bake in. Agreed on both counts, which is why it's in the current docs as a workaround people can consider, but not what I've been advocating as the right way to proceed. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Bruce Momjian wrote: "The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This guarantees that a WAL cleanup record is never generated and query conflicts do not occur, as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms." I am unclear how you would easily advance the snapshot as each query completes on the slave. The idea of the workaround is that if you have a single long-running query to execute, and you want to make sure it doesn't get canceled because of a vacuum cleanup, you just have it connect back to the master to keep an open snapshot the whole time. That's basically the same idea that vacuum_defer_cleanup_age implements, except you don't have to calculate a value--you just hold open the snapshot to do it. When that query ended, its snapshot would be removed, and then the master would advance to whatever the next latest one is. Nothing fancier than that. The only similarity is that if you made every query that happened on the standby do that, it would effectively be the same behavior I'm suggesting could be available via the standby->master xmin publication. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
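As a rough illustration of that workaround, here is a variant of the dblink/pg_sleep recipe written against libpq: hold an open snapshot on the primary for the life of the standby query. The connection strings and table name are placeholders, error handling is trimmed, and, as noted elsewhere in the thread, this is a practical trick rather than a guarantee.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* "primary" and "standby" conninfo strings are placeholders. */
    PGconn     *primary = PQconnectdb("host=primary dbname=app");
    PGconn     *standby = PQconnectdb("host=standby dbname=app");
    PGresult   *res;

    if (PQstatus(primary) != CONNECTION_OK ||
        PQstatus(standby) != CONNECTION_OK)
        return 1;

    /*
     * Take and hold a snapshot on the primary; its backend xmin now holds
     * back cleanup of anything the standby query may still need to see.
     */
    PQclear(PQexec(primary, "BEGIN ISOLATION LEVEL SERIALIZABLE"));
    PQclear(PQexec(primary, "SELECT 1"));

    /* The long-running read-only query on the standby. */
    res = PQexec(standby, "SELECT count(*) FROM big_table");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("count = %s\n", PQgetvalue(res, 0, 0));
    PQclear(res);

    PQclear(PQexec(primary, "COMMIT"));
    PQfinish(primary);
    PQfinish(standby);
    return 0;
}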
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > Josh Berkus wrote: > > > Now that I think about it, the xmin thing really doesn't seem > > conceptually difficult. If the slave just opens a 2nd, special query > > connection back to the master and publishes its oldest xmin there, as > > far as the master is concerned, it's just another query backend. > > Could it be that easy? > > > > Something just like that is in fact already suggested as a workaround in > the Hot Standby manual: > > "The first option is to connect to the primary server and keep a query > active for as long as needed to run queries on the standby. This > guarantees that a WAL cleanup record is never generated and query > conflicts do not occur, as described above. This could be done using > contrib/dblink and pg_sleep(), or via other mechanisms." I am unclear how you would easily advance the snapshot as each query completes on the slave. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Josh Berkus wrote: Hey, take it easy! I read Stark's post as tongue-in-cheek, which I think it was. Yeah, I didn't get that. We've already exchanged mutual off-list apologies for the misunderstanding in both directions, I stopped just short of sending flowers. I did kick off this discussion with noting a clear preference this not wander into any personal finger-pointing. And I am far too displeased with the technical situation here to have much of a sense of humor left about it either. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
> Thank you for combining a small personal attack with a selfish > commentary about how yours is the only valid viewpoint. Saves me a lot > of trouble replying to your messages, can just ignore them instead if > this is how you're going to act. Hey, take it easy! I read Stark's post as tongue-in-cheek, which I think it was. Though, Stark, if you're going to be flip, I'd suggest using a smiley next time. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
> The part I still don't have good visibility on is how much of the > necessary SR infrastructure needed to support this communications > channel is already available in some form. I had though the walsender > on the master was already receiving messages sometimes from the > walreceiver on the standby, but I'm getting the impression from Heikki's > comments that this not the case at all yet. I don't think asking for a 2nd connection back from the standby to the master would be horrible for 9.0. I think it would be quite reasonable, actually; even with 2 connections per slave, you could still have quite a few slaves before the # of connections bogged down the master. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Josh Berkus wrote: Now that I think about it, the xmin thing really doesn't seem conceptually difficult. If the slave just opens a 2nd, special query connection back to the master and publishes its oldest xmin there, as far as the master is concerned, it's just another query backend. Could it be that easy? Something just like that is in fact already suggested as a workaround in the Hot Standby manual: "The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This guarantees that a WAL cleanup record is never generated and query conflicts do not occur, as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms." And the idea of doing it mainly in client land has its attractions. The main reason I wandered toward asking about it in the context of SR is that there's already this open "Standby delay on idle system" issue with Hot Standby, and the suggested resolution for that problem involves publishing keep-alive data with timestamps over SR. While all these problems and potential solutions have been floating around for a long time, as you pointed out, the little flash of insight I had here was that it's possible to bundle these two problems together with a combined keep-alive timestamp+xmin message that goes in both directions. That removes one serious Hot Standby issue altogether, and adds an additional conflict avoidance mechanism for people who want to enable it, all with something that needs to get done sooner or later anyway for sync rep. The part I still don't have good visibility on is how much of the necessary SR infrastructure needed to support this communications channel is already available in some form. I had thought the walsender on the master was already receiving messages sometimes from the walreceiver on the standby, but I'm getting the impression from Heikki's comments that this is not the case at all yet. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
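To make the combined message concrete, the payload being discussed amounts to something like the following pair of structures. These are invented purely for illustration -- the field names, types and layout are assumptions, and nothing like this exists in the tree yet:

    #include <stdint.h>

    typedef uint32_t TxidSketch;        /* stand-in for TransactionId */
    typedef uint64_t WalPtrSketch;      /* stand-in for XLogRecPtr */
    typedef int64_t  TimestampSketch;   /* stand-in for TimestampTz */

    /*
     * Master -> standby: sent periodically even when no WAL is being
     * generated, so an idle system still advances the standby's notion of
     * how far behind it is (the "standby delay on idle system" issue).
     */
    typedef struct MasterKeepalive
    {
        WalPtrSketch    sentUpTo;       /* end of WAL streamed so far */
        TimestampSketch masterTime;     /* master's clock when sent */
    } MasterKeepalive;

    /*
     * Standby -> master: lets the master hold back cleanup just far enough
     * to cover whatever is still running on the standby.
     */
    typedef struct StandbyKeepalive
    {
        WalPtrSketch    appliedUpTo;    /* WAL replayed so far */
        TimestampSketch standbyTime;    /* standby's clock when sent */
        TxidSketch      oldestXmin;     /* oldest snapshot still in use, or 0 */
    } StandbyKeepalive;

The xmin field is the only addition beyond what a sync rep keep-alive would need anyway, which is the "needs to get done sooner or later" point above.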
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg, > If you think of it in those terms, the idea that "you need to run PITR > backup/archive recovery" to not get that behavior isn't an important > distinction anymore. If you run SR with the option enabled you could > get it, any other setup and you won't. +1. I always expected that we'd get this kind of behavior with "synch" in 9.1. I can see that there are two desired modes of behaviour depending on what the replication config is: 1) Master full-speed, degraded operation on slaves: this is the current max_standby_delay approach. It has the advantage of supporting possibly hundreds of slaves, and certainly dozens. 2) Master burdened, full operation on slaves: this is the publish-xmin-back-to-master approach, which IIRC the core team first discussed at pgCon 2008 before Simon started work, and which you and Tom seem to think can be done soon. I can see people wanting to use either mode depending on their use-case. Or, for that matter, using both modes to different slaves. Now that I think about it, the xmin thing really doesn't seem conceptually difficult. If the slave just opens a 2nd, special query connection back to the master and publishes its oldest xmin there, as far as the master is concerned, it's just another query backend. Could it be that easy? Also, I'm serious about what I suggested earlier for "delay" mode. We should have an option for cancelled queries to be immediately retried, if that's feasible. It would turn something which is now a major application design issue (lots of cancelled queries) into just degraded performance. Overall, though, I'd say that getting 9.0 out the door relatively on-time is more important than getting it perfect. "Release early, release often" isn't just a mantra; it's a very good idea if you want your project to keep improving and not bog down and fall apart. --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
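Client-side retry is something applications can already approximate on their own; a minimal libpq sketch follows. The connection string and retry limit are placeholders, and the assumption that recovery-conflict cancellations report SQLSTATE 40001 (serialization_failure) is mine rather than something taken from the patch -- check the actual error code before relying on it:

    /*
     * Sketch of retrying a query cancelled by a recovery conflict on the
     * standby.  Connection string, retry limit and the SQLSTATE tested for
     * are assumptions.
     */
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *standby = PQconnectdb("host=standby dbname=app");
        int     attempt;

        if (PQstatus(standby) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(standby));
            return 1;
        }

        for (attempt = 0; attempt < 5; attempt++)
        {
            PGresult   *res = PQexec(standby,
                                 "SELECT sum(abalance) FROM pgbench_accounts");
            const char *sqlstate;

            if (PQresultStatus(res) == PGRES_TUPLES_OK)
            {
                printf("sum = %s\n", PQgetvalue(res, 0, 0));
                PQclear(res);
                break;
            }

            sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
            if (sqlstate && strcmp(sqlstate, "40001") == 0)
            {
                /* cancelled due to conflict with recovery: just try again */
                PQclear(res);
                continue;
            }

            fprintf(stderr, "query failed: %s", PQerrorMessage(standby));
            PQclear(res);
            break;
        }

        PQfinish(standby);
        return 0;
    }

Whether such a loop is acceptable depends on the query: a retried 8-hour report still wastes the first run, which is Greg Smith's objection elsewhere in this thread.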
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > Joshua D. Drake wrote: > > On Sat, 27 Feb 2010 00:43:48 +, Greg Stark wrote: > > > >> I want my ability to run large batch queries without any performance > >> or reliability impact on the primary server. > >> > > > > +1 > > > > I can use any number of other technologies for high availability. > > > > Remove "must be an instant-on failover at the same time" from the > requirements and you don't even need 9.0 to handle that; this has been a > straightforward problem to solve since 8.2. It's the combination of HA > and queries that makes things hard to do. > > If you just want batch queries on another system without being concerned > about HA at the same time, the first option is to just fork the base > backup and WAL segment delivery to another server and run queries there. That is a lot of administrative overhead. It is hard to say it is equivalent to HS. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Heikki Linnakangas wrote: > Dimitri Fontaine wrote: >> Bruce Momjian writes: >>> Doesn't the system already adjust the delay based on the length of slave >>> transactions, e.g. max_standby_delay. It seems there is no need for a >>> user switch --- just max_standby_delay really high. >> Well that GUC looks like it allows to set a compromise between HA and >> reporting, not to say "do not ever give the priority to the replay while >> I'm running my reports". At least that's how I understand it. > > max_standby_delay=-1 does that. The documentation needs to be updated to > reflect that, it currently says: > >> There is no wait-forever setting because of the potential for deadlock which >> that setting would introduce. This parameter can only be set in the >> postgresql.conf file or on the server command line. > > but that is false, -1 means wait forever. Simon removed that option at > one point, but it was later put back and apparently the documentation > was never updated. I've put back the mention of max_standby_delay=-1 option in the docs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Dimitri Fontaine wrote: > Bruce Momjian writes: >> Doesn't the system already adjust the delay based on the length of slave >> transactions, e.g. max_standby_delay. It seems there is no need for a >> user switch --- just max_standby_delay really high. > > Well that GUC looks like it allows to set a compromise between HA and > reporting, not to say "do not ever give the priority to the replay while > I'm running my reports". At least that's how I understand it. max_standby_delay=-1 does that. The documentation needs to be updated to reflect that, it currently says: > There is no wait-forever setting because of the potential for deadlock which > that setting would introduce. This parameter can only be set in the > postgresql.conf file or on the server command line. but that is false, -1 means wait forever. Simon removed that option at one point, but it was later put back and apparently the documentation was never updated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Aidan Van Dyk wrote: Would we (ya, the royal we) be willing to say that if you want the benefit of removing the MVCC overhead of long-running queries you need to run PITR backup/archive recovery, and if you want SR, you get a closed-loop master-follows-slave-xmin behaviour? To turn that question around a little, I think it's reasonable to say that closed-loop master-follows-slave-xmin behavior is only practical to consider implementing with SR--and even there, it should be optional rather than required until there's more field experience on the whole thing. Whether it's the default or not could take a bit of debate to sort out too. If you think of it in those terms, the idea that "you need to run PITR backup/archive recovery" to not get that behavior isn't an important distinction anymore. If you run SR with the option enabled you could get it, any other setup and you won't. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
* Greg Smith [100226 23:39]: > Just not having the actual query running on the master is such a > reduction in damage that I think it's delivering the essence of what > people are looking for regardless. That it might be possible in some > cases to additionally avoid the overhead that comes along with any > long-running query is a nice bonus, and it's great the design allows for > that possibility. But if that's only possible with risk, heavy > tweaking, and possibly some hacks, I'm not sure that's making the right > trade-offs for everyone. Would we (ya, the royal we) be willing to say that if you want the benefit of removing the MVCC overhead of long-running queries you need to run PITR backup/archive recovery, and if you want SR, you get a closed-loop master-follows-slave-xmin behaviour? a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark wrote: But if they move from having a plain old PITR warm standby to having one they can run queries on they might well assume that the big advantage of having the standby to play with is precisely that they can do things there that they have never been able to do on the master previously without causing damage. Just not having the actual query running on the master is such a reduction in damage that I think it's delivering the essence of what people are looking for regardless. That it might be possible in some cases to additionally avoid the overhead that comes along with any long-running query is a nice bonus, and it's great the design allows for that possibility. But if that's only possible with risk, heavy tweaking, and possibly some hacks, I'm not sure that's making the right trade-offs for everyone. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Joshua D. Drake wrote: On Sat, 27 Feb 2010 00:43:48 +, Greg Stark wrote: I want my ability to run large batch queries without any performance or reliability impact on the primary server. +1 I can use any number of other technologies for high availability. Remove "must be an instant-on failover at the same time" from the requirements and you don't even need 9.0 to handle that; this has been a straightforward problem to solve since 8.2. It's the combination of HA and queries that makes things hard to do. If you just want batch queries on another system without being concerned about HA at the same time, the first option is to just fork the base backup and WAL segment delivery to another server and run queries there. Some simple filesystem snapshot techniques will also suffice to handle it all on the same standby. Stop warm standby recovery, snapshot, trigger the server, run your batch job; once finished, rollback to the snapshot, grab the latest segment files, and resume standby catchup. Even the lame Linux LVM snapshot features can handle that job--one of my coworkers even has the whole thing scripted, this is so common. And if you have to go live because there's a failover, you're back to the same "cold standby" situation a large max_standby_delay puts you at, so it's not even very different from what you're going to get in 9.0 if this is your priority mix. The new version is just lowering the operational complexity involved. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Sat, 27 Feb 2010 00:43:48 +, Greg Stark wrote: > On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith wrote: >> This is also the reason why the whole "pause recovery" idea is a >> fruitless >> path to wander down. The whole point of this feature is that people >> have a >> secondary server available for high-availability, *first and foremost*, >> but >> they'd like it to do something more interesting that leave it idle all >> the >> time. The idea that you can hold off on applying standby updates for >> long >> enough to run seriously long reports is completely at odds with the idea >> of >> high-availability. > I want my ability to run large batch queries without any performance > or reliability impact on the primary server. +1 I can use any number of other technologies for high availability. Joshua D. Drake -- PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Sat, Feb 27, 2010 at 2:43 AM, Greg Smith wrote: > > But if you're running the 8 hour report on the master right now, aren't you > already exposed to a similar pile of bloat issues while it's going? If I > have the choice between "sometimes queries will get canceled" vs. "sometimes > the master will experience the same long-running transaction bloat issues as > in earlier versions even if the query runs on the standby", I feel like > leaning toward the latter at least leads to a problem people are used to. If they move from running these batch queries on the master to running them on the slave then sure, the situation will be no worse than before. But if they move from having a plain old PITR warm standby to having one they can run queries on they might well assume that the big advantage of having the standby to play with is precisely that they can do things there that they have never been able to do on the master previously without causing damage. I agree that having queries randomly and unpredictably canceled is pretty awful. My argument was that max_standby_delay should default to infinity on the basis that any other value has to be picked based on actual workloads and SLAs. My feeling is that there are probably only two types of configurations that make sense, a HA replica with a low max_standby_delay or a reporting replica with a high max_standby_delay. Any attempt to combine the two on the same system will only work if you know your application well and can make compromises with both. I would also like to be able to handle load balancing read-only queries but that will be really tricky. You want up-to-date data and you want to be able to run moderately complex queries. That kind of workload may well require synchronous replication to really work properly. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark wrote: Eh? That's not what I meant at all. Actually it's kind of the exact opposite of what I meant. Sorry about that--I think we just hit one of those language usage drift bits of confusion. "Sit in the corner" has a very negative tone to it in US English and I interpreted your message badly as a result. A Google search for images using that phrase will quickly show you what I mean. What I meant was that your description of the "High Availability first and foremost" is only one possible use case. Simon in the past expressed the same single-minded focus on that use case. It's a perfectly valid use case and I would probably agree if we had to choose just one it would be the most important. Sure, there are certainly others, and as much as possible more flexibility here is a good thing. What I was suggesting is that if the only good way to handle long-running queries has no choice but to sacrifice high-availability, which is the situation if max_standby_delay is the approach you use, then the most obvious users for this feature are not being well served by that situation. I would guess a large portion of the users looking forward to Hot Standby are in the "have an underutilized high-availability standby I'd like to use for offloading long running reports" camp, and if there is no way to serve them well this feature is missing the mark a bit. You really can't do any better without better master/standby integration though, and as pointed out a couple of times here that was considered and just not followed through on yet. I'm increasingly concerned that nothing else will really do though. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > Heikki Linnakangas wrote: > > One such landmine is that the keepalives need to flow from client to > > server while the WAL records are flowing from server to client. We'll > > have to crack that problem for synchronous replication too, but I think > > that alone is a big enough problem to make this 9.1 material. > > > > This seems to be the real sticking point then, given that the > xmin/PGPROC side on the master seems logically straightforward. For > some reason I thought the sync rep feature had the reverse message flow > already going, and that some other sort of limitation just made it > impractical to merge into the main codebase this early. My hope was > that just this particular part could get cherry-picked out of there, and > that it might even have been thought about already in that context given > the known HS keepalive "serious issue". If there was a solution or > partial solution in progress to that floating around, my thought was > that just piggybacking this extra xid info on top of it would be easy > enough. > > If there's not already a standby to primary communications backchannel > implementation available that can be harvested from that work, your > suggestion that this may not be feasible at all for 9.0 seems like a > more serious concern than I had thought it was going to be. I suspect the master could connect to the slave to pull an xid. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Heikki Linnakangas wrote: One such landmine is that the keepalives need to flow from client to server while the WAL records are flowing from server to client. We'll have to crack that problem for synchronous replication too, but I think that alone is a big enough problem to make this 9.1 material. This seems to be the real sticking point then, given that the xmin/PGPROC side on the master seems logically straightforward. For some reason I thought the sync rep feature had the reverse message flow already going, and that some other sort of limitation just made it impractical to merge into the main codebase this early. My hope was that just this particular part could get cherry-picked out of there, and that it might even have been thought about already in that context given the known HS keepalive "serious issue". If there was a solution or partial solution in progress to that floating around, my thought was that just piggybacking this extra xid info on top of it would be easy enough. If there's not already a standby to primary communications backchannel implementation available that can be harvested from that work, your suggestion that this may not be feasible at all for 9.0 seems like a more serious concern than I had thought it was going to be. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Bruce Momjian wrote: Well, I think the choice is either you delay vacuum on the master for 8 hours or pile up 8 hours of WAL files on the slave, and delay application, and make recovery much slower. It is not clear to me which option a user would prefer because the bloat on the master might be permanent. But if you're running the 8 hour report on the master right now, aren't you already exposed to a similar pile of bloat issues while it's going? If I have the choice between "sometimes queries will get canceled" vs. "sometimes the master will experience the same long-running transaction bloat issues as in earlier versions even if the query runs on the standby", I feel like leaning toward the latter at least leads to a problem people are used to. This falls into the principle of least astonishment category to me. Testing the final design for how transactions get canceled here led me to some really unexpected situations, and the downside for a mistake is "your query is lost". Had I instead discovered that sometimes long-running transactions on the standby can ripple back to cause a maintenance slowdown on the master, that's not great. But it would not have been so surprising, and it won't result in lost query results. I think people will expect that their queries cancel because of things like DDL changes. And the existing knobs allow inserting some slack for things like locks taking a little bit of time to acquire sometimes. What I don't think people will see coming is that a routine update on an unrelated table is going to kill a query they might have been waiting hours for the result of, just because that update crossed an autovacuum threshold for the other table and introduced a dead row cleanup. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Sat, Feb 27, 2010 at 1:53 AM, Greg Smith wrote: > Greg Stark wrote: >> >> Well you can go sit in the same corner as Simon with your high >> availability servers. >> >> I want my ability to run large batch queries without any performance >> or reliability impact on the primary server. >> > > Thank you for combining a small personal attack with a selfish commentary > about how yours is the only valid viewpoint. Saves me a lot of trouble > replying to your messages, can just ignore them instead if this is how > you're going to act. Eh? That's not what I meant at all. Actually it's kind of the exact opposite of what I meant. What I meant was that your description of the "High Availability first and foremost" is only one possible use case. Simon in the past expressed the same single-minded focus on that use case. It's a perfectly valid use case and I would probably agree if we had to choose just one it would be the most important. But we don't have to choose just one. There are other valid use cases such as load balancing and isolating your large batch queries from your production systems. I don't want us to throw out all these other use cases because we only consider high availability as the only use case we're interested in. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark wrote: Well you can go sit in the same corner as Simon with your high availability servers. I want my ability to run large batch queries without any performance or reliability impact on the primary server. Thank you for combining a small personal attack with a selfish commentary about how yours is the only valid viewpoint. Saves me a lot of trouble replying to your messages, can just ignore them instead if this is how you're going to act. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > You can think of the idea of passing an xmin back from the standby as > being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no > standby queries are running, but grows in size to match longer ones. And > you don't have to know anything to set it correctly; just toggle > on the proposed "feedback xid from the standby" feature and you're safe. Yes, there is no question there is value in passing the xid back to the master. My point is that it is like your blog entry: http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html you can't have all the options: 1. aggressive vacuum on master 2. fast recovery of slave 3. no query cancel on slave Again, pick two. Passing the xid back to the master gives you #2 and #3, and that might be good for some people, but not others. Do we have any idea which options most administrators would want? If we get xid passback to the master, do we keep the other configuration options? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Fri, Feb 26, 2010 at 11:56 PM, Greg Smith wrote: > This is also the reason why the whole "pause recovery" idea is a fruitless > path to wander down. The whole point of this feature is that people have a > secondary server available for high-availability, *first and foremost*, but > they'd like it to do something more interesting than leave it idle all the > time. The idea that you can hold off on applying standby updates for long > enough to run seriously long reports is completely at odds with the idea of > high-availability. Well you can go sit in the same corner as Simon with your high availability servers. I want my ability to run large batch queries without any performance or reliability impact on the primary server. You can have one or the other but you can't get both. If you set max_standby_delay low then you get your high availability server; if you set it high you get a useful report server. If you build sync replication, which we don't have today and which will open another huge can of usability worms when we haven't even finished bottling the two we've already opened, then you lose the lack of impact on the primary. Suddenly the queries you run on the slaves cause your production database to bloat. Plus you have extra network connections which take resources on your master and have to be kept up at all times or you lose your slaves. I think the design constraint of not allowing any upstream data flow is actually very valuable. Eventually we'll have it for sync replication but it's much better that we've built things incrementally and can be sure that nothing really depends on it for basic functionality. This is what allows us to know that the slave imposes no reliability impact on the master. It's what allows us to know that everything will work identically regardless of whether you have a walreceiver running or are running off archived log files. Remember I wanted to entirely abstract away the walreceiver and allow multiple wal communication methods. I think it would make more sense to use something like Spread to distribute the logs so the master only has to send them once and as many slaves as you want can pick them up. The current architecture doesn't scale very well if you want to have hundreds of slaves for one master. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > Bruce Momjian wrote: > > Doesn't the system already adjust the delay based on the length of slave > > transactions, e.g. max_standby_delay. It seems there is no need for a > > user switch --- just max_standby_delay really high. > > > > The first issue is that you're basically saying "I don't care about high > availability anymore" when you increase max_standby_delay to a high > value. Want to offload an 8 hour long batch report every day to the > standby? You can do it with max_standby_delay=8 hours. But the day > your master crashes 7 hours into that, you're in for a long wait before > your standby is available while it replays all the queued up segments. > Your 'hot standby' has actually turned into the old form of 'cold > standby' just when you need it to be responsive. Well, I think the choice is either you delay vacuum on the master for 8 hours or pile up 8 hours of WAL files on the slave, and delay application, and make recovery much slower. It is not clear to me which option a user would prefer because the bloat on the master might be permanent. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Bruce Momjian wrote: 5 Early cleanup of data still visible to the current query's snapshot #5 could be handled by using vacuum_defer_cleanup_age on the master. Why is vacuum_defer_cleanup_age not listed in postgresql.conf? I noticed that myself and fired off a corrective patch to Simon yesterday; he said it was intentional, but I'm not sure why that is yet. We'll sort that out. You are correct that my suggestion is targeting primarily #5 on this list. There are two problems with the possible solutions using that parameter though: -vacuum_defer_cleanup_age is set in a unit that people cannot be expected to work in--transaction ids. The UI is essentially useless, and there's no obvious way to make a better one. The best you can do will still be really fragile. -If you increase vacuum_defer_cleanup_age, it's active all the time. You're basically making every single transaction that could be cleaned up pay for the fact that a query *might* be running on the standby it needs to avoid. You can think of the idea of passing an xmin back from the standby as being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no standby queries are running, but grows in size to match longer ones. And you don't have to know anything to set it correctly; just toggle on the proposed "feedback xid from the standby" feature and you're safe. Expecting that anyone will ever set vacuum_defer_cleanup_age correctly in the field in its current form is pretty unreasonable I think. Since there's no timestamp-based memory of past xid activity, it's difficult to convert it to that form instead, and I think something in terms of time is what people would like to set this in. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
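The difference between the two knobs is easiest to see in how the master would compute its cleanup horizon. The following is only a schematic -- it ignores transaction ID wraparound and the helper functions are made up, so treat it as pseudocode in C syntax rather than the real GetOldestXmin():

    typedef unsigned int TxidSketch;

    extern TxidSketch OldestXminOfLocalBackends(void);   /* placeholder */
    extern TxidSketch OldestXminReportedByStandby(void); /* placeholder: 0 if none */

    /*
     * Static approach: vacuum_defer_cleanup_age pushes the horizon back by a
     * fixed number of transactions whether or not any standby query needs it.
     */
    TxidSketch
    CleanupHorizonStatic(TxidSketch vacuum_defer_cleanup_age)
    {
        TxidSketch horizon = OldestXminOfLocalBackends();

        if (horizon > vacuum_defer_cleanup_age)
            horizon -= vacuum_defer_cleanup_age;
        return horizon;
    }

    /*
     * Feedback approach: the horizon only moves back while a standby query is
     * actually holding an old snapshot, and snaps forward again when it ends.
     */
    TxidSketch
    CleanupHorizonWithFeedback(void)
    {
        TxidSketch horizon = OldestXminOfLocalBackends();
        TxidSketch standby = OldestXminReportedByStandby();

        if (standby != 0 && standby < horizon)
            horizon = standby;
        return horizon;
    }

The static version pays the cost all the time; the feedback version pays only while a standby snapshot is actually open, which is the "auto-tuning" property described above.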
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Bruce Momjian wrote: Doesn't the system already adjust the delay based on the length of slave transactions, e.g. max_standby_delay. It seems there is no need for a user switch --- just max_standby_delay really high. The first issue is that you're basically saying "I don't care about high availability anymore" when you increase max_standby_delay to a high value. Want to offload an 8 hour long batch report every day to the standby? You can do it with max_standby_delay=8 hours. But the day your master crashes 7 hours into that, you're in for a long wait before your standby is available while it replays all the queued up segments. Your 'hot standby' has actually turned into the old form of 'cold standby' just when you need it to be responsive. This is also the reason why the whole "pause recovery" idea is a fruitless path to wander down. The whole point of this feature is that people have a secondary server available for high-availability, *first and foremost*, but they'd like it to do something more interesting than leave it idle all the time. The idea that you can hold off on applying standby updates for long enough to run seriously long reports is completely at odds with the idea of high-availability. The second major problem is that the day the report actually takes 8.1 hours instead, because somebody else ran another report that slowed you down a little, you're screwed if that's something you depend on being available--it just got canceled only *after* wasting 8 hours of reporting resource time. max_standby_delay is IMHO only useful for allowing non-real-time web-app style uses of HS (think "Facebook status updates"), where you say "I'm OK giving people slightly out of date info sometimes if it lets me split the query load over two systems". Set max_standby_delay to a few seconds or maybe a minute, enough time to service a typical user query, make your app tolerate the occasional failed query and only send big ones to the master, and you've just scaled up all the small ones. Distributed queries with "eventual consistency" on all nodes is where many of the web app designs are going, and this feature is a reasonable match for that use case. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Dimitri Fontaine wrote: > Bruce Momjian writes: > > Doesn't the system already adjust the delay based on the length of slave > > transactions, e.g. max_standby_delay. It seems there is no need for a > > user switch --- just max_standby_delay really high. > > Well that GUC looks like it allows to set a compromise between HA and > reporting, not to say "do not ever give the priority to the replay while > I'm running my reports". At least that's how I understand it. Well, if you set it high, it effectively is that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Bruce Momjian writes: > Doesn't the system already adjust the delay based on the length of slave > transactions, e.g. max_standby_delay. It seems there is no need for a > user switch --- just max_standby_delay really high. Well that GUC looks like it allows to set a compromise between HA and reporting, not to say "do not ever give the priority to the replay while I'm running my reports". At least that's how I understand it. The feedback loop might get expensive on the master server when running reporting queries on the slave, unless you can "pause" it explicitly I think. I don't see how the system will guess that you're running a reporting server rather than a HA node, and max_standby_delay is just a way to tell the standby to please be nice in case of abuse. Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Dimitri Fontaine wrote: > Tom Lane writes: > > Well, as Heikki said, a stop-and-go WAL management approach could deal > > with that use-case. What I'm concerned about here is the complexity, > > reliability, maintainability of trying to interlock WAL application with > > slave queries in any sort of fine-grained fashion. > > Some admin functions for Hot Standby were removed from the patch to ease > its integration; there was a pause() and resume() feature. > > I think that offering this explicit control to the user would allow them > to choose between HA setup and reporting setup easily enough: just pause > the replay when running the reporting, resume it to get fresh data > again. If you don't pause, any query can get killed, replay is the > priority. Doesn't the system already adjust the delay based on the length of slave transactions, e.g. max_standby_delay. It seems there is no need for a user switch --- just max_standby_delay really high. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark writes: > On Fri, Feb 26, 2010 at 9:19 PM, Tom Lane wrote: >> There's *definitely* not going to be enough information in the WAL >> stream coming from a master that doesn't think it has HS slaves. >> We can't afford to record all that extra stuff in installations for >> which it's just useless overhead. BTW, has anyone made any attempt >> to measure the performance hit that the patch in its current form is >> creating via added WAL entries? > What extra entries? Locks, just for starters. I haven't read enough of the code yet to know what else Simon added. In the past it's not been necessary to record any transient information in WAL, but now we'll have to. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Tom Lane writes: > Well, as Heikki said, a stop-and-go WAL management approach could deal > with that use-case. What I'm concerned about here is the complexity, > reliability, maintainability of trying to interlock WAL application with > slave queries in any sort of fine-grained fashion. Some admin functions for Hot Standby were removed from the patch to ease its integration; there was a pause() and resume() feature. I think that offering this explicit control to the user would allow them to choose between HA setup and reporting setup easily enough: just pause the replay when running the reporting, resume it to get fresh data again. If you don't pause, any query can get killed, replay is the priority. Now as far as the feedback loop is concerned, I guess the pause() function would cause the slave to stop publishing any xmin in the master's procarray so that it's free to vacuum and archive whatever it wants to. Should the slave accumulate too much lag, it will resume from the archive rather than live from the SR link. How much does that help? Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
bruce wrote: > 4 The standby waiting longer than max_standby_delay to acquire a ... > #4 can be controlled by max_standby_delay, where a large value only > delays playback during crash recovery --- again, a rare occurrence. One interesting feature is that max_standby_delay will _only_ delay if it encounters a conflict when applying WAL, meaning only if there are long-running transactions. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark writes: > Why shouldn't it have any queries at walreceiver startup? It has any > xlog segments that were copied from the master and any it can find in > the archive, it could easily reach a consistent point long before it > needs to connect to the master. If you really want to protect your > master from any additional overhead you don't currently need to > configure a streaming connection at all, you can just use the file > shipping interface. There's *definitely* not going to be enough information in the WAL stream coming from a master that doesn't think it has HS slaves. We can't afford to record all that extra stuff in installations for which it's just useless overhead. BTW, has anyone made any attempt to measure the performance hit that the patch in its current form is creating via added WAL entries? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
> Well, as Heikki said, a stop-and-go WAL management approach could deal > with that use-case. What I'm concerned about here is the complexity, > reliability, maintainability of trying to interlock WAL application with > slave queries in any sort of fine-grained fashion. This sounds a bit brute-force, but what about simply having some form of automatic query retry on the slave? --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Stark writes: > On Fri, Feb 26, 2010 at 7:16 PM, Tom Lane wrote: >> I don't see a "substantial additional burden" there. What I would >> imagine is needed is that the slave transmits a single number back >> --- its current oldest xmin --- and the walsender process publishes >> that number as its transaction xmin in its PGPROC entry on the master. > And when we want to support cascading slaves? So? Fits right in. The walsender on the first-level slave is advertising an xmin from the second-level one, which will be included in what's passed back up to the master. > Or when you want to bring up a new slave and it suddenly starts > advertising a new xmin that's older than the current oldestxmin? How's it going to do that, when it has no queries at the instant of startup? > But in any case if I were running a reporting database I would want it > to just stop replaying logs for a few hours while my big batch report > runs, not cause the master to be unable to vacuum any dead records for > hours. That defeats much of the purpose of running the queries on the > slave. Well, as Heikki said, a stop-and-go WAL management approach could deal with that use-case. What I'm concerned about here is the complexity, reliability, maintainability of trying to interlock WAL application with slave queries in any sort of fine-grained fashion. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
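The master-side half of what Tom describes really is small; the following sketch shows the shape of it. The names and types are stand-ins (the real PGPROC, ProcArrayLock handling and sanity checks are all omitted), so this is illustration only, not a patch:

    typedef unsigned int TxidSketch;
    #define INVALID_TXID ((TxidSketch) 0)

    /* stand-in for the walsender's PGPROC entry */
    typedef struct ProcEntrySketch
    {
        TxidSketch  xmin;
    } ProcEntrySketch;

    static ProcEntrySketch walsenderProc;

    /*
     * Called when a feedback message arrives from the standby.  A real
     * implementation would hold ProcArrayLock and sanity-check the value
     * (e.g. never move the advertised xmin backwards past the global
     * horizon); here we just publish it, so that GetOldestXmin()-style
     * calculations treat the standby's oldest query like any local backend.
     */
    void
    ProcessStandbyXminFeedback(TxidSketch feedbackXmin)
    {
        walsenderProc.xmin = feedbackXmin;
    }

    /* Standby reports no running queries: stop holding back cleanup. */
    void
    StandbyFeedbackEnded(void)
    {
        walsenderProc.xmin = INVALID_TXID;
    }

The hard part, as the rest of the thread notes, is not this publication step but getting the feedback message from the standby to the master in the first place.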
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On 2/26/10 6:57 AM, Richard Huxton wrote: > > Can we not wait to cancel the transaction until *any* new lock is > attempted though? That should protect all the single-statement > long-running transactions that are already underway. Aggregates etc. I like this approach. Is it fragile in some non-obvious way? --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Heikki Linnakangas wrote: > > How to handle situations where the standby goes away for a while, > > such as a network outage, so that it doesn't block the master from ever > > cleaning up dead tuples is a concern. > > Yeah, that's another issue that needs to be dealt with. You'd probably > need some kind of a configurable escape valve in the master, to let it > ignore a standby's snapshot once it gets too old. > > > But I do know that the current Hot Standby implementation is going to be > > frustrating to configure correctly for people. > > Perhaps others who are not as deep into the code as I am will have a > better view on this, but I seriously don't think that's such a big > issue. I think the max_standby_delay setting is quite intuitive and easy > to explain. Sure, it would be better if there was no tradeoff between > killing queries and stalling recovery, but I don't think it'll be that > hard to understand the tradeoff. Let's look at the five documented cases of query conflict (from our manual): 1 Access Exclusive Locks from primary node, including both explicit LOCK commands and various DDL actions 2 Dropping tablespaces on the primary while standby queries are using those tablespaces for temporary work files (work_mem overflow) 3 Dropping databases on the primary while users are connected to that database on the standby. 4 The standby waiting longer than max_standby_delay to acquire a buffer cleanup lock. 5 Early cleanup of data still visible to the current query's snapshot We might have a solution to #1 by only cancelling queries that try to take locks. #2 and #3 seem like rare occurrences. #4 can be controlled by max_standby_delay, where a large value only delays playback during crash recovery --- again, a rare occurrence. #5 could be handled by using vacuum_defer_cleanup_age on the master. Why is vacuum_defer_cleanup_age not listed in postgresql.conf? In summary, I think passing snapshots to the master is not something possible for 9.0, and ideally we will never need to add that feature. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On Fri, Feb 26, 2010 at 10:21 AM, Heikki Linnakangas wrote: > Richard Huxton wrote: >> Can we not wait to cancel the transaction until *any* new lock is >> attempted though? That should protect all the single-statement >> long-running transactions that are already underway. Aggregates etc. > > Hmm, that's an interesting thought. You'll still need to somehow tell > the victim backend "you have to fail if you try to acquire any more > locks", but a single per-backend flag in the procarray would suffice. > > You could also clear the flag whenever you free the last snapshot in the > transaction (ie. between each query in read committed mode). Wow, that seems like it would help a lot. Although I'm not 100% sure I follow all the details of how this works. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Replying to my own post - first sign of madness... Let's see if I've got the concepts clear here, and hopefully my thinking it through will help others reading the archives. There are two queues: 1. Cleanup on the master 2. Replay on the slave Running write queries on the master adds to both queues. Running (read-only) queries on the slave prevents you removing from both queues. There are two interesting measurements of "age"/"size": 1. Oldest item in / length of queue (knowable) 2. How long will it take to clear the queue (estimable at best) You'd like to know #2 to keep up with your workload. Unfortunately, you can't for certain unless you have control over new incoming queries (on both master and slave). You might want four separate GUCs for the two measurements on the two queues. We currently have two that (sort of) match #1 "Oldest item" (vacuum_defer_cleanup_age, max_standby_delay). Delaying replay on a slave has no effect on the master. If a slave falls too far behind it's responsible for catch-up (via normal WAL archives). There is no point in delaying cleanup on the master unless it's going to help one or more slaves. In fact, you don't want to start delaying cleanup until you have to, otherwise you're wasting your delay time. This seems to be the case with vacuum_defer_cleanup_age. If I have a heavily-updated table and I defer vacuuming then when any given query starts on the slave it's going to be half used up already. There's also no point in deferring cleanup on the master if the standby is already waiting on a conflict that will cause its queries to be cancelled anyway. Not only won't it help, but it might make things worse since transactions will be cancelled, the conflict will be replayed and (presumably) queries will be re-submitted only to be cancelled again. This is what Greg Smith's discussion of the keep-alives was about. Giving the master enough information to be smarter about cleanup (and making the conflicts more fine-grained). The situation with deferring on one or both ends of process just gets more complicated with multiple slaves. There's all sorts of unpleasant feedback loops I can envisage there. For the case of single slave being used to run long reporting queries the ideal scenario would be the following. Master starts deferring vacuum activity just before the query starts. When that times out, the slave will receive the cleanup info, refuse to replay it and start its delay. This gives you a total available query time of: natural time between vacuums + vacuum delay + WAL transfer time + standby delay I can think of five useful things we should be doing (and might be already - don't know). 1. On the master, deduce whether the slave is already waiting on a query. If so, don't bother delaying cleanup. Clearly you don't want to be signalling hundreds of times a second though. Does the slave pause fetching via streaming replication if replay is blocked on a query? Could we signal "half-way to max-age" or some such? 2. Perhaps simpler than trying to make the master smarter, just allow SET this_transaction_is_probably_a_long_one=true on the slave. That (a) clears the queue on the slave and (b) sends the signal to the master which then starts deferring vacuum. 3. Do a burst of cleanup activity on the master after blocking. This should concentrate conflicts together when they reach the slave. 
Perhaps vacuum_defer_cleanup_age should be vacuum_deferred_queue_size and measure the amount of work to do, rather than the max age of the oldest cleanup (if I've understood correctly). 4. Do a burst of replay on the slave after blocking. Perhaps every time it cancels a transaction it should replay at least half the queued WAL before letting new transactions start. Or perhaps it replays any vacuum activity it comes across and then stops. That should sync with #2 assuming the slave doesn't lag the master too much. 5. I've been mixing "defer" and "delay", as do the docs. We should probably settle on one or the other. I think defer conveys the meaning more precisely, but what about non-native English speakers? -- Richard Huxton Archonet Ltd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Richard Huxton wrote: > Can we not wait to cancel the transaction until *any* new lock is > attempted though? That should protect all the single-statement > long-running transactions that are already underway. Aggregates etc. Hmm, that's an interesting thought. You'll still need to somehow tell the victim backend "you have to fail if you try to acquire any more locks", but a single per-backend flag in the procarray would suffice. You could also clear the flag whenever you free the last snapshot in the transaction (ie. between each query in read committed mode). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
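In code terms the idea is roughly the following. The structure, field and function names are invented for illustration; the real procarray and lock-manager changes would be more involved:

    #include <stdbool.h>

    typedef struct BackendEntrySketch
    {
        bool    conflictOnNextLock;     /* set by the startup process */
    } BackendEntrySketch;

    /*
     * Startup process, on replaying a conflicting cleanup record: instead of
     * cancelling the query outright, flag every potentially conflicting
     * backend.
     */
    void
    MarkBackendForConflict(BackendEntrySketch *proc)
    {
        proc->conflictOnNextLock = true;
    }

    /* Victim backend, at every new lock acquisition: */
    bool
    CheckForRecoveryConflict(BackendEntrySketch *myProc)
    {
        if (myProc->conflictOnNextLock)
        {
            /* ereport(ERROR, "canceling statement due to conflict with recovery") */
            return true;
        }
        return false;
    }

    /*
     * Victim backend, when the last snapshot of the transaction is freed
     * (between queries in read committed mode): the snapshot that could have
     * seen the removed rows is gone, so the flag can be cleared.
     */
    void
    ClearRecoveryConflictFlag(BackendEntrySketch *myProc)
    {
        myProc->conflictOnNextLock = false;
    }

The attraction is that a query which never takes another lock -- the long-running aggregate case Richard mentions -- would never be cancelled at all.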
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On 26/02/10 14:45, Heikki Linnakangas wrote: Richard Huxton wrote: On 26/02/10 08:33, Greg Smith wrote: I'm not sure what you might be expecting from the above combination, but what actually happens is that many of the SELECT statements on the table *that isn't even being updated* are canceled. You see this in the logs: Hmm - this I'd already figured out for myself. It's just occurred to me that this could well be the case between databases too. Database A gets vacuumed, B gets its queries kicked off on the standby. No, it's per-database already. Only queries in the same database are canceled. That's a relief. Dumb non-hacker question: why do we cancel all transactions rather than just those with "ACCESS SHARE" on the vacuumed table in question? Is it the simple fact that we don't know what table this particular section of WAL affects, or is it the complexity of tracking all this info? The problem is that even if transaction X doesn't have an (access share) lock on the vacuumed table at the moment, it might take one in the future. Simon proposed mechanisms for storing the information about vacuumed tables in shared memory, so that if X takes the lock later on it will get canceled at that point, but that's 9.1 material. I see - we'd need to age the list of vacuumed tables too, so when the oldest transactions complete the correct flags get cleared. Can we not wait to cancel the transaction until *any* new lock is attempted though? That should protect all the single-statement long-running transactions that are already underway. Aggregates etc. -- Richard Huxton Archonet Ltd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Richard Huxton wrote: > On 26/02/10 08:33, Greg Smith wrote: >> I'm not sure what you might be expecting from the above combination, but >> what actually happens is that many of the SELECT statements on the table >> *that isn't even being updated* are canceled. You see this in the logs: > > Hmm - this I'd already figured out for myself. It's just occurred to me > that this could well be the case between databases too. Database A gets > vacuumed, B gets its queries kicked off on the standby. No, it's per-database already. Only queries in the same database are canceled. > Dumb non-hacker question: why do we cancel all transactions rather than > just those with "ACCESS SHARE" on the vacuumed table in question? Is it > the simple fact that we don't know what table this particular section of > WAL affects, or is it the complexity of tracking all this info? The problem is that even if transaction X doesn't have an (access share) lock on the vacuumed table at the moment, it might take one in the future. Simon proposed mechanisms for storing the information about vacuumed tables in shared memory, so that if X takes the lock later on it will get canceled at that point, but that's 9.1 material. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On 26/02/10 14:10, Heikki Linnakangas wrote: Ideally the standby would stash away the old pages or tuples somewhere so that it can still access them even after replaying the WAL records that remove them from the main storage. I realize that's not going to happen any time soon because it's hard to do, but that would really be the most robust fix possible. Something like snapshotting a filesystem, so updates continue while you're still looking at a static version. -- Richard Huxton Archonet Ltd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
On 26/02/10 08:33, Greg Smith wrote: There are a number of HS tunables that interact with one another, and depending your priorities a few ways you can try to optimize the configuration for what I expect to be common use cases for this feature. > I've written a blog entry at http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html that tries to explain all that background clearly, It did too. Thanks for the nice summary people can be pointed at. I'm not sure what you might be expecting from the above combination, but what actually happens is that many of the SELECT statements on the table *that isn't even being updated* are canceled. You see this in the logs: Hmm - this I'd already figured out for myself. It's just occurred to me that this could well be the case between databases too. Database A gets vacuumed, B gets its queries kicked off on the standby. Granted lots of people just have the one main DB, but even so... LOG: restored log file "000100A5" from archive ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. STATEMENT: SELECT sum(abalance) FROM pgbench_accounts; Basically, every time a WAL segment appears that wipes out a tuple that SELECT expects should still be visible, because the dead row left behind by the update has been vacuumed away, the query is canceled. This happens all the time the way I've set this up, and I don't feel like this is a contrived demo. Having a long-running query on the standby while things get updated and then periodically autovacuumed on the primary is going to be extremely common in the sorts of production systems I expect want HS the most. I can pretty much see everyone wanting HS+SR. Thousands of small DBs running on VMs for a start. Free mostly-live backup? Got to be a winner. Dumb non-hacker question: why do we cancel all transactions rather than just those with "ACCESS SHARE" on the vacuumed table in question? Is it the simple fact that we don't know what table this particular section of WAL affects, or is it the complexity of tracking all this info? If you're running a system that also is using Streaming Replication, there is a much better approach possible. "Requires keep-alives with timestamps to be added to sync rep feature" If those keep-alives flowed in both directions, and included both timestamps *and* xid visibility information, the master could easily be configured to hold open xid snapshots needed for long running queries on the standby when that was necessary. Presumably meaning we need *another* config setting to prevent excessive bloat on a heavily updated table on the master. -- Richard Huxton Archonet Ltd -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby query cancellation and Streaming Replication integration
Greg Smith wrote: > Attached is a tar file with some test case demo scripts that demonstrate > the worst of the problems here IMHO. Thanks for that! We've been discussing this for ages, so it's nice to have a concrete example. > I don't want to belittle that work because it's been important to make > HS a useful standalone feature, but I feel like that's all triage rather > than looking for the most robust fix possible. Ideally the standby would stash away the old pages or tuples somewhere so that it can still access them even after replaying the WAL records that remove them from the main storage. I realize that's not going to happen any time soon because it's hard to do, but that would really be the most robust fix possible. > I don't know how difficult the keepalive feature was expected to be, and > there's certainly plenty of potential landmines in this whole xid export > idea. One such landmine is that the keepalives need to flow from client to server while the WAL records are flowing from server to client. We'll have to crack that problem for synchronous replication too, but I think that alone is a big enough problem to make this 9.1 material. > How to handle situations where the standby goes away for a while, > such as a network outage, so that it doesn't block the master from ever > cleaning up dead tuples is a concern. Yeah, that's another issue that needs to be dealt with. You'd probably need some kind of a configurable escape valve in the master, to let it ignore a standby's snapshot once it gets too old. > But I do know that the current Hot Standby implementation is going to be > frustrating to configure correctly for people. Perhaps others who are not as deep into the code as I am will have a better view on this, but I seriously don't think that's such a big issue. I think the max_standby_delay setting is quite intuitive and easy to explain. Sure, it would better if there was no tradeoff between killing queries and stalling recovery, but I don't think it'll be that hard to understand the tradeoff. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
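A rough sketch of the "escape valve" idea follows, using plain integers for transaction IDs (wraparound ignored) and invented names (effective_cleanup_xmin, MAX_STANDBY_XMIN_AGE): the master honours the standby's reported xmin only while it is reasonably fresh, so a stalled or disconnected standby cannot hold back cleanup on the master forever.

/* Illustrative sketch only; not PostgreSQL code. */
#include <stdio.h>
#include <stdint.h>

#define MAX_STANDBY_XMIN_AGE 1000000   /* assumed configurable limit */

/* Decide which xmin VACUUM on the master should respect. */
static uint64_t effective_cleanup_xmin(uint64_t local_xmin, uint64_t standby_xmin,
                                       uint64_t next_xid)
{
    /* Ignore the standby once its snapshot lags too far behind. */
    if (next_xid - standby_xmin > MAX_STANDBY_XMIN_AGE)
        return local_xmin;
    return standby_xmin < local_xmin ? standby_xmin : local_xmin;
}

int main(void)
{
    printf("fresh standby:   %llu\n",
           (unsigned long long) effective_cleanup_xmin(5000, 4500, 5200));
    printf("stalled standby: %llu\n",
           (unsigned long long) effective_cleanup_xmin(5000000, 1000, 5000000));
    return 0;
}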
Re: [HACKERS] Hot standby documentation
Fujii Masao wrote: > On Mon, Feb 8, 2010 at 10:34 PM, Bruce Momjian wrote: > > Ahh, good point. I had not considered the table would change. What I > > did was to mark "Slaves accept read-only queries" as "Hot only". > > Can the "warm standby" still reside in v9.0? If not, the mark of > "Hot only" seems odd to me. Yes, both hot and warm standby are supported in 9.0. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
On Mon, Feb 8, 2010 at 10:34 PM, Bruce Momjian wrote: > Ahh, good point. I had not considered the table would change. What I > did was to mark "Slaves accept read-only queries" as "Hot only". Can the "warm standby" still reside in v9.0? If not, the mark of "Hot only" seems odd to me. > I did not change "Master failure will never lose data" because the 9.0 > streaming implementation is not synchronous (see wal_sender_delay in > postgresql.conf), and I don't think even setting that to zero makes the > operation synchronous. I think we will have to wait for PG 9.1 for > _synchronous_ streaming replication. You are right. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
Markus Wanner wrote: > Bruce, > > Bruce Momjian wrote: > > Ah, I now realize it only mentions "warm" standby, not "hot", so I just > > updated the documentation to reflect that; you can see it here: > > Maybe the table below also needs an update, because unlike "Warm Standby > using PITR", a hot standby accepts read-only queries and can be > configured to not lose data on master failure. Ahh, good point. I had not considered the table would change. What I did was to mark "Slaves accept read-only queries" as "Hot only". You can see the result here: http://momjian.us/tmp/pgsql/high-availability.html I did not change "Master failure will never lose data" because the 9.0 streaming implementation is not synchronous (see wal_sender_delay in postgresql.conf), and I don't think even setting that to zero makes the operation synchronous. I think we will have to wait for PG 9.1 for _synchronous_ streaming replication. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
On Feb 7, 2010, at 12:35 PM, Josh Berkus wrote: >> I've always thought this feature was misnamed and nothing has happened >> to change my mind, but it's not clear whether I'm in the majority. > > I'm afraid force of habit is more powerful than correctness on this one. > It's going to be HS/SR whether that's perfectly correct or not. What would be correct? I thought HS/SR were pretty correct (as long as no one confuses SR with synchronous replication!). Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
> I've always thought this feature was misnamed and nothing has happened > to change my mind, but it's not clear whether I'm in the majority. I'm afraid force of habit is more powerful than correctness on this one. It's going to be HS/SR whether that's perfectly correct or not. --Josh Berkus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and DROP DATABASE
On Sat, 2010-02-06 at 17:32 +0100, Andres Freund wrote: > > > > So it seems at least the behavior is quite different from what the > > docs state. Am I missing something here? > It's a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase > > The line: > CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, > true); > > has to be > CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_DATABASE, > true); Well spotted, thanks for the report and the analysis. The code for drop database worked when committed but it looks like the re-factoring of the code broke it. Will fix. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
On Sun, Feb 7, 2010 at 4:41 AM, Markus Wanner wrote: > Bruce Momjian wrote: >> Do we want to call the feature "hot standby"? Is a read-only standby a >> "standby" or a "slave"? > > I think hot standby is pretty much the term, now. See here for the previous iteration of this discussion: http://archives.postgresql.org/pgsql-hackers/2009-08/msg00870.php I've always thought this feature was misnamed and nothing has happened to change my mind, but it's not clear whether I'm in the majority. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
Bruce, Bruce Momjian wrote: Ah, I now realize it only mentions "warm" standby, not "hot", so I just updated the documentation to reflect that; you can see it here: Maybe the table below also needs an update, because unlike "Warm Standby using PITR", a hot standby accepts read-only queries and can be configured to not lose data on master failure. Do we want to call the feature "hot standby"? Is a read-only standby a "standby" or a "slave"? I think hot standby is pretty much the term, now. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and DROP DATABASE
On Saturday 06 February 2010 17:32:43 Andres Freund wrote: > On Saturday 06 February 2010 02:25:33 Tatsuo Ishii wrote: > > Hi, > > > > While testing Hot Standby, I have encountered strange behavior with > > DROP DATABASE command. > > > > 1) connect to "test" database at standby via psql > > 2) issue DROP DATABASE test command to primary > > 3) session #1 works fine > > 4) close session #1 > > 5) "test" database dropped on standby > > > > From the manual: > > Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER > > DATABASE ... RENAME on primary will generate a log message that will > > cause all users connected to that database on the standby to be > > forcibly disconnected. This action occurs immediately, whatever the > > setting of max_standby_delay. > > > > So it seems at least the behavior is quite different from what the > > docs state. Am I missing something here? > It's a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase > > The line: > CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, > true); In case it is not clear, the reason that PROCSIG_RECOVERY_CONFLICT_TABLESPACE did not kill the session is that currently all tablespace conflicts are valid only inside a transaction, so the backend receiving the recovery conflict checks whether it is no longer inside a transaction block and continues happily. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and DROP DATABASE
On Saturday 06 February 2010 02:25:33 Tatsuo Ishii wrote: > Hi, > > While testing Hot Standby, I have encountered strange behavior with > DROP DATABASE command. > > 1) connect to "test" database at standby via psql > 2) issue DROP DATABASE test command to primary > 3) session #1 works fine > 4) close session #1 > 5) "test" database dropped on standby > > From the manual: > > Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER > DATABASE ... RENAME on primary will generate a log message that will > cause all users connected to that database on the standby to be > forcibly disconnected. This action occurs immediately, whatever the > setting of max_standby_delay. > > So it seems at least the behavior is quite different from what the > docs state. Am I missing something here? It's a small bug/typo in standby.c:ResolveRecoveryConflictWithDatabase The line: CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, true); has to be CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_DATABASE, true); Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby documentation
Joshua Tolley wrote: -- Start of PGP signed section. > Having concluded I really need to start playing with hot standby, I started > looking for documentation on the subject. I found what I was looking for; I > also found this page[1], which, it seems, ought to mention hot standby. > Comments? > > [1] http://developer.postgresql.org/pgdocs/postgres/high-availability.html Ah, I now realize it only mentions "warm" standby, not "hot", so I just updated the documentation to reflect that; you can see it here: http://momjian.us/tmp/pgsql/high-availability.html Warm and Hot Standby Using Point-In-Time Recovery (PITR) Do we want to call the feature "hot standby"? Is a read-only standby a "standby" or a "slave"? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
Simon Riggs writes: > On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote: >> I've concluded that that's too large a change to undertake for 9.0 > The purpose of this was to make the big changes in 9.0. If we aren't > going to do that it seems like we shouldn't bother at all. No, the purpose of this was to get rid of VACUUM FULL INPLACE in 9.0. I'm not interested in destabilizing the code (even more) just to avoid one small internal kluge. The proposed magic field in struct Relation is the only part of this that I'd foresee reverting later. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote: > I've concluded that that's too large a change to undertake for 9.0 The purpose of this was to make the big changes in 9.0. If we aren't going to do that it seems like we shouldn't bother at all. So why not flip back to the easier approach of making something work for HS only and then do everything you want to do in the next release? The blast radius of the half-way changes you are proposing seems high in comparison. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
Tom Lane wrote: > I've been playing around with different alternatives for solving the > problem of toast-pointer OIDs, but I keep coming back to the above as > being the least invasive and most robust answer. There are two basic > ways that we could do it: pass the OID to use to the toast logic, which > would require adding a parameter to heap_insert and a number of other > places; or add a field to struct Relation that says "when inserting a > TOAST pointer in this relation, use this OID as the toast-table OID > value in the pointer, even if that's different from what the table's OID > appears to be". The latter seems like less of a notational change, so > I'm leaning to that, but wanted to see if anyone prefers the other. > > We could avoid this hackery if there were a way for Relation structs to > point at either the old or the new physical relation (relfilenode); > then we'd not need the transient "new heap" relation during CLUSTER/VF, > which would be good for reducing catalog churn. I've concluded that > that's too large a change to undertake for 9.0, but it might be > interesting to try in the future. So I'd prefer that what we do for > now touch as little code as possible so as to be easy to revert; hence > I'm not wanting to change heap_insert's signature. I don't think any of this affects pg_migrator, but if it does, please let me know. When I hear TOAST and OID used in the same sentence, my ears perk up. :-) -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
I wrote: > * We can not change the toast rel OID of a shared catalog -- there's no > way to propagate that into the other copies of pg_class. So we need to > rejigger the logic for heap rewriting a little bit. Toast rel swapping > has to be handled by swapping their relfilenodes not their OIDs. This > is no big deal as far as cluster.c itself is concerned, but the tricky > part is that when we write new toasted values into the new toast rel, > the TOAST pointers going into the new heap have to be written with the > original toast-table OID value not the one that the transient target > toast rel has got. This is doable but it would uglify the TOAST API a > bit I think. I've been playing around with different alternatives for solving the problem of toast-pointer OIDs, but I keep coming back to the above as being the least invasive and most robust answer. There are two basic ways that we could do it: pass the OID to use to the toast logic, which would require adding a parameter to heap_insert and a number of other places; or add a field to struct Relation that says "when inserting a TOAST pointer in this relation, use this OID as the toast-table OID value in the pointer, even if that's different from what the table's OID appears to be". The latter seems like less of a notational change, so I'm leaning to that, but wanted to see if anyone prefers the other. We could avoid this hackery if there were a way for Relation structs to point at either the old or the new physical relation (relfilenode); then we'd not need the transient "new heap" relation during CLUSTER/VF, which would be good for reducing catalog churn. I've concluded that that's too large a change to undertake for 9.0, but it might be interesting to try in the future. So I'd prefer that what we do for now touch as little code as possible so as to be easy to revert; hence I'm not wanting to change heap_insert's signature. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
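To illustrate the second option, here is a toy C sketch of what a per-relcache-entry override might look like; RelStub and rd_toast_override are made-up names, not the actual PostgreSQL structures. The idea is that when the override is set on the transient rewrite target, newly formed TOAST pointers are stamped with the original table's toast-rel OID rather than the transient one.

/* Illustrative sketch only; not PostgreSQL code. */
#include <stdio.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct RelStub
{
    Oid rel_oid;            /* the relation we are physically writing into */
    Oid toast_rel_oid;      /* its own toast table */
    Oid rd_toast_override;  /* if set, OID to stamp into new TOAST pointers */
} RelStub;

/* Decide which toast-table OID a newly formed TOAST pointer should carry. */
static Oid toast_pointer_relid(const RelStub *rel)
{
    return (rel->rd_toast_override != InvalidOid)
        ? rel->rd_toast_override     /* original table's toast rel */
        : rel->toast_rel_oid;        /* normal insert path */
}

int main(void)
{
    /* transient "new heap" built by CLUSTER/VACUUM FULL */
    RelStub transient = { 70001, 70002, /* override = */ 16390 };
    /* ordinary insert into the original table */
    RelStub normal = { 16384, 16390, InvalidOid };

    printf("transient target stamps toast OID %u\n", toast_pointer_relid(&transient));
    printf("normal insert stamps toast OID %u\n", toast_pointer_relid(&normal));
    return 0;
}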
Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution
On Tue, 2010-02-02 at 20:27 +0200, Heikki Linnakangas wrote: > > I'd appreciate it if you could review the relation-specific conflict > > patch, 'cos it's still important. > > One fundamental gripe I have about that approach is that it's hard to > predict when you will be saved by the cache and when your query will be > canceled. For example, the patch stores only one "latestRemovedXid" > value per lock partition. So if you have two tables that hash to > different lock partitions, and are never both accessed in a single > transaction, the cache will save your query every time. So far so good, > but then you do a dump/restore, and the tables happen to be assigned to > the same lock partition. Oops, a system that used to work fine starts to > get "snapshot too old" errors. > > It's often better to be consistent and predictable, even if it means > cancelling more queries. I think we'd need to have a much more > fine-grained system before it's worthwhile to do deferred resolution. > There's just too much "false sharing" otherwise. ISTM that this is exactly backwards. There are already way too many false positives and this patch would reduce them very significantly. Plus the cancelation is hardly predictable since it relies on whether or not a btree delete takes place during execution and the arrival time and rate of those is sporadic. There is no safe, predictable behaviour in the current code. The gripe about the cache cannot be a fundamental one, since we can easily change the size and mechanism by which the cache operates without changing the patch very much at all. I am being told this area is a must-fix issue for this release. Tom's reaction to this issue (on another thread) illustrates that beautifully: On Sun, 2010-01-31 at 15:41 -0500, Tom Lane wrote: > Simon Riggs writes: (snip) > > 2. no matter if they haven't accessed the index being cleaned (they > > might later, is the thinking...) > > That seems seriously horrid. What is the rationale for #2 in > particular? I would hope that at worst this would affect sessions > that are actively competing for the index being cleaned. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution
Simon Riggs wrote: > On Fri, 2010-01-29 at 15:01 +, Simon Riggs wrote: > >> Putting it back takes time and >> given enough of that rare cloth, it will eventually be put back. > > Looks like I'll have time to add the starts-at-shutdown-checkpoint item > back in after all. Great! Thank you, much appreciated. > I'd appreciate it if you could review the relation-specific conflict > patch, 'cos it's still important. One fundamental gripe I have about that approach is that it's hard to predict when you will be saved by the cache and when your query will be canceled. For example, the patch stores only one "latestRemovedXid" value per lock partition. So if you have two tables that hash to different lock partitions, and are never both accessed in a single transaction, the cache will save your query every time. So far so good, but then you do a dump/restore, and the tables happen to be assigned to the same lock partition. Oops, a system that used to work fine starts to get "snapshot too old" errors. It's often better to be consistent and predictable, even if it means cancelling more queries. I think we'd need to have a much more fine-grained system before it's worthwhile to do deferred resolution. There's just too much "false sharing" otherwise. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
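A small sketch of the false-sharing concern follows; the hash function and the partition count are stand-ins, not PostgreSQL's real lock-tag hashing. It shows that whether two tables share a latestRemovedXid slot depends only on which partition their OIDs happen to hash to, and those OIDs change across a dump/restore.

/* Illustrative sketch only; not PostgreSQL code. */
#include <stdio.h>

#define NUM_LOCK_PARTITIONS 16      /* assumption for the sketch */

static unsigned partition_for(unsigned relation_oid)
{
    /* stand-in for the real lock-tag hash */
    return (relation_oid * 2654435761u) % NUM_LOCK_PARTITIONS;
}

int main(void)
{
    unsigned before_a = 16384, before_b = 16402;   /* hypothetical OIDs */
    unsigned after_a = 24576, after_b = 24592;     /* same tables after restore */

    printf("before restore: a->%u b->%u (%s)\n",
           partition_for(before_a), partition_for(before_b),
           partition_for(before_a) == partition_for(before_b) ? "shared" : "separate");
    printf("after restore:  a->%u b->%u (%s)\n",
           partition_for(after_a), partition_for(after_b),
           partition_for(after_a) == partition_for(after_b) ? "shared" : "separate");
    return 0;
}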
Re: [HACKERS] Hot Standby and VACUUM FULL
I wrote: > The design I sketched doesn't require such an assumption anyway. Once > the map file is written, the relocation is effective, commit or no. > As long as we restrict relocations to maintenance operations such as > VACUUM FULL, which have no transactionally significant results, this > doesn't seem like a problem. What we do need is that after a CLUSTER > or V.F., which is going to relocate not only the rel but its indexes, > the relocations of the rel and its indexes have to all "commit" > atomically. But saving up the transaction's map changes and applying > them in one write takes care of that. BTW, I noticed a couple of other issues that need to be dealt with to make that safe. During CLUSTER/V.F. we typically try to update the relation's relfilenode, relpages, reltuples, relfrozenxid (in setNewRelfilenode) as well as its toastrelid (in swap_relation_files). These are regular transactional updates to the pg_class tuple that will fail to commit if the outer transaction rolls back. However: * For a mapped relation, both the old and new relfilenode will be zero, so it doesn't matter. * Possibly losing the updates of relpages and reltuples is not critical. * For relfrozenxid, we can simply force the new and old values to be the same rather than hoping to advance the value, if we're dealing with a mapped relation. Or just let it be; I think that losing an advance of relfrozenxid wouldn't be critical either. * We can not change the toast rel OID of a shared catalog -- there's no way to propagate that into the other copies of pg_class. So we need to rejigger the logic for heap rewriting a little bit. Toast rel swapping has to be handled by swapping their relfilenodes not their OIDs. This is no big deal as far as cluster.c itself is concerned, but the tricky part is that when we write new toasted values into the new toast rel, the TOAST pointers going into the new heap have to be written with the original toast-table OID value not the one that the transient target toast rel has got. This is doable but it would uglify the TOAST API a bit I think. Another possibility is to treat the toast rel OID of a catalog as something that can be supplied by the map file. Not sure which way to jump. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby: Relation-specific deferred conflict resolution
On Fri, 2010-01-29 at 15:01 +, Simon Riggs wrote: > Putting it back takes time and > given enough of that rare cloth, it will eventually be put back. Looks like I'll have time to add the starts-at-shutdown-checkpoint item back in after all. I'd appreciate it if you could review the relation-specific conflict patch, 'cos it's still important. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and deadlock detection
On Mon, 2010-02-01 at 17:50 +0200, Heikki Linnakangas wrote: > Umm, so why not run the deadlock check immediately in > max_standby_delay=-1 case as well? Why is that case handled differently > from max_standby_delay>0 case? Done, tested, working. Will commit tomorrow if no further questions or comments. -- Simon Riggs www.2ndQuadrant.com *** a/src/backend/storage/ipc/procsignal.c --- b/src/backend/storage/ipc/procsignal.c *** *** 272,277 procsignal_sigusr1_handler(SIGNAL_ARGS) --- 272,280 if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT); + if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)) + RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); + if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); *** a/src/backend/storage/ipc/standby.c --- b/src/backend/storage/ipc/standby.c *** *** 127,132 WaitExceedsMaxStandbyDelay(void) --- 127,135 long delay_secs; int delay_usecs; + if (MaxStandbyDelay == -1) + return false; + /* Are we past max_standby_delay? */ TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(), &delay_secs, &delay_usecs); *** *** 351,365 ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid) * they hold one of the buffer pins that is blocking Startup process. If so, * backends will take an appropriate error action, ERROR or FATAL. * ! * A secondary purpose of this is to avoid deadlocks that might occur between ! * the Startup process and lock waiters. Deadlocks occur because if queries * wait on a lock, that must be behind an AccessExclusiveLock, which can only ! * be clared if the Startup process replays a transaction completion record. ! * If Startup process is waiting then that is a deadlock. If we allowed a ! * setting of max_standby_delay that meant "wait forever" we would then need ! * special code to protect against deadlock. Such deadlocks are rare, so the ! * code would be almost certainly buggy, so we avoid both long waits and ! * deadlocks using the same mechanism. */ void ResolveRecoveryConflictWithBufferPin(void) --- 354,368 * they hold one of the buffer pins that is blocking Startup process. If so, * backends will take an appropriate error action, ERROR or FATAL. * ! * We also check for deadlocks before we wait, though applications that cause ! * these will be extremely rare. Deadlocks occur because if queries * wait on a lock, that must be behind an AccessExclusiveLock, which can only ! * be cleared if the Startup process replays a transaction completion record. ! * If Startup process is also waiting then that is a deadlock. The deadlock ! * can occur if the query is waiting and then the Startup sleeps, or if ! * Startup is sleeping and the the query waits on a lock. We protect against ! * only the former sequence here, the latter sequence is checked prior to ! * the query sleeping, in CheckRecoveryConflictDeadlock(). */ void ResolveRecoveryConflictWithBufferPin(void) *** *** 368,378 ResolveRecoveryConflictWithBufferPin(void) Assert(InHotStandby); - /* - * Signal immediately or set alarm for later. - */ if (MaxStandbyDelay == 0) ! SendRecoveryConflictWithBufferPin(); else { TimestampTz now; --- 371,393 Assert(InHotStandby); if (MaxStandbyDelay == 0) ! { ! /* ! * We don't want to wait, so just tell everybody holding the pin to ! * get out of town. ! */ ! SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); ! } ! else if (MaxStandbyDelay == -1) ! { ! /* ! 
* Send out a request to check for buffer pin deadlocks before we wait. ! * This is fairly cheap, so no need to wait for deadlock timeout before ! * trying to send it out. ! */ ! SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); ! } else { TimestampTz now; *** *** 386,392 ResolveRecoveryConflictWithBufferPin(void) &standby_delay_secs, &standby_delay_usecs); if (standby_delay_secs >= MaxStandbyDelay) ! SendRecoveryConflictWithBufferPin(); else { TimestampTz fin_time; /* Expected wake-up time by timer */ --- 401,412 &standby_delay_secs, &standby_delay_usecs); if (standby_delay_secs >= MaxStandbyDelay) ! { ! /* ! * We're already behind, so clear a path as quickly as possible. ! */ ! SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); ! } else { TimestampTz fin_time; /* Expected wake-up time by timer */ *** *** 394,399 ResolveRecoveryConflictWithBufferPin(void) --- 414,426 int timer_delay_usecs = 0; /* + * Send out a request to check for buffer pin deadlocks b
Re: [HACKERS] Hot Standby and deadlock detection
On Mon, 2010-02-01 at 17:50 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Mon, 2010-02-01 at 09:40 +0200, Heikki Linnakangas wrote: > >> Simon Riggs wrote: > >>> The way this would work is if Startup waits on a buffer pin we > >>> immediately send out a request to all backends to cancel themselves if > >>> they are holding the buffer pin required && waiting on a lock. We then > >>> sleep until max_standby_delay. When max_standby_delay = -1 we only sleep > >>> until deadlock timeout and then check (on the Startup process). > >> Should wake up to check for deadlocks after deadlock_timeout also when > >> max_standby_delay > deadlock_timeout. max_standby_delay could be hours - > >> we want to detect a deadlock sooner than that. > > > > The patch does detect deadlocks sooner that that - "immediately", as > > described above. > > Umm, so why not run the deadlock check immediately in > max_standby_delay=-1 case as well? Why is that case handled differently > from max_standby_delay>0 case? Cos the code to do that is easy. I'll do the deadlock check immediately and make it even easier. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
On Mon, 2010-02-01 at 10:27 -0500, Tom Lane wrote: > Simon Riggs writes: > > On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote: > >> the assumption that the file is less than one disk block, > >> it should be just as atomic as pg_control updates are. > > > IIRC there were 173 relations affected by this. 4 bytes each we would > > have more than 512 bytes. > > Where in the world did you get that number? > > There are currently 29 shared relations (counting indexes), and 13 > nailed local relations, which would go into a different map file. > I'm not sure if the set of local catalogs requiring the map treatment > is exactly the same as what's presently nailed, but that's probably > a good approximation. I was suggesting that we only do shared and nailed relations. Sounds like you agree. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and deadlock detection
Simon Riggs wrote: > On Mon, 2010-02-01 at 09:40 +0200, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> The way this would work is if Startup waits on a buffer pin we >>> immediately send out a request to all backends to cancel themselves if >>> they are holding the buffer pin required && waiting on a lock. We then >>> sleep until max_standby_delay. When max_standby_delay = -1 we only sleep >>> until deadlock timeout and then check (on the Startup process). >> Should wake up to check for deadlocks after deadlock_timeout also when >> max_standby_delay > deadlock_timeout. max_standby_delay could be hours - >> we want to detect a deadlock sooner than that. > > The patch does detect deadlocks sooner that that - "immediately", as > described above. Umm, so why not run the deadlock check immediately in max_standby_delay=-1 case as well? Why is that case handled differently from max_standby_delay>0 case? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
Heikki Linnakangas writes: > Tom Lane wrote: >> The other problem with what you sketch is that it'd require holding the >> mapfile write lock across commit, because we still have to have strict >> serialization of updates. > Why is the strict serialization of updates needed? To avoid overwriting > the file with stale contents in a race condition? Exactly. > I was thinking that we only store the modified part in the WAL record. > Right after writing commit record, take the lock, read() the map file, > modify it in memory, write() it back, and release lock. > That means that there's no full images of the file in WAL records, which > makes me slightly uncomfortable from a disaster recovery point-of-view, Yeah, me too, which is another advantage of using a separate WAL entry. > That doesn't solve the problem I was trying to solve, which is that if > the map file is rewritten, but the transaction later aborts, the map > file update has hit the disk already. That's why I wanted to stash it > into the commit record. The design I sketched doesn't require such an assumption anyway. Once the map file is written, the relocation is effective, commit or no. As long as we restrict relocations to maintenance operations such as VACUUM FULL, which have no transactionally significant results, this doesn't seem like a problem. What we do need is that after a CLUSTER or V.F., which is going to relocate not only the rel but its indexes, the relocations of the rel and its indexes have to all "commit" atomically. But saving up the transaction's map changes and applying them in one write takes care of that. (What I believe this means is that pg_class and its indexes have to all be mapped, but I'm thinking right now that no other non-shared catalogs need the treatment.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot Standby and VACUUM FULL
Tom Lane wrote: > That seems too fragile to me, as I don't find it a stretch at all to > think that writing the map file might fail --- just think Windows > antivirus code :-(. Now, once we have written the WAL record for > the mapfile change, we can't really afford a failure in my approach > either. But I think a rename() after successfully creating/writing/ > fsync'ing a temp file is a whole lot safer than writing from a standing > start. My gut feeling is exactly opposite. Creating and renaming a file involves operations (and permissions) on the directory, while overwriting a small file is just a simple write(). Especially if you open() the file before doing anything irreversible. > The other problem with what you sketch is that it'd require holding the > mapfile write lock across commit, because we still have to have strict > serialization of updates. Why is the strict serialization of updates needed? To avoid overwriting the file with stale contents in a race condition? I was thinking that we only store the modified part in the WAL record. Right after writing commit record, take the lock, read() the map file, modify it in memory, write() it back, and release lock. That means that there's no full images of the file in WAL records, which makes me slightly uncomfortable from a disaster recovery point-of-view, but we could keep a backup copy of the file in the data directory or something if that's too scary otherwise. > Maybe we should forget the > rename() trick and overwrite the map file in place. I still think it > needs to be a separate WAL record though. I'm thinking > > * obtain lock > * open file for read/write > * read current contents > * construct modified contents > * write and sync WAL record > * write back file through already-opened descriptor > * fsync > * release lock > > Not totally clear if this is more or less safe than the rename method; > but given the assumption that the file is less than one disk block, > it should be just as atomic as pg_control updates are. That doesn't solve the problem I was trying to solve, which is that if the map file is rewritten, but the transaction later aborts, the map file update has hit the disk already. That's why I wanted to stash it into the commit record. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
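For reference, a minimal sketch of the create/write/fsync/rename variant being debated, with invented file names and map contents; this is not the relmapper code, and a fully durable version would also fsync the containing directory after the rename.

/* Illustrative sketch only; POSIX calls, not PostgreSQL code. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int write_map_atomically(const char *path, const void *buf, size_t len)
{
    char tmp[1024];
    int  fd;

    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len ||   /* write new contents */
        fsync(fd) != 0)                           /* force them to disk */
    {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() is atomic, so readers see either the old or the new file */
    if (rename(tmp, path) != 0)
    {
        unlink(tmp);
        return -1;
    }
    return 0;
}

int main(void)
{
    const char map[] = "oid 1259 -> relfilenode 24600\n";  /* fake entry */
    return write_map_atomically("relmap_sketch", map, sizeof(map) - 1) == 0
        ? EXIT_SUCCESS : EXIT_FAILURE;
}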
Re: [HACKERS] Hot Standby and VACUUM FULL
Simon Riggs writes: > On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote: >> the assumption that the file is less than one disk block, >> it should be just as atomic as pg_control updates are. > IIRC there were 173 relations affected by this. 4 bytes each we would > have more than 512 bytes. Where in the world did you get that number? There are currently 29 shared relations (counting indexes), and 13 nailed local relations, which would go into a different map file. I'm not sure if the set of local catalogs requiring the map treatment is exactly the same as what's presently nailed, but that's probably a good approximation. At 8 bytes each (OID + relfilenode), we could fit 64 map entries in a standard disk sector --- let's say 63 so there's room for a header. So, barring more-than-doubling of the number of shared catalogs, there's not going to be a problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
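A quick back-of-the-envelope check of that sector math, with an assumed 8-byte header; the struct layout is an illustration, not the actual file format.

/* Illustrative sketch only; not PostgreSQL code. */
#include <stdio.h>
#include <stdint.h>

typedef struct MapEntry
{
    uint32_t oid;           /* catalog OID */
    uint32_t relfilenode;   /* current physical file */
} MapEntry;                 /* 8 bytes */

#define SECTOR_SIZE 512
#define HEADER_SIZE 8       /* assumed: magic + entry count */

int main(void)
{
    size_t max_entries = (SECTOR_SIZE - HEADER_SIZE) / sizeof(MapEntry);

    /* 29 shared rels plus headroom fits comfortably in a single sector */
    printf("%zu map entries fit in one %d-byte sector\n", max_entries, SECTOR_SIZE);
    return 0;
}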