On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote: > On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote: > > On 12.03.2011 12:40, Noah Misch wrote: > >> The installation that inspired my original report recently upgraded from > >> 9.0.1 > >> to 9.0.3, and your fix did significantly decrease its conflict frequency. > >> The > >> last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE > >> records. > >> (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've > >> attached a test script demonstrating the behavior. _bt_page_recyclable > >> approves > >> any page deleted no more recently than RecentXmin, because we need only > >> ensure > >> that every ongoing scan has witnessed the page as dead. For the hot > >> standby > >> case, we need to account for possibly-ongoing standby transactions. Using > >> RecentGlobalXmin covers that, albeit with some pessimism: we really only > >> need > >> LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of > >> walsender_N) > >> - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would > >> pay > >> off, though. Thoughts? > > > > Hmm, instead of bloating the master, I wonder if we could detect more > > accurately if there are any on-going scans, in the standby. For example, > > you must hold a lock on the index to scan it, so only transactions > > holding the lock need to be checked for conflict. > > That would be nice. Do you have an outline of an implementation in mind?
In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man. I didn't see any way to take advantage of checking for the heavyweight lock that any index scan would need to hold. Such a check becomes obsolete the moment it's done -- nothing stops another locker from arriving between the completion of your check and whatever you do based on its outcome. When a standby is in the picture, the answer needs to be decided at xlog generation time but valid for xlog apply time. Therefore, I instead looked for what could be done with buffer locks. Regarding the underlying requirement, here is the relevant paragraph from src/backend/access/nbtree/README: A deleted page can only be reclaimed once there is no scan or search that has a reference to it; until then, it must stay in place with its right-link undisturbed. We implement this by waiting until all transactions that were running at the time of deletion are dead; which is overly strong, but is simple to implement within Postgres. When marked dead, a deleted page is labeled with the next-transaction counter value. VACUUM can reclaim the page for re-use when this transaction number is older than the oldest open transaction. These threads also have useful background as concerns hot standby: http://archives.postgresql.org/message-id/23761.1265596...@sss.pgh.pa.us http://archives.postgresql.org/message-id/4b7d3135.3090...@enterprisedb.com http://archives.postgresql.org/message-id/1268123581.10620.76.camel@ebony This is implemented in two passes. After updating the adjacent page links, the first pass stores ReadNewTransactionId() in the page. The second pass finishes the reuse if the stored xid is older than RecentXmin. This has the benefits of simplicity and low contention. However, any long-running transaction delays reuse. Also, since RecentXmin's witness of "all transactions that were running at the time of deletion" only includes master transactions, our tools for preventing recovery conflicts (vacuum_defer_cleanup_age, feedback) do not defend against conflicts arising from b-tree page reuse. Fixing that means either choosing different page reuse criteria or letting standby transactions also delay reuse on the master. If I understand correctly, "has a reference to it" implies a pin on the left, right or parent page. By the time _bt_pagedel is ready to mark the page BTP_DELETED, we already hold exclusive content locks on all of those buffers. Suppose we then check the pin count on each. If they have only single local pins, we have serendipitously acquired cleanup locks: set btpo.xact to FrozenTransactionId, allowing reclamation at any time. Otherwise, continue as the code stands today. Also change _bt_recycleable to use RecentGlobalXmin instead of RecentXmin; this avoids the recovery conflicts. Since, we hope, the vast majority of deleted pages will get FrozenTransactionId, the marginal bloat added by using RecentGlobalXmin will not be significant. This also saves a LW_SHARED acquisition of XidGenLock in the uncontended case -- probably not significant, but it can't hurt. When the standby replays the XLOG_BTREE_DELETE_PAGE with btpo_xact == FrozenTransactionId, it must wait for a cleanup lock on all the buffers it updates. We could actually do this anytime between the XLOG_BTREE_DELETE_PAGE and the later XLOG_BTREE_REUSE_PAGE, but we only have the necessary information handy when applying the first record. Something this intrusive is clearly 9.2 material. It would be nice to have a backpatchable strategy to prop up vacuum_defer_cleanup_age and hot_standby_feedback in 9.0 and 9.1. For that, I haven't come up with anything better than my original pair of proposals. Comments? Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers