On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote:
> On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
> > On 12.03.2011 12:40, Noah Misch wrote:
> >> The installation that inspired my original report recently upgraded from 
> >> 9.0.1
> >> to 9.0.3, and your fix did significantly decrease its conflict frequency.  
> >> The
> >> last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE 
> >> records.
> >> (FWIW, the index has generally been pg_attribute_relid_attnam_index.)  I've
> >> attached a test script demonstrating the behavior.  _bt_page_recyclable 
> >> approves
> >> any page deleted no more recently than RecentXmin, because we need only 
> >> ensure
> >> that every ongoing scan has witnessed the page as dead.  For the hot 
> >> standby
> >> case, we need to account for possibly-ongoing standby transactions.  Using
> >> RecentGlobalXmin covers that, albeit with some pessimism: we really only 
> >> need
> >> LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of 
> >> walsender_N)
> >> - vacuum_defer_cleanup_age.  Not sure the accounting to achieve that would 
> >> pay
> >> off, though.  Thoughts?
> >
> > Hmm, instead of bloating the master, I wonder if we could detect more  
> > accurately if there are any on-going scans, in the standby. For example,  
> > you must hold a lock on the index to scan it, so only transactions  
> > holding the lock need to be checked for conflict.
> 
> That would be nice.  Do you have an outline of an implementation in mind?

In an attempt to resuscitate this thread, here's my own shot at that.  Apologies
in advance if it's just an already-burning straw man.

I didn't see any way to take advantage of checking for the heavyweight lock that
any index scan would need to hold.  Such a check becomes obsolete the moment
it's done -- nothing stops another locker from arriving between the completion
of your check and whatever you do based on its outcome.  When a standby is in
the picture, the answer needs to be decided at xlog generation time but valid
for xlog apply time.  Therefore, I instead looked for what could be done with
buffer locks.


Regarding the underlying requirement, here is the relevant paragraph from
src/backend/access/nbtree/README:

  A deleted page can only be reclaimed once there is no scan or search that
  has a reference to it; until then, it must stay in place with its
  right-link undisturbed.  We implement this by waiting until all
  transactions that were running at the time of deletion are dead; which is
  overly strong, but is simple to implement within Postgres.  When marked
  dead, a deleted page is labeled with the next-transaction counter value.
  VACUUM can reclaim the page for re-use when this transaction number is
  older than the oldest open transaction.

These threads also have useful background as concerns hot standby:
http://archives.postgresql.org/message-id/23761.1265596...@sss.pgh.pa.us
http://archives.postgresql.org/message-id/4b7d3135.3090...@enterprisedb.com
http://archives.postgresql.org/message-id/1268123581.10620.76.camel@ebony

This is implemented in two passes.  After updating the adjacent page links, the
first pass stores ReadNewTransactionId() in the page.  The second pass finishes
the reuse if the stored xid is older than RecentXmin.  This has the benefits of
simplicity and low contention.  However, any long-running transaction delays
reuse.  Also, since RecentXmin's witness of "all transactions that were running
at the time of deletion" only includes master transactions, our tools for
preventing recovery conflicts (vacuum_defer_cleanup_age, feedback) do not defend
against conflicts arising from b-tree page reuse.  Fixing that means either
choosing different page reuse criteria or letting standby transactions also
delay reuse on the master.

If I understand correctly, "has a reference to it" implies a pin on the left,
right or parent page.  By the time _bt_pagedel is ready to mark the page
BTP_DELETED, we already hold exclusive content locks on all of those buffers.
Suppose we then check the pin count on each.  If they have only single local
pins, we have serendipitously acquired cleanup locks: set btpo.xact to
FrozenTransactionId, allowing reclamation at any time.  Otherwise, continue as
the code stands today.  Also change _bt_recycleable to use RecentGlobalXmin
instead of RecentXmin; this avoids the recovery conflicts.  Since, we hope, the
vast majority of deleted pages will get FrozenTransactionId, the marginal bloat
added by using RecentGlobalXmin will not be significant.  This also saves a
LW_SHARED acquisition of XidGenLock in the uncontended case -- probably not
significant, but it can't hurt.

When the standby replays the XLOG_BTREE_DELETE_PAGE with btpo_xact ==
FrozenTransactionId, it must wait for a cleanup lock on all the buffers it
updates.  We could actually do this anytime between the XLOG_BTREE_DELETE_PAGE
and the later XLOG_BTREE_REUSE_PAGE, but we only have the necessary information
handy when applying the first record.

Something this intrusive is clearly 9.2 material.  It would be nice to have a
backpatchable strategy to prop up vacuum_defer_cleanup_age and
hot_standby_feedback in 9.0 and 9.1.  For that, I haven't come up with anything
better than my original pair of proposals.

Comments?

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to