On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> > On 29.11.2010 08:10, Noah Misch wrote:
> > > I have a hot_standby system and use it to bear the load of various
> > > reporting queries that take 15-60 minutes each.  In an effort to avoid
> > > long pauses in recovery, I set a vacuum_defer_cleanup_age constituting
> > > roughly three hours of the master's transactions.  Even so, I kept
> > > seeing recovery pause for the duration of a long-running query.  In
> > > each case, the culprit record was an XLOG_BTREE_DELETE arising from
> > > on-the-fly deletion of an index tuple.  The attached test script
> > > demonstrates the behavior (on HEAD); the index tuple reclamation
> > > conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
> > >
> > > Since this inserting transaction aborts, HeapTupleSatisfiesVacuum
> > > reports HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age.  We go
> > > ahead and remove the index tuples.  On the standby,
> > > btree_xlog_delete_get_latestRemovedXid does not regard the
> > > inserting-transaction outcome, so btree_redo proceeds to conflict with
> > > snapshots having visibility over that transaction.  Could we correctly
> > > improve this by teaching btree_xlog_delete_get_latestRemovedXid to
> > > ignore tuples of aborted transactions and tuples inserted and deleted
> > > within one transaction?
> 
> @Noah Easily the best bug report submitted in a long time. Thanks.
> 
> > Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need 
> > similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() 
> > could just call HeapTupleHeaderAdvanceLatestRemovedXid().
> 
> Yes, it applies to other cases also. Thanks for the suggestion.
> 
> Fix committed. Please double-check my work, committed early since I'm
> about to jump on a plane.

The installation that inspired my original report recently upgraded from 9.0.1
to 9.0.3, and your fix did significantly decrease its conflict frequency.  The
last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records.
(FWIW, the index has generally been pg_attribute_relid_attnam_index.)  I've
attached a test script demonstrating the behavior.  _bt_page_recyclable approves
any page deleted no more recently than RecentXmin, because we need only ensure
that every ongoing scan has witnessed the page as dead.  For the hot standby
case, we need to account for possibly-ongoing standby transactions.  Using
RecentGlobalXmin covers that, albeit with some pessimism: we really only need
LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N)
- vacuum_defer_cleanup_age.  Not sure the accounting to achieve that would pay
off, though.  Thoughts?
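
To make that LEAST(...) arithmetic concrete, here is a minimal standalone C
sketch (not PostgreSQL source; reuse_horizon and xid_precedes are made-up
names, and real XID wraparound handling is omitted) of how the
less-pessimistic horizon could be derived from RecentXmin, the walsenders'
xmins, and vacuum_defer_cleanup_age:

/*
 * Hypothetical sketch, not PostgreSQL source: compute
 * LEAST(RecentXmin, xmin of each walsender) - vacuum_defer_cleanup_age.
 * Types and names are stand-ins; full XID wraparound handling is omitted.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)

/* Simplified stand-in for TransactionIdPrecedes (modulo-2^32 ordering). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

static TransactionId
reuse_horizon(TransactionId recent_xmin,
			  const TransactionId *walsender_xmins, int nwalsenders,
			  uint32_t vacuum_defer_cleanup_age)
{
	TransactionId horizon = recent_xmin;
	int			i;

	/* LEAST(RecentXmin, PGPROC->xmin of each walsender) */
	for (i = 0; i < nwalsenders; i++)
		if (xid_precedes(walsender_xmins[i], horizon))
			horizon = walsender_xmins[i];

	/* Back off by vacuum_defer_cleanup_age, clamping at the first normal XID. */
	if (horizon >= FirstNormalTransactionId + vacuum_defer_cleanup_age)
		horizon -= vacuum_defer_cleanup_age;
	else
		horizon = FirstNormalTransactionId;

	return horizon;
}

int
main(void)
{
	TransactionId ws_xmins[] = {1200, 1175};

	/* least(1300, 1200, 1175) - 50 = 1125 */
	printf("%u\n", (unsigned) reuse_horizon(1300, ws_xmins, 2, 50));
	return 0;
}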

Thanks,
nm

Attachment: repro-btree-reuse.sh (Bourne shell script)

*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 678,683 **** bool
--- 678,684 ----
  _bt_page_recyclable(Page page)
  {
        BTPageOpaque opaque;
+       TransactionId xmin;
  
        /*
         * It's possible to find an all-zeroes page in an index --- for example, a
***************
*** 693,700 **** _bt_page_recyclable(Page page)
         * interested in it.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
!               TransactionIdPrecedesOrEquals(opaque->btpo.xact, RecentXmin))
                return true;
        return false;
  }
--- 694,702 ----
         * interested in it.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+       xmin = XLogStandbyInfoActive() ? RecentGlobalXmin : RecentXmin;
        if (P_ISDELETED(opaque) &&
!               TransactionIdPrecedesOrEquals(opaque->btpo.xact, xmin))
                return true;
        return false;
  }