Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-11-17 Thread Alvaro Herrera
Alvaro Herrera wrote:
> I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
> They are pretty similar, but some adjustments were needed due to XLog
> format changes in 9.5.  (I kept most of Simon's original commit
> message.)

Finally done.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-11-16 Thread Alvaro Herrera
I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
They are pretty similar, but some adjustments were needed due to XLog
format changes in 9.5.  (I kept most of Simon's original commit
message.)

This patch has survived in master for a long time, and in released 9.6
for a couple of months now.  We have a couple of customers running with
this patch installed also (who make heavy use of their standbys),
without reported problems.  Moreover, the next minors for 9.4 and 9.5
are scheduled for February 2017, so unless some major security problem
pops up that prompts a more urgent update, we have three months to go
before this is released to the masses running 9.4 and 9.5; if a bug
crops up in the meantime, we have plenty of time to get it fixed.

While reviewing this I noticed a small thinko in the README, which I'm
going to patch in 9.6 and master thusly (with the same commit message):

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 067d15c..a3f11da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -521,11 +521,12 @@ because it allows running applications to continue while 
the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. That is because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
 HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
 ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem.  That leaves concern only for HeapTupleSatisfiesToast().
 HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
 because it doesn't need to - if the main heap row is visible then the
 toast rows will also be visible. So as long as we follow a toast

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From caaba7a1db9d9981c8f93b6dda7d33979164845d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 15 Nov 2016 22:26:19 -0300
Subject: [PATCH] Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

This is a back-patch of commits 687f2cd7a015, 3e4b7d87988f, b60284261375
by Simon Riggs, to branches 9.4 and 9.5.  No further backpatch is
possible because this depends on catalog scans being MVCC.  I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
---
 src/backend/access/nbtree/README| 22 ++
 src/backend/access/nbtree/nbtree.c  | 15 +++
 src/backend/access/nbtree/nbtxlog.c | 23 +--
 src/include/access/nbtree.h |  6 --
 4 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4820f76..997d272 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -486,6 +486,28 @@ normal running after recovery has completed. This is a key 
capability
 because it allows running applications to continue while the standby
 changes state into a normally running server.
 
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. That is because
+HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
+ever used during write transactions, which cannot exist on the standby.
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem.  That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visi

Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
>  wrote:
> > Robert Haas wrote:
> >> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
> >> >> This seems like a might subtle thing to backpatch. If we really want to
> >> >> go there, ISTM that the relevant code should stew in an unreleased
> >> >> branch for a while, before being backpatched.
> >> >
> >> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
> >> > awhile.  If it survives six months or so then we could discuss it again.
> >>
> >> I agree with Tom.
> >
> > Okay, several months have passed with this in the development branch and
> > now seems a good time to backpatch this all the way back to 9.4.
> >
> > Any objections?
> 
> Although the code has now been in the tree for six months, it's only
> been in a released branch for three weeks, which is something about
> which to think.

The objection above was "stew in an unreleased branch", which it has.

> I guess what's really bothering me is that I don't think this is a bug
> fix.  It seems like a performance optimization.  And I think that we
> generally have a policy that we don't back-patch performance
> optimizations.  Of course, there is some fuzziness because when the
> performance gets sufficiently bad, we sometimes decide that it amounts
> to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7.  Maybe this
> case qualifies for similar treatment, but I'm not sure.

I have seen a number of situations in which the standby strangely lags
behind seemingly randomly sometimes for very long periods of time,
without any apparent cause, without any possible remedy, and it makes
the standby unusable.  If the user happens to monitor the lag they may
notice it and some may decide not to run certain queries.  In other
cases the I/O load is so bad that queries that otherwise run without a
hitch are stuck for long.

In my opinion, this is a serious problem.

> Why are you talking about back-patching this to 9.4 rather than all
> supported branches?

As far as I understand this is dependent on catalog scans being MVCC,
so it cannot be backpatched any further than that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-20 Thread Robert Haas
On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
>> >> This seems like a might subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
>> > awhile.  If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>
> Any objections?

Although the code has now been in the tree for six months, it's only
been in a released branch for three weeks, which is something about
which to think.

I guess what's really bothering me is that I don't think this is a bug
fix.  It seems like a performance optimization.  And I think that we
generally have a policy that we don't back-patch performance
optimizations.  Of course, there is some fuzziness because when the
performance gets sufficiently bad, we sometimes decide that it amounts
to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7.  Maybe this
case qualifies for similar treatment, but I'm not sure.

Why are you talking about back-patching this to 9.4 rather than all
supported branches?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-19 Thread Amit Kapila
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
>> >> This seems like a might subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
>> > awhile.  If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>

Are you talking about commit -
3e4b7d87988f0835f137f15f5c1a40598dd21f3d?  If yes, do we want to
retain this code in its current form under define UNUSED, is there any
advantage of same.   Another point is that won't this commit make
information in xl_btree_vacuum redundant, so shouldn't we remove it
usage during WAL writing as well?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-19 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
> >> This seems like a might subtle thing to backpatch. If we really want to
> >> go there, ISTM that the relevant code should stew in an unreleased
> >> branch for a while, before being backpatched.
> >
> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
> > awhile.  If it survives six months or so then we could discuss it again.
> 
> I agree with Tom.

Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.

Any objections?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-01-04 Thread Robert Haas
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane  wrote:
>> This seems like a might subtle thing to backpatch. If we really want to
>> go there, ISTM that the relevant code should stew in an unreleased
>> branch for a while, before being backpatched.
>
> I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
> awhile.  If it survives six months or so then we could discuss it again.

I agree with Tom.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-01-04 Thread Tom Lane
Andres Freund  writes:
> On 2016-01-03 15:40:01 +, Simon Riggs wrote:
>> I'm happy with this being a simple patch now, not least because I would
>> like to backpatch this to 9.4 where catalog scans became MVCC.
>> 
>> A backpatch is warranted because it is a severe performance issue with
>> replication and we can fix that before 9.5 hits the streets.
>> 
>> I'll be doing some more testing and checking, so not in a rush.

> This seems like a might subtle thing to backpatch. If we really want to
> go there, ISTM that the relevant code should stew in an unreleased
> branch for a while, before being backpatched.

I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
awhile.  If it survives six months or so then we could discuss it again.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-01-04 Thread Andres Freund
On 2016-01-03 15:40:01 +, Simon Riggs wrote:
> I'm happy with this being a simple patch now, not least because I would
> like to backpatch this to 9.4 where catalog scans became MVCC.
> 
> A backpatch is warranted because it is a severe performance issue with
> replication and we can fix that before 9.5 hits the streets.
> 
> I'll be doing some more testing and checking, so not in a rush.

This seems like a might subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-01-03 Thread Simon Riggs
On 21 December 2015 at 21:36, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > I think the new comment that talks about Toast Index should explain
> > *why* we can skip the pinning in all cases except that one, instead of
> > just saying we can do it.
>
> I've not looked at that code in a long while, but my recollection is
> that it's designed that way to protect non-MVCC catalog scans, which
> are gone now ... except for SnapshotToast.


Yes, that's the principle in use here. Which means we can optimize away the
scan on a Hot Standby in all cases except for Toast Index vacuums.


> Maybe the right way to
> approach this is to adjust HeapTupleSatisfiesToast (or maybe just
> convince ourselves that no one could be dereferencing a stale toast
> pointer in the first place?) and then redesign btree vacuuming without
> the requirement to support non-MVCC scans, period.
>

We could, but its likely to be a much more complex patch.

I'm happy with this being a simple patch now, not least because I would
like to backpatch this to 9.4 where catalog scans became MVCC.

A backpatch is warranted because it is a severe performance issue with
replication and we can fix that before 9.5 hits the streets.

I'll be doing some more testing and checking, so not in a rush.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-01-03 Thread Simon Riggs
On 21 December 2015 at 21:28, Alvaro Herrera 
wrote:

> Simon Riggs wrote:
> > During VACUUM of btrees, we need to pin all pages, even those where
> tuples
> > are not removed, which I am calling here the "pin scan". This is
> especially
> > a problem during redo, where removing one tuple from a 100GB btree can
> take
> > a minute or more. That causes replication lags. Bad Thing.
>
> Agreed on there being a problem here.
>
> As a reminder, this "pin scan" is implemented in the
> HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
> in charge of replaying an action recorded by _bt_delitems_vacuum.  That
> replay works by acquiring cleanup lock on all btree blocks from
> lastBlockVacuumed to "this one" (i.e. the one in which elements are
> being removed).  In essence, this means pinning *all* the blocks in the
> index, which is bad.
> The new code sets lastBlockVacuumed to Invalid; a new test in the replay
> code makes that value not pin anything.  This is supposed to optimize
> the case in question.
>

Nice summary, thanks.


> One thing that this patch should update is the comment above struct
> xl_btree_vacuum, which currently state that EnsureBlockUnpinned() is one
> of the options to apply to each block, but that routine doesn't exist.
>

Updated


> One possible naive optimization is to avoid locking pages that aren't
> leaf pages, but that would still require fetching the block from disk,
> so it wouldn't actually optimize anything other than the cleanup lock
> acquisition.  (And probably not too useful, since leaf pages are said to
> be 95% - 99% of index pages anyway.)
>

Agreed, not worth it.


> Also of interest is the comment above the call to _bt_delitems_vacuum in
> btvacuumpage(); that needs an update too.
>

Updated


> It appears to me that your patch changes the call in btvacuumscan() but
> not the one in btvacuumpage().  I assume you should be changing both.
>

Yes, that was correct. Updated.


> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.
>

Greatly expanded comments.

Thanks for the review.

New patch attached.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


avoid_standby_pin_scan.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2015-12-21 Thread Tom Lane
Alvaro Herrera  writes:
> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.

I've not looked at that code in a long while, but my recollection is
that it's designed that way to protect non-MVCC catalog scans, which
are gone now ... except for SnapshotToast.  Maybe the right way to
approach this is to adjust HeapTupleSatisfiesToast (or maybe just
convince ourselves that no one could be dereferencing a stale toast
pointer in the first place?) and then redesign btree vacuuming without
the requirement to support non-MVCC scans, period.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding pin scan during btree vacuum

2015-12-21 Thread Alvaro Herrera
Simon Riggs wrote:
> During VACUUM of btrees, we need to pin all pages, even those where tuples
> are not removed, which I am calling here the "pin scan". This is especially
> a problem during redo, where removing one tuple from a 100GB btree can take
> a minute or more. That causes replication lags. Bad Thing.

Agreed on there being a problem here.

As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum.  That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed).  In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything.  This is supposed to optimize
the case in question.

One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently state that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.

One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition.  (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)

Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.

It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage().  I assume you should be changing both.

I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers