Re: Weird planner issue on a standby
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it. What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits? There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples so much as the referenced TIDs themselves. In other words it's a TID recycling interlock. That's why we don't need a cleanup lock to perform index tuple deletions, even though the WAL records for those are almost identical to the WAL records used by index vacuuming (in the case of nbtree the only difference is the extra latestRemovedXid field in the deletion variant WAL record). We know that there is no VACUUM process involved, and no question of heap vacuuming going ahead for the same TIDs once index vacuuming is allowed to complete.

We can get away with not having the interlock at all in the case of nbtree index scans with MVCC snapshots -- but *not* with index-only scans. See "Making concurrent TID recycling safe" in the nbtree README. I only got around to documenting all of the details here quite recently. The index-only scan thing dates back to 9.5.

--
Peter Geoghegan
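[Editor's note: the hazard Peter describes for index-only scans can be illustrated with a toy model. This is not PostgreSQL code; all names and structures below are invented simplifications, and real VACUUM removes index entries before heap slots are recycled -- the interlock exists for scans already in flight.]

```python
# Toy model: an index-only scan trusts the index entry plus the visibility
# map and never visits the heap, so if a dead TID is vacuumed away and its
# heap slot recycled for an unrelated row while a scan still holds a stale
# index entry, the scan returns data for the wrong row. A plain MVCC index
# scan is safe because it always rechecks the heap tuple.

heap = {1: ("alice", "live"), 2: ("bob", "dead")}
index = [("alice", 1), ("bob", 2)]        # (key, TID) pairs
visibility_map = {1: True, 2: False}      # "all-visible" bit per heap slot

def plain_index_scan(key):
    """MVCC index scan: always rechecks the heap, so a recycled TID is
    caught by the visibility/key check."""
    for k, tid in index:
        if k == key and heap[tid][1] == "live" and heap[tid][0] == key:
            return heap[tid][0]
    return None

def index_only_scan(key):
    """Index-only scan: if the VM says all-visible, never looks at the heap."""
    for k, tid in index:
        if k == key:
            if visibility_map[tid]:
                return k                  # answered from the index alone
            return heap[tid][0] if heap[tid][1] == "live" else None
    return None

# Simulate VACUUM freeing dead TID 2 and the slot being recycled for a new
# row, while the stale index entry ("bob", 2) is still visible to a scan.
heap[2] = ("carol", "live")
visibility_map[2] = True

print(plain_index_scan("bob"))   # None: heap recheck rejects the new row
print(index_only_scan("bob"))    # "bob": wrong answer from the stale entry
```

This is why the nbtree README's "Making concurrent TID recycling safe" rules matter for index-only scans even though plain MVCC index scans can do without the interlock.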
Re: Weird planner issue on a standby
Peter Geoghegan writes:
> That's true, but it doesn't matter whether or not there are LP_DEAD
> bits set on the standby, since in any case they cannot be trusted when
> in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be
> set to false on the standby.

Hmm. I think that might break this argument in get_actual_variable_endpoint:

 * A crucial point here is that SnapshotNonVacuumable, with
 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
 * condition that the indexscan will use to decide that index entries are
 * killable (see heap_hot_search_buffer()). Therefore, if the snapshot
 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
 * have to continue scanning past it, we know that the indexscan will mark
 * that index entry killed. That means that the next
 * get_actual_variable_endpoint() call will not have to re-consider that
 * index entry. In this way we avoid repetitive work when this function
 * is used a lot during planning.

However, that doesn't explain the downthread report that a VACUUM on the primary fixed it. What I suspect is that that caused some in-fact-dead index entries to get cleaned out.

But ... if the primary is allowed to vacuum away an index entry that it thinks is dead, exactly what is the point of making standbys ignore LP_DEAD bits? There's no additional interlock that guarantees the tuple will be there at all.

			regards, tom lane
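[Editor's note: the argument in the quoted comment can be sketched with a toy model. This is not PostgreSQL code -- the class and function names below are invented -- but it shows the cost asymmetry: on the primary the first endpoint lookup pays to skip dead tuples and sets kill bits, so repeated planner calls are cheap; in Hot Standby the kill bits are never trusted, so every call pays the full price.]

```python
# Toy model of get_actual_variable_endpoint()'s reliance on kill bits.
class IndexEntry:
    def __init__(self, key, alive):
        self.key = key
        self.alive = alive     # outcome of the heap visibility check
        self.killed = False    # LP_DEAD analogue

def get_endpoint(index, ignore_killed_tuples):
    """Return (max live key, heap fetches spent finding it)."""
    fetches = 0
    for entry in sorted(index, key=lambda e: e.key, reverse=True):
        if entry.killed and ignore_killed_tuples:
            continue           # skipped without a heap visit
        fetches += 1           # heap visibility check
        if entry.alive:
            return entry.key, fetches
        entry.killed = True    # kill_prior_tuple: hint for future scans
    return None, fetches

# A batch delete left 1000 dead entries above the live maximum.
index = [IndexEntry(k, False) for k in range(1, 1001)] + [IndexEntry(0, True)]

# "Primary": second call is cheap thanks to the kill bits set by the first.
_, first = get_endpoint(index, ignore_killed_tuples=True)
_, second = get_endpoint(index, ignore_killed_tuples=True)

# "Standby": ignore_killed_tuples is false, so each call re-reads them all.
for e in index:
    e.killed = False
_, a = get_endpoint(index, ignore_killed_tuples=False)
_, b = get_endpoint(index, ignore_killed_tuples=False)

print(first, second, a, b)   # 1001 1 1001 1001
```

With a 1.6TB table and a large batch delete, "1001 heap fetches per planner invocation, every time" is exactly the kind of repetitive planning cost the comment says the kill bits are meant to avoid.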
Re: Weird planner issue on a standby
On Wed, 2022-10-12 at 10:51 +0200, Guillaume Lelarge wrote:
> Just finished my phone call. So, they definitely have their performance back.
> All they did was a VACUUM on two tables.
>
> If I understand correctly, during "normal" operations, some information is
> stored on the primary and sent to standbys. For some reason, only the
> primary takes it into account; standbys ignore it. That would explain why,
> when we promoted a standby without doing anything else, it had much better
> performance. VACUUM fixes the issue on a standby, probably by storing this
> information in a different way. After VACUUM, standbys stop ignoring this
> information, which helps get the performance back.
>
> That sounds like a plausible explanation. I still have questions if you
> don't mind:
> * what is this information?
> * where is it stored? my guess would be indexes
> * why is it ignored on standbys and used on the primary?

That does indeed sound like killed (LP_DEAD) index tuples on the primary. Peter says they are ignored on the standby anyway, so on the standby PostgreSQL went through a lot of index entries pointing to dead table tuples, and it took a long time to find the maximal entry in the table, which is done by the optimizer.

VACUUM removed those dead tuples and their associated index entries on both the primary and the standby.

Yours,
Laurenz Albe
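[Editor's note: Laurenz's last sentence is the key asymmetry, and a toy sketch may make it concrete. The names below are invented, not PostgreSQL code: LP_DEAD hints may never be WAL-logged, but VACUUM's removal of dead index entries always is, so replay applies the same cleanup on every standby.]

```python
# Toy sketch: VACUUM's index cleanup travels through WAL, fixing standbys too.
primary_index = [("dead", i) for i in range(1000)] + [("live", 1000)]
standby_index = list(primary_index)   # standby starts as an exact copy
wal = []

def vacuum(index, wal_stream):
    """Remove dead index entries and emit a WAL record describing it."""
    removed = [e for e in index if e[0] == "dead"]
    index[:] = [e for e in index if e[0] != "dead"]
    wal_stream.append(("remove_index_entries", removed))

def replay(index, wal_stream):
    """Standby recovery: apply the same deletions the primary logged."""
    for rec_type, removed in wal_stream:
        if rec_type == "remove_index_entries":
            gone = set(removed)
            index[:] = [e for e in index if e not in gone]

vacuum(primary_index, wal)
replay(standby_index, wal)
print(len(primary_index), len(standby_index))   # 1 1
```

After replay, both nodes' endpoint lookups touch only the one live entry, matching the report that performance came back on the standbys right after the VACUUM.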
Re: Weird planner issue on a standby
Le mer. 12 oct. 2022 à 08:56, Guillaume Lelarge a écrit :

> Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge a écrit :
>
>> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>>
>>> On 2022-Oct-11, Tom Lane wrote:
>>>
>>> > Are there any tables in this query where extremal values of the join
>>> > key are likely to be in recently-added or recently-dead rows? Does
>>> > VACUUM'ing on the primary help?
>>>
>>> I remember having a hypothesis, upon getting a report of this exact
>>> problem on a customer system once, that it could be due to killtuple not
>>> propagating to standbys except by FPIs. I do not remember if we proved
>>> that true or not. I do not remember observing that tables were being
>>> read, however.
>>
>> Thanks for your answers.
>>
>> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
>> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
>> yesterday) are most probably recently added. I can ask my customer if you
>> want but this looks like a pretty safe bet.
>>
>> On the VACUUM question, I didn't say, but we're kind of wondering if it
>> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
>> the database (and especially on the 1.6TB table which is part of the
>> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
>> standby, it should be the same on the primary.
>
> It appears that I was wrong. I just got an email from my customer saying
> they got their performance back after a VACUUM on the two main tables of
> the query. I'll have them on the phone in about an hour. I'll probably know
> more then. Still wondering why it was an issue on the standby and not on
> the primary. VACUUM cleans up tables and indexes, and this activity goes
> through WAL, doesn't it?

Just finished my phone call. So, they definitely have their performance back. All they did was a VACUUM on two tables.

If I understand correctly, during "normal" operations, some information is stored on the primary and sent to standbys. For some reason, only the primary takes it into account; standbys ignore it. That would explain why, when we promoted a standby without doing anything else, it had much better performance. VACUUM fixes the issue on a standby, probably by storing this information in a different way. After VACUUM, standbys stop ignoring this information, which helps get the performance back.

That sounds like a plausible explanation. I still have questions if you don't mind:
* what is this information?
* where is it stored? my guess would be indexes
* why is it ignored on standbys and used on the primary?

We didn't talk much about releases, so I guess that the "standby-ignores-some-information" part currently affects all available releases?

Thank you.

--
Guillaume.
Re: Weird planner issue on a standby
Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge a écrit :

> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>
>> On 2022-Oct-11, Tom Lane wrote:
>>
>> > Are there any tables in this query where extremal values of the join
>> > key are likely to be in recently-added or recently-dead rows? Does
>> > VACUUM'ing on the primary help?
>>
>> I remember having a hypothesis, upon getting a report of this exact
>> problem on a customer system once, that it could be due to killtuple not
>> propagating to standbys except by FPIs. I do not remember if we proved
>> that true or not. I do not remember observing that tables were being
>> read, however.
>
> Thanks for your answers.
>
> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
> yesterday) are most probably recently added. I can ask my customer if you
> want but this looks like a pretty safe bet.
>
> On the VACUUM question, I didn't say, but we're kind of wondering if it
> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> the database (and especially on the 1.6TB table which is part of the
> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> standby, it should be the same on the primary.

It appears that I was wrong. I just got an email from my customer saying they got their performance back after a VACUUM on the two main tables of the query. I'll have them on the phone in about an hour. I'll probably know more then. Still wondering why it was an issue on the standby and not on the primary. VACUUM cleans up tables and indexes, and this activity goes through WAL, doesn't it?

--
Guillaume.
Re: Weird planner issue on a standby
Le mer. 12 oct. 2022 à 06:08, Ron a écrit :

> On 10/11/22 22:35, Julien Rouhaud wrote:
> > On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> >> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
> >>
> >>> On 2022-Oct-11, Tom Lane wrote:
> >>>
> >>> > Are there any tables in this query where extremal values of the join
> >>> > key are likely to be in recently-added or recently-dead rows? Does
> >>> > VACUUM'ing on the primary help?
> >>>
> >>> I remember having a hypothesis, upon getting a report of this exact
> >>> problem on a customer system once, that it could be due to killtuple not
> >>> propagating to standbys except by FPIs. I do not remember if we proved
> >>> that true or not. I do not remember observing that tables were being
> >>> read, however.
> >>
> >> Thanks for your answers.
> >>
> >> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> >> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which
> >> is yesterday) are most probably recently added. I can ask my customer if
> >> you want but this looks like a pretty safe bet.
> >>
> >> On the VACUUM question, I didn't say, but we're kind of wondering if it
> >> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> >> the database (and especially on the 1.6TB table which is part of the
> >> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> >> standby, it should be the same on the primary.
> >>
> >> Actually, there are two things that really bug me:
> >> * why the difference between the primary and both standbys?
> >> * why now? (it worked great before this weekend, and the only thing I
> >> know happened before is a batch delete on Sunday... which may be a
> >> good-enough reason for things to get screwed, but once again, why only
> >> both standbys?)
> >>
> >> Julien Rouhaud also told me about killtuples, but I have no idea what
> >> they are. I suppose this is different from dead tuples. Anyway, if you
> >> can enlighten me, I'll be happy :)
> >
> > That's an optimisation where an index scan can mark an index entry as
> > dead (LP_DEAD) if it tries to fetch some data from the heap that turns
> > out to be all dead, so further scans won't have to check again (you can
> > grep kill_prior_tuple in the source for more details). As that's a hint
> > bit, it may not be replicated unless you enable wal_log_hints or
> > data_checksums (or write it as an FPI indeed), which could explain the
> > discrepancy between the primary (after a first slow index scan) and
> > standby nodes.
> >
> > But since your customer recreated their standbys from scratch *after*
> > that delete, all the nodes should have those hint bits set (Guillaume
> > confirmed off-list that they used a fresh BASE_BACKUP). Note that
> > Guillaume also confirmed off-list that the customer has checksums
> > enabled, which means that MarkBufferDirtyHint() should be guaranteed to
> > mark the buffers as dirty, so I'm out of ideas to explain the different
> > behavior on standbys.
>
> Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show
> any differences?

No differences.

--
Guillaume.
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane wrote:
> Do we propagate visibility-map bits to standbys?

Yes.

--
Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera wrote:
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD bits set on the standby, since in any case they cannot be trusted when in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false on the standby.

--
Peter Geoghegan
Re: Weird planner issue on a standby
Julien Rouhaud writes:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means
> that MarkBufferDirtyHint() should be guaranteed to mark the buffers as
> dirty, so I'm out of ideas to explain the different behavior on standbys.

Do we propagate visibility-map bits to standbys?

			regards, tom lane
Re: Weird planner issue on a standby
On 10/11/22 22:35, Julien Rouhaud wrote:
> On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
>> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>>
>>> On 2022-Oct-11, Tom Lane wrote:
>>>
>>> > Are there any tables in this query where extremal values of the join
>>> > key are likely to be in recently-added or recently-dead rows? Does
>>> > VACUUM'ing on the primary help?
>>>
>>> I remember having a hypothesis, upon getting a report of this exact
>>> problem on a customer system once, that it could be due to killtuple not
>>> propagating to standbys except by FPIs. I do not remember if we proved
>>> that true or not. I do not remember observing that tables were being
>>> read, however.
>>
>> Thanks for your answers.
>>
>> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
>> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
>> yesterday) are most probably recently added. I can ask my customer if you
>> want but this looks like a pretty safe bet.
>>
>> On the VACUUM question, I didn't say, but we're kind of wondering if it
>> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
>> the database (and especially on the 1.6TB table which is part of the
>> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
>> standby, it should be the same on the primary.
>>
>> Actually, there are two things that really bug me:
>> * why the difference between the primary and both standbys?
>> * why now? (it worked great before this weekend, and the only thing I
>> know happened before is a batch delete on Sunday... which may be a
>> good-enough reason for things to get screwed, but once again, why only
>> both standbys?)
>>
>> Julien Rouhaud also told me about killtuples, but I have no idea what
>> they are. I suppose this is different from dead tuples. Anyway, if you
>> can enlighten me, I'll be happy :)
>
> That's an optimisation where an index scan can mark an index entry as dead
> (LP_DEAD) if it tries to fetch some data from the heap that turns out to
> be all dead, so further scans won't have to check again (you can grep
> kill_prior_tuple in the source for more details). As that's a hint bit, it
> may not be replicated unless you enable wal_log_hints or data_checksums
> (or write it as an FPI indeed), which could explain the discrepancy
> between the primary (after a first slow index scan) and standby nodes.
>
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means
> that MarkBufferDirtyHint() should be guaranteed to mark the buffers as
> dirty, so I'm out of ideas to explain the different behavior on standbys.

Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show any differences?

--
Angular momentum makes the world go 'round.
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>
>> On 2022-Oct-11, Tom Lane wrote:
>>
>> > Are there any tables in this query where extremal values of the join
>> > key are likely to be in recently-added or recently-dead rows? Does
>> > VACUUM'ing on the primary help?
>>
>> I remember having a hypothesis, upon getting a report of this exact
>> problem on a customer system once, that it could be due to killtuple not
>> propagating to standbys except by FPIs. I do not remember if we proved
>> that true or not. I do not remember observing that tables were being
>> read, however.
>
> Thanks for your answers.
>
> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
> yesterday) are most probably recently added. I can ask my customer if you
> want but this looks like a pretty safe bet.
>
> On the VACUUM question, I didn't say, but we're kind of wondering if it
> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> the database (and especially on the 1.6TB table which is part of the
> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> standby, it should be the same on the primary.
>
> Actually, there are two things that really bug me:
> * why the difference between the primary and both standbys?
> * why now? (it worked great before this weekend, and the only thing I know
> happened before is a batch delete on Sunday... which may be a good-enough
> reason for things to get screwed, but once again, why only both standbys?)
>
> Julien Rouhaud also told me about killtuples, but I have no idea what they
> are. I suppose this is different from dead tuples. Anyway, if you can
> enlighten me, I'll be happy :)

That's an optimisation where an index scan can mark an index entry as dead (LP_DEAD) if it tries to fetch some data from the heap that turns out to be all dead, so further scans won't have to check again (you can grep kill_prior_tuple in the source for more details). As that's a hint bit, it may not be replicated unless you enable wal_log_hints or data_checksums (or write it as an FPI indeed), which could explain the discrepancy between the primary (after a first slow index scan) and standby nodes.

But since your customer recreated their standbys from scratch *after* that delete, all the nodes should have those hint bits set (Guillaume confirmed off-list that they used a fresh BASE_BACKUP). Note that Guillaume also confirmed off-list that the customer has checksums enabled, which means that MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so I'm out of ideas to explain the different behavior on standbys.
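[Editor's note: Julien's replication point can be sketched with a toy model. This is not PostgreSQL code -- the functions below are invented -- but it captures the mechanism: a hint-bit-only change produces no WAL record of its own, so it reaches a standby only if something forces the page to be WAL-logged, e.g. a full-page image when wal_log_hints or data checksums make MarkBufferDirtyHint() log the first change after a checkpoint.]

```python
# Toy model of hint-bit replication via full-page images (FPIs).
def set_lp_dead(page, wal, wal_log_hints, fpi_already_logged):
    """Set a hint bit; only WAL-log the page if configuration forces it."""
    page["lp_dead"] = True
    if wal_log_hints and not fpi_already_logged:
        wal.append(dict(page))       # FPI: the whole page travels in WAL
        return True                  # FPI now logged for this checkpoint cycle
    return fpi_already_logged

def replay(wal, standby_page):
    """Standby recovery: restore each full-page image."""
    for full_page_image in wal:
        standby_page.clear()
        standby_page.update(full_page_image)

# Without wal_log_hints/checksums, the bit stays local to the primary.
primary, standby, wal = {"lp_dead": False}, {"lp_dead": False}, []
set_lp_dead(primary, wal, wal_log_hints=False, fpi_already_logged=False)
replay(wal, standby)
print(primary["lp_dead"], standby["lp_dead"])   # True False

# With them enabled, the FPI carries the hint bit to the standby.
primary, standby, wal = {"lp_dead": False}, {"lp_dead": False}, []
set_lp_dead(primary, wal, wal_log_hints=True, fpi_already_logged=False)
replay(wal, standby)
print(primary["lp_dead"], standby["lp_dead"])   # True True
```

Which is exactly why checksums being enabled at this customer made the standby behavior so puzzling: the bits should have travelled.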
Re: Weird planner issue on a standby
Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :

> On 2022-Oct-11, Tom Lane wrote:
>
> > Are there any tables in this query where extremal values of the join
> > key are likely to be in recently-added or recently-dead rows? Does
> > VACUUM'ing on the primary help?
>
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

Thanks for your answers.

The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is yesterday) are most probably recently added. I can ask my customer if you want but this looks like a pretty safe bet.

On the VACUUM question, I didn't say, but we're kind of wondering if it was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on the database (and especially on the 1.6TB table which is part of the query). I'm kind of skeptical because if the VACUUM wasn't enough on the standby, it should be the same on the primary.

Actually, there are two things that really bug me:
* why the difference between the primary and both standbys?
* why now? (it worked great before this weekend, and the only thing I know happened before is a batch delete on Sunday... which may be a good-enough reason for things to get screwed, but once again, why only both standbys?)

Julien Rouhaud also told me about killtuples, but I have no idea what they are. I suppose this is different from dead tuples. Anyway, if you can enlighten me, I'll be happy :)

--
Guillaume.
Re: Weird planner issue on a standby
On 2022-Oct-11, Tom Lane wrote:

> Are there any tables in this query where extremal values of the join
> key are likely to be in recently-added or recently-dead rows? Does
> VACUUM'ing on the primary help?

I remember having a hypothesis, upon getting a report of this exact problem on a customer system once, that it could be due to killtuple not propagating to standbys except by FPIs. I do not remember if we proved that true or not. I do not remember observing that tables were being read, however.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Weird planner issue on a standby
Guillaume Lelarge writes:
> * The planner seems to read tables to generate the plans and/or select the
> right plan. Why does it do that? I thought it only reads indexes if needed,
> but it is quite clear it reads tables also.

I'm guessing to some extent, but it seems plausible that this is an artifact of the logic that tries to get the current min and/or max column values to be sure we are estimating mergejoin costs accurately. Normally, yeah, that is only supposed to consult the extremal values in indexes and therefore not take very long. However, we've had to go back to the drawing board several times to fix cases where it made a lot of expensive table accesses because the extremal index entries were of uncertain liveness. That behavior could very easily depend on whether you're in a primary or standby server.

Are there any tables in this query where extremal values of the join key are likely to be in recently-added or recently-dead rows? Does VACUUM'ing on the primary help?

And, not to put too fine a point on it, but does updating to a recent minor release help? In a quick scan of the git history I don't see any related bugfixes later than 11.5, but I might've missed something.

			regards, tom lane
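[Editor's note: a toy illustration of why the planner wants real extremal join-key values at all. This is my own simplification, not the planner's actual selectivity code: a merge join can stop as soon as one sorted input is exhausted, so its cost depends on what fraction of each input is actually read, which in turn depends on the other side's true maximum.]

```python
# Toy model of merge-join early termination driven by extremal key values.
def scanned_fraction(values, other_side_max):
    """Fraction of a sorted input read before its keys exceed the other
    side's maximum (after which the merge join can stop)."""
    read = sum(1 for v in values if v <= other_side_max)
    return read / len(values)

outer = list(range(0, 1_000_000))   # join keys 0 .. 999_999
inner = list(range(0, 1_000))       # join keys 0 .. 999

# With the true inner maximum (999), only 0.1% of outer needs scanning.
print(scanned_fraction(outer, max(inner)))   # 0.001

# If dead-but-not-yet-removed index entries made the apparent maximum
# 999_999, the estimate would be a full scan of outer instead.
print(scanned_fraction(outer, 999_999))      # 1.0
```

A three-orders-of-magnitude difference in the cost estimate is why it is worth a few heap fetches to confirm which extremal index entries are actually live -- and why those fetches can get expensive when many extremal entries are dead and the kill bits can't be trusted.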