Re: Weird planner issue on a standby
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane wrote:
> However, that doesn't explain the downthread report that a
> VACUUM on the primary fixed it. What I suspect is that that
> caused some in-fact-dead index entries to get cleaned out.

Seems likely.

> But ... if the primary is allowed to vacuum away an index
> entry that it thinks is dead, exactly what is the point of
> making standbys ignore LP_DEAD bits? There's no additional
> interlock that guarantees the tuple will be there at all.

The interlock doesn't really protect the leaf page or its index tuples so much as the referenced TIDs themselves. In other words it's a TID recycling interlock. That's why we don't need a cleanup lock to perform index tuple deletions, even though the WAL records for those are almost identical to the WAL records used by index vacuuming (in the case of nbtree the only difference is the extra latestRemovedXid field in the deletion variant WAL record). We know that there is no VACUUM process involved, and no question of heap vacuuming going ahead for the same TIDs once index vacuuming is allowed to complete.

We can get away with not having the interlock at all in the case of nbtree index scans with MVCC snapshots -- but *not* with index-only scans. See "Making concurrent TID recycling safe" in the nbtree README. I only got around to documenting all of the details here quite recently. The index-only scan thing dates back to 9.5.

--
Peter Geoghegan
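[Editor's note: the hazard Peter describes for index-only scans can be illustrated with a toy model. This is not PostgreSQL code; all names and structures below are invented simplifications, and real VACUUM removes index entries before heap slots are recycled -- the interlock exists for scans already in flight.]

```python
# Toy model: an index-only scan trusts the index entry plus the visibility
# map and never visits the heap, so if a dead TID is vacuumed away and its
# heap slot recycled for an unrelated row while a scan still holds a stale
# index entry, the scan returns data for the wrong row. A plain MVCC index
# scan is safe because it always rechecks the heap tuple.

heap = {1: ("alice", "live"), 2: ("bob", "dead")}
index = [("alice", 1), ("bob", 2)]        # (key, TID) pairs
visibility_map = {1: True, 2: False}      # "all-visible" bit per heap slot

def plain_index_scan(key):
    """MVCC index scan: always rechecks the heap, so a recycled TID is
    caught by the visibility/key check."""
    for k, tid in index:
        if k == key and heap[tid][1] == "live" and heap[tid][0] == key:
            return heap[tid][0]
    return None

def index_only_scan(key):
    """Index-only scan: if the VM says all-visible, never looks at the heap."""
    for k, tid in index:
        if k == key:
            if visibility_map[tid]:
                return k                  # answered from the index alone
            return heap[tid][0] if heap[tid][1] == "live" else None
    return None

# Simulate VACUUM freeing dead TID 2 and the slot being recycled for a new
# row, while the stale index entry ("bob", 2) is still visible to a scan.
heap[2] = ("carol", "live")
visibility_map[2] = True

print(plain_index_scan("bob"))   # None: heap recheck rejects the new row
print(index_only_scan("bob"))    # "bob": wrong answer from the stale entry
```

This is why the nbtree README's "Making concurrent TID recycling safe" rules matter for index-only scans even though plain MVCC index scans can do without the interlock.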
Re: Weird planner issue on a standby
Peter Geoghegan writes:
> That's true, but it doesn't matter whether or not there are LP_DEAD
> bits set on the standby, since in any case they cannot be trusted when
> in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be
> set to false on the standby.

Hmm. I think that might break this argument in get_actual_variable_endpoint:

 * A crucial point here is that SnapshotNonVacuumable, with
 * GlobalVisTestFor(heapRel) as horizon, yields the inverse of the
 * condition that the indexscan will use to decide that index entries are
 * killable (see heap_hot_search_buffer()). Therefore, if the snapshot
 * rejects a tuple (or more precisely, all tuples of a HOT chain) and we
 * have to continue scanning past it, we know that the indexscan will mark
 * that index entry killed. That means that the next
 * get_actual_variable_endpoint() call will not have to re-consider that
 * index entry. In this way we avoid repetitive work when this function
 * is used a lot during planning.

However, that doesn't explain the downthread report that a VACUUM on the primary fixed it. What I suspect is that that caused some in-fact-dead index entries to get cleaned out.

But ... if the primary is allowed to vacuum away an index entry that it thinks is dead, exactly what is the point of making standbys ignore LP_DEAD bits? There's no additional interlock that guarantees the tuple will be there at all.

			regards, tom lane
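[Editor's note: the argument in the quoted comment can be sketched with a toy model. This is not PostgreSQL code -- the class and function names below are invented -- but it shows the cost asymmetry: on the primary the first endpoint lookup pays to skip dead tuples and sets kill bits, so repeated planner calls are cheap; in Hot Standby the kill bits are never trusted, so every call pays the full price.]

```python
# Toy model of get_actual_variable_endpoint()'s reliance on kill bits.
class IndexEntry:
    def __init__(self, key, alive):
        self.key = key
        self.alive = alive     # outcome of the heap visibility check
        self.killed = False    # LP_DEAD analogue

def get_endpoint(index, ignore_killed_tuples):
    """Return (max live key, heap fetches spent finding it)."""
    fetches = 0
    for entry in sorted(index, key=lambda e: e.key, reverse=True):
        if entry.killed and ignore_killed_tuples:
            continue           # skipped without a heap visit
        fetches += 1           # heap visibility check
        if entry.alive:
            return entry.key, fetches
        entry.killed = True    # kill_prior_tuple: hint for future scans
    return None, fetches

# A batch delete left 1000 dead entries above the live maximum.
index = [IndexEntry(k, False) for k in range(1, 1001)] + [IndexEntry(0, True)]

# "Primary": second call is cheap thanks to the kill bits set by the first.
_, first = get_endpoint(index, ignore_killed_tuples=True)
_, second = get_endpoint(index, ignore_killed_tuples=True)

# "Standby": ignore_killed_tuples is false, so each call re-reads them all.
for e in index:
    e.killed = False
_, a = get_endpoint(index, ignore_killed_tuples=False)
_, b = get_endpoint(index, ignore_killed_tuples=False)

print(first, second, a, b)   # 1001 1 1001 1001
```

With a 1.6TB table and a large batch delete, "1001 heap fetches per planner invocation, every time" is exactly the kind of repetitive planning cost the comment says the kill bits are meant to avoid.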
Re: Weird planner issue on a standby
On Wed, 2022-10-12 at 10:51 +0200, Guillaume Lelarge wrote:
> Just finished my phone call. So, they definitely have their performance back.
> All they did was a VACUUM on two tables.
>
> If I understand correctly, during "normal" operations, some information is
> stored on the primary and sent to standbys. For some reason, only the
> primary takes it into account; standbys ignore it. That would explain why,
> when we promoted a standby without doing anything else, it had much better
> performance. VACUUM fixes the issue on a standby, probably by storing this
> information in a different way. After VACUUM, standbys stop ignoring this
> information, which helps get the performance back.
>
> That sounds like a plausible explanation. I still have questions if you
> don't mind:
> * what is this information?
> * where is it stored? my guess would be indexes
> * why is it ignored on standbys and used on the primary?

That does indeed sound like killed (LP_DEAD) index tuples on the primary. Peter says they are ignored on the standby anyway, so on the standby PostgreSQL went through a lot of index entries pointing to dead table tuples, and it took a long time to find the maximal entry in the table, which is done by the optimizer.

VACUUM removed those dead tuples and their associated index entries on both the primary and the standby.

Yours,
Laurenz Albe
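[Editor's note: Laurenz's last sentence is the key asymmetry, and a toy sketch may make it concrete. The names below are invented, not PostgreSQL code: LP_DEAD hints may never be WAL-logged, but VACUUM's removal of dead index entries always is, so replay applies the same cleanup on every standby.]

```python
# Toy sketch: VACUUM's index cleanup travels through WAL, fixing standbys too.
primary_index = [("dead", i) for i in range(1000)] + [("live", 1000)]
standby_index = list(primary_index)   # standby starts as an exact copy
wal = []

def vacuum(index, wal_stream):
    """Remove dead index entries and emit a WAL record describing it."""
    removed = [e for e in index if e[0] == "dead"]
    index[:] = [e for e in index if e[0] != "dead"]
    wal_stream.append(("remove_index_entries", removed))

def replay(index, wal_stream):
    """Standby recovery: apply the same deletions the primary logged."""
    for rec_type, removed in wal_stream:
        if rec_type == "remove_index_entries":
            gone = set(removed)
            index[:] = [e for e in index if e not in gone]

vacuum(primary_index, wal)
replay(standby_index, wal)
print(len(primary_index), len(standby_index))   # 1 1
```

After replay, both nodes' endpoint lookups touch only the one live entry, matching the report that performance came back on the standbys right after the VACUUM.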
Re: Weird planner issue on a standby
Le mer. 12 oct. 2022 à 08:56, Guillaume Lelarge a écrit :

> Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge a écrit :
>
>> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>>
>>> On 2022-Oct-11, Tom Lane wrote:
>>>
>>> > Are there any tables in this query where extremal values of the join
>>> > key are likely to be in recently-added or recently-dead rows? Does
>>> > VACUUM'ing on the primary help?
>>>
>>> I remember having a hypothesis, upon getting a report of this exact
>>> problem on a customer system once, that it could be due to killtuple not
>>> propagating to standbys except by FPIs. I do not remember if we proved
>>> that true or not. I do not remember observing that tables were being
>>> read, however.
>>
>> Thanks for your answers.
>>
>> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
>> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
>> yesterday) are most probably recently added. I can ask my customer if you
>> want but this looks like a pretty safe bet.
>>
>> On the VACUUM question, I didn't say, but we're kind of wondering if it
>> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
>> the database (and especially on the 1.6TB table which is part of the
>> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
>> standby, it should be the same on the primary.
>
> It appears that I was wrong. I just got an email from my customer saying
> they got their performance back after a VACUUM on the two main tables of
> the query. I'll have them on the phone in about an hour. I'll probably know
> more then. Still wondering why it was an issue on the standby and not on
> the primary. VACUUM cleans up tables and indexes, and this activity goes
> through WAL, doesn't it?

Just finished my phone call. So, they definitely have their performance back. All they did was a VACUUM on two tables.

If I understand correctly, during "normal" operations, some information is stored on the primary and sent to standbys. For some reason, only the primary takes it into account; standbys ignore it. That would explain why, when we promoted a standby without doing anything else, it had much better performance. VACUUM fixes the issue on a standby, probably by storing this information in a different way. After VACUUM, standbys stop ignoring this information, which helps get the performance back.

That sounds like a plausible explanation. I still have questions if you don't mind:
* what is this information?
* where is it stored? my guess would be indexes
* why is it ignored on standbys and used on the primary?

We didn't talk much about releases, so I guess that the "standby-ignores-some-information" part currently affects all available releases?

Thank you.

--
Guillaume.
Re: Weird planner issue on a standby
Le mar. 11 oct. 2022 à 19:42, Guillaume Lelarge a écrit :

> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>
>> On 2022-Oct-11, Tom Lane wrote:
>>
>> > Are there any tables in this query where extremal values of the join
>> > key are likely to be in recently-added or recently-dead rows? Does
>> > VACUUM'ing on the primary help?
>>
>> I remember having a hypothesis, upon getting a report of this exact
>> problem on a customer system once, that it could be due to killtuple not
>> propagating to standbys except by FPIs. I do not remember if we proved
>> that true or not. I do not remember observing that tables were being
>> read, however.
>
> Thanks for your answers.
>
> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
> yesterday) are most probably recently added. I can ask my customer if you
> want but this looks like a pretty safe bet.
>
> On the VACUUM question, I didn't say, but we're kind of wondering if it
> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> the database (and especially on the 1.6TB table which is part of the
> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> standby, it should be the same on the primary.

It appears that I was wrong. I just got an email from my customer saying they got their performance back after a VACUUM on the two main tables of the query. I'll have them on the phone in about an hour. I'll probably know more then. Still wondering why it was an issue on the standby and not on the primary. VACUUM cleans up tables and indexes, and this activity goes through WAL, doesn't it?

--
Guillaume.
Re: Weird planner issue on a standby
Le mer. 12 oct. 2022 à 06:08, Ron a écrit :

> On 10/11/22 22:35, Julien Rouhaud wrote:
> > On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> >> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
> >>
> >>> On 2022-Oct-11, Tom Lane wrote:
> >>>
> >>> > Are there any tables in this query where extremal values of the join
> >>> > key are likely to be in recently-added or recently-dead rows? Does
> >>> > VACUUM'ing on the primary help?
> >>>
> >>> I remember having a hypothesis, upon getting a report of this exact
> >>> problem on a customer system once, that it could be due to killtuple not
> >>> propagating to standbys except by FPIs. I do not remember if we proved
> >>> that true or not. I do not remember observing that tables were being
> >>> read, however.
> >>
> >> Thanks for your answers.
> >>
> >> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> >> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which
> >> is yesterday) are most probably recently added. I can ask my customer if
> >> you want but this looks like a pretty safe bet.
> >>
> >> On the VACUUM question, I didn't say, but we're kind of wondering if it
> >> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> >> the database (and especially on the 1.6TB table which is part of the
> >> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> >> standby, it should be the same on the primary.
> >>
> >> Actually, there are two things that really bug me:
> >> * why the difference between the primary and both standbys?
> >> * why now? (it worked great before this weekend, and the only thing I
> >> know happened before is a batch delete on Sunday... which may be a
> >> good-enough reason for things to get screwed, but once again, why only
> >> both standbys?)
> >>
> >> Julien Rouhaud also told me about killtuples, but I have no idea what
> >> they are. I suppose this is different from dead tuples. Anyway, if you
> >> can enlighten me, I'll be happy :)
> >
> > That's an optimisation where an index scan can mark an index entry as
> > dead (LP_DEAD) if it tries to fetch some data from the heap that turns
> > out to be all dead, so further scans won't have to check again (you can
> > grep kill_prior_tuple in the source for more details). As that's a hint
> > bit, it may not be replicated unless you enable wal_log_hints or
> > data_checksums (or write it as an FPI indeed), which could explain the
> > discrepancy between the primary (after a first slow index scan) and
> > standby nodes.
> >
> > But since your customer recreated their standbys from scratch *after*
> > that delete, all the nodes should have those hint bits set (Guillaume
> > confirmed off-list that they used a fresh BASE_BACKUP). Note that
> > Guillaume also confirmed off-list that the customer has checksums
> > enabled, which means that MarkBufferDirtyHint() should be guaranteed to
> > mark the buffers as dirty, so I'm out of ideas to explain the different
> > behavior on standbys.
>
> Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show
> any differences?

No differences.

--
Guillaume.
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane wrote:
> Do we propagate visibility-map bits to standbys?

Yes.

--
Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera wrote:
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

That's true, but it doesn't matter whether or not there are LP_DEAD bits set on the standby, since in any case they cannot be trusted when in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false on the standby.

--
Peter Geoghegan
Re: Weird planner issue on a standby
Julien Rouhaud writes:
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means
> that MarkBufferDirtyHint() should be guaranteed to mark the buffers as
> dirty, so I'm out of ideas to explain the different behavior on standbys.

Do we propagate visibility-map bits to standbys?

			regards, tom lane
Re: Weird planner issue on a standby
On 10/11/22 22:35, Julien Rouhaud wrote:
> On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
>> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>>
>>> On 2022-Oct-11, Tom Lane wrote:
>>>
>>> > Are there any tables in this query where extremal values of the join
>>> > key are likely to be in recently-added or recently-dead rows? Does
>>> > VACUUM'ing on the primary help?
>>>
>>> I remember having a hypothesis, upon getting a report of this exact
>>> problem on a customer system once, that it could be due to killtuple not
>>> propagating to standbys except by FPIs. I do not remember if we proved
>>> that true or not. I do not remember observing that tables were being
>>> read, however.
>>
>> Thanks for your answers.
>>
>> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
>> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
>> yesterday) are most probably recently added. I can ask my customer if you
>> want but this looks like a pretty safe bet.
>>
>> On the VACUUM question, I didn't say, but we're kind of wondering if it
>> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
>> the database (and especially on the 1.6TB table which is part of the
>> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
>> standby, it should be the same on the primary.
>>
>> Actually, there are two things that really bug me:
>> * why the difference between the primary and both standbys?
>> * why now? (it worked great before this weekend, and the only thing I
>> know happened before is a batch delete on Sunday... which may be a
>> good-enough reason for things to get screwed, but once again, why only
>> both standbys?)
>>
>> Julien Rouhaud also told me about killtuples, but I have no idea what
>> they are. I suppose this is different from dead tuples. Anyway, if you
>> can enlighten me, I'll be happy :)
>
> That's an optimisation where an index scan can mark an index entry as dead
> (LP_DEAD) if it tries to fetch some data from the heap that turns out to
> be all dead, so further scans won't have to check again (you can grep
> kill_prior_tuple in the source for more details). As that's a hint bit, it
> may not be replicated unless you enable wal_log_hints or data_checksums
> (or write it as an FPI indeed), which could explain the discrepancy
> between the primary (after a first slow index scan) and standby nodes.
>
> But since your customer recreated their standbys from scratch *after* that
> delete, all the nodes should have those hint bits set (Guillaume confirmed
> off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
> confirmed off-list that the customer has checksums enabled, which means
> that MarkBufferDirtyHint() should be guaranteed to mark the buffers as
> dirty, so I'm out of ideas to explain the different behavior on standbys.

Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show any differences?

--
Angular momentum makes the world go 'round.
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
> Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :
>
>> On 2022-Oct-11, Tom Lane wrote:
>>
>> > Are there any tables in this query where extremal values of the join
>> > key are likely to be in recently-added or recently-dead rows? Does
>> > VACUUM'ing on the primary help?
>>
>> I remember having a hypothesis, upon getting a report of this exact
>> problem on a customer system once, that it could be due to killtuple not
>> propagating to standbys except by FPIs. I do not remember if we proved
>> that true or not. I do not remember observing that tables were being
>> read, however.
>
> Thanks for your answers.
>
> The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have
> no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is
> yesterday) are most probably recently added. I can ask my customer if you
> want but this looks like a pretty safe bet.
>
> On the VACUUM question, I didn't say, but we're kind of wondering if it
> was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on
> the database (and especially on the 1.6TB table which is part of the
> query). I'm kind of skeptical because if the VACUUM wasn't enough on the
> standby, it should be the same on the primary.
>
> Actually, there are two things that really bug me:
> * why the difference between the primary and both standbys?
> * why now? (it worked great before this weekend, and the only thing I know
> happened before is a batch delete on Sunday... which may be a good-enough
> reason for things to get screwed, but once again, why only both standbys?)
>
> Julien Rouhaud also told me about killtuples, but I have no idea what they
> are. I suppose this is different from dead tuples. Anyway, if you can
> enlighten me, I'll be happy :)

That's an optimisation where an index scan can mark an index entry as dead (LP_DEAD) if it tries to fetch some data from the heap that turns out to be all dead, so further scans won't have to check again (you can grep kill_prior_tuple in the source for more details). As that's a hint bit, it may not be replicated unless you enable wal_log_hints or data_checksums (or write it as an FPI indeed), which could explain the discrepancy between the primary (after a first slow index scan) and standby nodes.

But since your customer recreated their standbys from scratch *after* that delete, all the nodes should have those hint bits set (Guillaume confirmed off-list that they used a fresh BASE_BACKUP). Note that Guillaume also confirmed off-list that the customer has checksums enabled, which means that MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so I'm out of ideas to explain the different behavior on standbys.
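[Editor's note: Julien's replication point can be sketched with a toy model. This is not PostgreSQL code -- the functions below are invented -- but it captures the mechanism: a hint-bit-only change produces no WAL record of its own, so it reaches a standby only if something forces the page to be WAL-logged, e.g. a full-page image when wal_log_hints or data checksums make MarkBufferDirtyHint() log the first change after a checkpoint.]

```python
# Toy model of hint-bit replication via full-page images (FPIs).
def set_lp_dead(page, wal, wal_log_hints, fpi_already_logged):
    """Set a hint bit; only WAL-log the page if configuration forces it."""
    page["lp_dead"] = True
    if wal_log_hints and not fpi_already_logged:
        wal.append(dict(page))       # FPI: the whole page travels in WAL
        return True                  # FPI now logged for this checkpoint cycle
    return fpi_already_logged

def replay(wal, standby_page):
    """Standby recovery: restore each full-page image."""
    for full_page_image in wal:
        standby_page.clear()
        standby_page.update(full_page_image)

# Without wal_log_hints/checksums, the bit stays local to the primary.
primary, standby, wal = {"lp_dead": False}, {"lp_dead": False}, []
set_lp_dead(primary, wal, wal_log_hints=False, fpi_already_logged=False)
replay(wal, standby)
print(primary["lp_dead"], standby["lp_dead"])   # True False

# With them enabled, the FPI carries the hint bit to the standby.
primary, standby, wal = {"lp_dead": False}, {"lp_dead": False}, []
set_lp_dead(primary, wal, wal_log_hints=True, fpi_already_logged=False)
replay(wal, standby)
print(primary["lp_dead"], standby["lp_dead"])   # True True
```

Which is exactly why checksums being enabled at this customer made the standby behavior so puzzling: the bits should have travelled.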
Re: Weird planner issue on a standby
Le mar. 11 oct. 2022 à 18:27, Alvaro Herrera a écrit :

> On 2022-Oct-11, Tom Lane wrote:
>
> > Are there any tables in this query where extremal values of the join
> > key are likely to be in recently-added or recently-dead rows? Does
> > VACUUM'ing on the primary help?
>
> I remember having a hypothesis, upon getting a report of this exact
> problem on a customer system once, that it could be due to killtuple not
> propagating to standbys except by FPIs. I do not remember if we proved
> that true or not. I do not remember observing that tables were being
> read, however.

Thanks for your answers.

The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have no idea on the actual use of c3 but rows with c3 at '2022-10-10' (which is yesterday) are most probably recently added. I can ask my customer if you want but this looks like a pretty safe bet.

On the VACUUM question, I didn't say, but we're kind of wondering if it was lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on the database (and especially on the 1.6TB table which is part of the query). I'm kind of skeptical because if the VACUUM wasn't enough on the standby, it should be the same on the primary.

Actually, there are two things that really bug me:
* why the difference between the primary and both standbys?
* why now? (it worked great before this weekend, and the only thing I know happened before is a batch delete on Sunday... which may be a good-enough reason for things to get screwed, but once again, why only both standbys?)

Julien Rouhaud also told me about killtuples, but I have no idea what they are. I suppose this is different from dead tuples. Anyway, if you can enlighten me, I'll be happy :)

--
Guillaume.
Re: Weird planner issue on a standby
On 2022-Oct-11, Tom Lane wrote:

> Are there any tables in this query where extremal values of the join
> key are likely to be in recently-added or recently-dead rows? Does
> VACUUM'ing on the primary help?

I remember having a hypothesis, upon getting a report of this exact problem on a customer system once, that it could be due to killtuple not propagating to standbys except by FPIs. I do not remember if we proved that true or not. I do not remember observing that tables were being read, however.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Weird planner issue on a standby
Guillaume Lelarge writes:
> * The planner seems to read tables to generate the plans and/or select the
> right plan. Why does it do that? I thought it only reads indexes if needed,
> but it is quite clear it reads tables also.

I'm guessing to some extent, but it seems plausible that this is an artifact of the logic that tries to get the current min and/or max column values to be sure we are estimating mergejoin costs accurately. Normally, yeah, that is only supposed to consult the extremal values in indexes and therefore not take very long. However, we've had to go back to the drawing board several times to fix cases where it made a lot of expensive table accesses because the extremal index entries were of uncertain liveness. That behavior could very easily depend on whether you're in a primary or standby server.

Are there any tables in this query where extremal values of the join key are likely to be in recently-added or recently-dead rows? Does VACUUM'ing on the primary help?

And, not to put too fine a point on it, but does updating to a recent minor release help? In a quick scan of the git history I don't see any related bugfixes later than 11.5, but I might've missed something.

			regards, tom lane
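[Editor's note: a toy illustration of why the planner wants real extremal join-key values at all. This is my own simplification, not the planner's actual selectivity code: a merge join can stop as soon as one sorted input is exhausted, so its cost depends on what fraction of each input is actually read, which in turn depends on the other side's true maximum.]

```python
# Toy model of merge-join early termination driven by extremal key values.
def scanned_fraction(values, other_side_max):
    """Fraction of a sorted input read before its keys exceed the other
    side's maximum (after which the merge join can stop)."""
    read = sum(1 for v in values if v <= other_side_max)
    return read / len(values)

outer = list(range(0, 1_000_000))   # join keys 0 .. 999_999
inner = list(range(0, 1_000))       # join keys 0 .. 999

# With the true inner maximum (999), only 0.1% of outer needs scanning.
print(scanned_fraction(outer, max(inner)))   # 0.001

# If dead-but-not-yet-removed index entries made the apparent maximum
# 999_999, the estimate would be a full scan of outer instead.
print(scanned_fraction(outer, 999_999))      # 1.0
```

A three-orders-of-magnitude difference in the cost estimate is why it is worth a few heap fetches to confirm which extremal index entries are actually live -- and why those fetches can get expensive when many extremal entries are dead and the kill bits can't be trusted.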