Re: Reindex "locked" standby database

2021-12-14 Thread Martín Fernández



> On 15 Dec 2021, at 00:52, Mladen Gogala  wrote:
> 
> On 12/14/21 22:37, Michael Paquier wrote:
>> You are referring to the startup process that replays WAL, right?
>> Without having an idea about the type of workload your primary and/or
>> standbys are facing, as well as an idea of the configuration you are
>> using on both (hot_standby_feedback for one), I have no direct idea,
>> but that could be a conflict caused by a concurrent vacuum.
> 
> Hi Michael,
> 
> I am preparing for a standby deployment. I don't have a standby yet and, 
> therefore, I don't have any standby problems. Would it be advisable to turn 
> vacuum off on the standby? Applying WAL will also, in theory, populate the 
> statistics which is also held in the database blocks.

Take this with grain of salt since I’m far from being an expert :) , just 
trying to help. To my knowledge, assuming you would be running a physical 
standby, vacuum operations wouldn’t run there, only in the primary. Changes 
would get propagated via physical replication to your standby (blocks that 
change due to vacuuming on the primary). 

Hope that helps.

> 
> Regards
> 
> -- 
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217
> https://dbwhisperer.wordpress.com
> 
> 
> 





Re: Reindex "locked" standby database

2021-12-14 Thread Martín Fernández
Micheal,

Thanks for much for the quick response.

> On 15 Dec 2021, at 00:37, Michael Paquier  wrote:
> 
> On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
>> The reindex went fine in the primary database and in one of our
>> standby. The other standby that we also operate for some reason
>> ended up in a state where all transactions were locked by the WAL
>> process and the WAL process was not able to make any progress. In
>> order to solve this issue we had to move traffic from the “bad”
>> standby to the healthy one and then kill all transactions that were
>> running in the “bad” standby. After that, replication was able to
>> resume successfully.
> 
> You are referring to the startup process that replays WAL, right?
That is correct, I’m talking about the startup process that replays the WAL 
files.

> Without having an idea about the type of workload your primary and/or
> standbys are facing, as well as an idea of the configuration you are
> using on both (hot_standby_feedback for one), I have no direct idea,

Primary handle IOT data ingestion. The table that we had to REINDEX gets 
updated every time a new message arrives in the system so updated are happening 
very often on that table, thus, the index/table bloat. The standby at any point 
in time would be receiving queries that would take advantage of the indexes 
that were being re indexed.  hot_standby_feedback is currently turned OFF on 
the standbys. 

> but that could be a conflict caused by a concurrent vacuum.


> 
> Seeing where things got stuck could also be useful, perhaps with a
> backtrace of the area where it happens and some information around
> it.
> 
>> I’m just trying to understand what could have caused this issue. I
>> was not able to identify any queries in the standby that would be
>> locking the WAL process. Any insight would be more than welcome!
> 
> That's not going to be easy without more information, I am afraid.
> --
> Michael



Re: Reindex "locked" standby database

2021-12-14 Thread Mladen Gogala

On 12/14/21 22:37, Michael Paquier wrote:

You are referring to the startup process that replays WAL, right?
Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,
but that could be a conflict caused by a concurrent vacuum.


Hi Michael,

I am preparing for a standby deployment. I don't have a standby yet and, 
therefore, I don't have any standby problems. Would it be advisable to 
turn vacuum off on the standby? Applying WAL will also, in theory, 
populate the statistics which is also held in the database blocks.


Regards

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com





Re: Reindex "locked" standby database

2021-12-14 Thread Michael Paquier
On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
> The reindex went fine in the primary database and in one of our
> standby. The other standby that we also operate for some reason
> ended up in a state where all transactions were locked by the WAL
> process and the WAL process was not able to make any progress. In
> order to solve this issue we had to move traffic from the “bad”
> standby to the healthy one and then kill all transactions that were
> running in the “bad” standby. After that, replication was able to
> resume successfully.

You are referring to the startup process that replays WAL, right?
Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,
but that could be a conflict caused by a concurrent vacuum.

Seeing where things got stuck could also be useful, perhaps with a
backtrace of the area where it happens and some information around
it.

> I’m just trying to understand what could have caused this issue. I
> was not able to identify any queries in the standby that would be
> locking the WAL process. Any insight would be more than welcome!

That's not going to be easy without more information, I am afraid.
--
Michael


signature.asc
Description: PGP signature


Reindex "locked" standby database

2021-12-14 Thread Martín Fernández
Hello pg hackers!

Today we had to run a `REINDEX table CONCURRENTLY my_table;` in our production 
database due to considerable index bloat. We used to deal with this problem in 
the past by using pg_repack but we stopped using it because our data 
replication tool doesn’t support “re creating” tables in the way that pg_repack 
does it (we are using Fivetran).

The reindex went fine in the primary database and in one of our standby. The 
other standby that we also operate for some reason ended up in a state where 
all transactions were locked by the WAL process and the WAL process was not 
able to make any progress. In order to solve this issue we had to move traffic 
from the “bad” standby to the healthy one and then kill all transactions that 
were running in the “bad” standby. After that, replication was able to resume 
successfully. 

I’m just trying to understand what could have caused this issue. I was not able 
to identify any queries in the standby that would be locking the WAL process. 
Any insight would be more than welcome!

We are running pg12 in our cluster and standbys are replicating using physical 
replication lots.

Best,
Martín