Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-25 Thread SATYANARAYANA NARLAPURAM
On Sat, Jan 14, 2023 at 12:34 PM Andres Freund  wrote:

> Hi,
>
> On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > Please review the attached v2 patch further.
> >
> > I'm still unclear on the performance goals of this patch. I see that it
> > will reduce syscalls, which sounds good, but to what end?
> >
> > Does it allow a greater number of walsenders? Lower replication
> > latency? Less IO bandwidth? All of the above?
>
> One benefit would be that it'd make it more realistic to use direct IO for
> WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less clearly a
> win.
>

+1. Archive modules rely on reading the WAL files for PITR, so direct IO for
WAL still requires reading those files from disk for archival. However,
archiving with utilities like pg_receivewal can take advantage of this
patch together with direct IO for WAL.
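To illustrate the idea at a high level, here is a rough, standalone C sketch
(not the actual patch; the buffer size, names, and LSN type are all made up)
of serving a WAL read from an in-memory buffer when the requested range is
still resident, and falling back to the segment file otherwise:

/*
 * Conceptual, standalone sketch only: serve WAL reads from an in-memory
 * ring when the requested range is still resident, otherwise tell the
 * caller to fall back to the segment file on disk.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_BUF_SZ 4096           /* pretend WAL buffer pool */

static char     wal_buf[WAL_BUF_SZ];
static uint64_t buf_start_lsn;    /* first byte still kept in wal_buf */
static uint64_t buf_end_lsn;      /* one past the last byte kept */

/* Append newly generated WAL; older bytes fall out of the buffer. */
static void
wal_buffer_append(const char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        wal_buf[(buf_end_lsn + i) % WAL_BUF_SZ] = data[i];
    buf_end_lsn += len;
    if (buf_end_lsn - buf_start_lsn > WAL_BUF_SZ)
        buf_start_lsn = buf_end_lsn - WAL_BUF_SZ;
}

/* Try to satisfy a read from the buffer; return false to make the caller
 * fall back to reading the WAL segment file from disk. */
static bool
wal_read_from_buffers(uint64_t lsn, char *dst, size_t len)
{
    if (lsn < buf_start_lsn || lsn + len > buf_end_lsn)
        return false;             /* already evicted, or not written yet */
    for (size_t i = 0; i < len; i++)
        dst[i] = wal_buf[(lsn + i) % WAL_BUF_SZ];
    return true;
}

int
main(void)
{
    char record[64] = "pretend WAL record";
    char out[64];

    wal_buffer_append(record, sizeof(record));
    if (wal_read_from_buffers(0, out, sizeof(out)))
        printf("served from WAL buffers, no file read needed\n");
    else
        printf("fell back to reading the segment file\n");
    return 0;
}

The actual patch of course works against the server's shared WAL buffers
inside WALRead(); the sketch only shows the buffer-hit / file-fallback shape.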

Thanks,
Satya


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 29, 2022 at 11:20 AM SATYANARAYANA NARLAPURAM <
satyanarlapu...@gmail.com> wrote:

>
>
> On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin 
> wrote:
>
>> On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian  wrote:
>> >
>> > On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM
>> wrote:
>> > > 2. Process proc die immediately when a backend is waiting for sync
>> > > replication acknowledgement, as it does today, however, upon
>> restart,
>> > > don't open up for business (don't accept ready-only connections)
>> > > unless the sync standbys have caught up.
>> > >
>> > >
>> > > Are you planning to block connections or queries to the database? It
>> would be
>> > > good to allow connections and let them query the monitoring views but
>> block the
>> > > queries until sync standby have caught up. Otherwise, this leaves a
>> monitoring
>> > > hole. In cloud, I presume superusers are allowed to connect and
>> monitor (end
>> > > customers are not the role members and can't query the data). The
>> same can't be
>> > > true for all the installations. Could you please add more details on
>> your
>> > > approach?
>> >
>> > I think ALTER SYSTEM should be allowed, particularly so you can modify
>> > synchronous_standby_names, no?
>>
>> We don't allow SQL access during crash recovery until it's caught up
>> to consistency point. And that's for a reason - the cluster may have
>> invalid system catalog.
>> So no, after crash without a quorum of standbys you can only change
>> auto.conf and send SIGHUP. Accessing the system catalog during crash
>> recovery is another unrelated problem.
>>
>
> In the crash recovery case, catalog is inconsistent but in this case, the
> cluster has remote uncommitted changes (consistent). Accepting a superuser
> connection is no harm. The auth checks performed are still valid after
> standbys fully caught up. I don't see a reason why superuser / pg_monitor
> connections are required to be blocked.
>

If blocking queries is harder, and a superuser is not allowed to connect
because it could read remote uncommitted data, how about adding a new role
that can update and reload the server configuration?

>
>
>> But I'd propose to treat these two points differently, they possess
>> drastically different scales of danger. Query Cancels are issued here
>> and there during failovers\switchovers. Crash amidst network
>> partitioning is not that common.
>>
>
> Supportability and operability are more important in corner cases to
> quickly troubleshoot an issue,
>
>
>>
>> Best regards, Andrey Borodin.
>>
>


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 29, 2022 at 10:52 AM Andrey Borodin 
wrote:

> On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian  wrote:
> >
> > On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> > > 2. Process proc die immediately when a backend is waiting for sync
> > > replication acknowledgement, as it does today, however, upon
> restart,
> > > don't open up for business (don't accept ready-only connections)
> > > unless the sync standbys have caught up.
> > >
> > >
> > > Are you planning to block connections or queries to the database? It
> would be
> > > good to allow connections and let them query the monitoring views but
> block the
> > > queries until sync standby have caught up. Otherwise, this leaves a
> monitoring
> > > hole. In cloud, I presume superusers are allowed to connect and
> monitor (end
> > > customers are not the role members and can't query the data). The same
> can't be
> > > true for all the installations. Could you please add more details on
> your
> > > approach?
> >
> > I think ALTER SYSTEM should be allowed, particularly so you can modify
> > synchronous_standby_names, no?
>
> We don't allow SQL access during crash recovery until it's caught up
> to consistency point. And that's for a reason - the cluster may have
> invalid system catalog.
> So no, after crash without a quorum of standbys you can only change
> auto.conf and send SIGHUP. Accessing the system catalog during crash
> recovery is another unrelated problem.
>

In the crash recovery case the catalog is inconsistent, but in this case the
cluster only has locally committed changes that are not yet replicated (it is
consistent locally). Accepting a superuser connection does no harm. The auth
checks performed remain valid after the standbys have fully caught up. I
don't see a reason why superuser / pg_monitor connections need to be blocked.


> But I'd propose to treat these two points differently, they possess
> drastically different scales of danger. Query Cancels are issued here
> and there during failovers\switchovers. Crash amidst network
> partitioning is not that common.
>

Supportability and operability matter even more in corner cases like this,
so that an issue can be troubleshot quickly.


>
> Best regards, Andrey Borodin.
>


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 29, 2022 at 8:42 AM SATYANARAYANA NARLAPURAM <
satyanarlapu...@gmail.com> wrote:

>
>
> On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian  wrote:
>
>> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
>> > 2. Process proc die immediately when a backend is waiting for sync
>> > replication acknowledgement, as it does today, however, upon
>> restart,
>> > don't open up for business (don't accept ready-only connections)
>> > unless the sync standbys have caught up.
>> >
>> >
>> > Are you planning to block connections or queries to the database? It
>> would be
>> > good to allow connections and let them query the monitoring views but
>> block the
>> > queries until sync standby have caught up. Otherwise, this leaves a
>> monitoring
>> > hole. In cloud, I presume superusers are allowed to connect and monitor
>> (end
>> > customers are not the role members and can't query the data). The same
>> can't be
>> > true for all the installations. Could you please add more details on
>> your
>> > approach?
>>
>> I think ALTER SYSTEM should be allowed, particularly so you can modify
>> synchronous_standby_names, no?
>
>
> Yes, Change in synchronous_standby_names is expected in this situation.
> IMHO, blocking all the connections is not a recommended approach.
>

How about allowing superusers (they can still read locally committed data)
and members of the pg_monitor role?


>
>>
>> --
>>   Bruce Momjian  https://momjian.us
>>   EDB  https://enterprisedb.com
>>
>> Embrace your flaws.  They make you human, rather than perfect,
>> which you will never be.
>>
>


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 29, 2022 at 8:29 AM Bruce Momjian  wrote:

> On Tue, Nov 29, 2022 at 08:14:10AM -0800, SATYANARAYANA NARLAPURAM wrote:
> > 2. Process proc die immediately when a backend is waiting for sync
> > replication acknowledgement, as it does today, however, upon restart,
> > don't open up for business (don't accept ready-only connections)
> > unless the sync standbys have caught up.
> >
> >
> > Are you planning to block connections or queries to the database? It
> would be
> > good to allow connections and let them query the monitoring views but
> block the
> > queries until sync standby have caught up. Otherwise, this leaves a
> monitoring
> > hole. In cloud, I presume superusers are allowed to connect and monitor
> (end
> > customers are not the role members and can't query the data). The same
> can't be
> > true for all the installations. Could you please add more details on your
> > approach?
>
> I think ALTER SYSTEM should be allowed, particularly so you can modify
> synchronous_standby_names, no?


Yes, a change to synchronous_standby_names is expected in this situation.
IMHO, blocking all connections is not a recommended approach.


>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
> Embrace your flaws.  They make you human, rather than perfect,
> which you will never be.
>


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-11-29 Thread SATYANARAYANA NARLAPURAM
On Sun, Nov 27, 2022 at 10:33 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Mon, Nov 28, 2022 at 12:57 AM Andrey Borodin 
> wrote:
> >
> > Some funny stuff. If a user tries to cancel a non-replicated transaction
> > Azure Postgres will answer: "user requested cancel while waiting for
> > synchronous replication ack. The COMMIT record has already flushed to
> > WAL locally and might not have been replicatead to the standby. We
> > must wait here."
> > AWS RDS will answer: "ignoring request to cancel wait for synchronous
> > replication"
> > Yandex Managed Postgres will answer: "canceling wait for synchronous
> > replication due requested, but cancelation is not allowed. The
> > transaction has already committed locally and might not have been
> > replicated to the standby. We must wait here."
> >
> > So, for many services providing Postgres as a service it's only a
> > matter of wording.
>
> Thanks for verifying the behaviour. And many thanks for an off-list chat.
>
> FWIW, I'm planning to prepare a patch as per the below idea which is
> something similar to the initial proposal in this thread. Meanwhile,
> thoughts are welcome.
>
> 1. Disable query cancel/CTRL+C/SIGINT when a backend is waiting for
> sync replication acknowledgement.
>

+1


> 2. Process proc die immediately when a backend is waiting for sync
> replication acknowledgement, as it does today, however, upon restart,
> don't open up for business (don't accept ready-only connections)
> unless the sync standbys have caught up.
>

Are you planning to block connections or queries to the database? It would
be good to allow connections and let them query the monitoring views, but
block other queries until the sync standbys have caught up. Otherwise, this
leaves a monitoring hole. In the cloud, I presume superusers are allowed to
connect and monitor (end customers are not members of those roles and can't
query the data), but the same may not be true for all installations. Could
you please add more details on your approach?


>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>


Re: WIP: WAL prefetch (another approach)

2022-04-12 Thread SATYANARAYANA NARLAPURAM
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.
>
A few scenarios where I can imagine page prefetch helping:
1/ A DR replica instance that is a smaller instance size than the primary.
Page prefetch can bring pages back into memory in advance after they are
evicted, which speeds up replay and is cost effective.
2/ It allows a larger checkpoint_timeout for the same recovery SLA, and
perhaps improved performance?
3/ WAL prefetch (not the pages themselves) can improve replay by itself
(not sure if it was measured in isolation; Tomas V can comment on it).
4/ A read replica running the analytical workload scenario Tomas V mentioned
earlier.


>
> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?
>
When checkpoint_timeout is set high and there is heavy write activity, on a
read replica whose working set is larger than memory and that is receiving
constant updates from the primary. This covers 1 and 4 above.
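As a concrete illustration of the mechanism (a minimal standalone sketch, not
PostgreSQL code; the relation path and block size are placeholders), recovery
prefetch essentially amounts to issuing read-ahead hints for blocks that
upcoming WAL records will touch, so the read is already in flight when replay
reaches the record:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
prefetch_block(const char *relpath, unsigned int blkno)
{
    int fd = open(relpath, O_RDONLY);

    if (fd < 0)
        return -1;

    /* Ask the kernel to start reading this block in the background. */
    int rc = posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                           POSIX_FADV_WILLNEED);

    close(fd);
    return rc;
}

int
main(void)
{
    /* "base/16384/16385" is a made-up relation file path. */
    if (prefetch_block("base/16384/16385", 42) == 0)
        printf("prefetch hint issued\n");
    else
        printf("prefetch skipped (file not found or fadvise failed)\n");
    return 0;
}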


> --
> Simon Riggshttp://www.EnterpriseDB.com/
>
>
>


Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-04-11 Thread SATYANARAYANA NARLAPURAM
On Sun, Apr 10, 2022 at 11:16 PM Kyotaro Horiguchi 
wrote:

> Sorry for the terrible typos..
>
> At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <
> bharath.rupireddyforpostg...@gmail.com> wrote in
> > On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
> >  wrote:
> > >
> > > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier 
> wrote:
> > >> Are you referring to the pre-padding when creating a new partial
> > >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
> > >> the file is fully created?  What kind of error did you see?  I guess
> > >> that a write() with ENOSPC would be more likely, but you got a
> > >> different problem?
> > >
> > > I see two cases, 1/ when no space  is left on the device and 2/ when
> the process is taken down forcibly (a VM/container crash)
> >
> > Yeah, these cases can occur leaving uninitialized .partial files which
> > can be a problem for both pg_receivewal and pg_basebackup that uses
> > dir_open_for_write (CreateWalDirectoryMethod).
> >
> > >>   I don't disagree with improving such cases, but we
> > >> should not do things so as there is a risk of leaving behind an
> > >> infinite set of segments in case of repeated errors
> > >
> > > Do you see a problem with the proposed patch that leaves the files
> behind, at least in my testing I don't see any files left behind?
>
> I guess that Michael took this patch as creating a temp file with a
> name such like "tmp.n" every time finding an incomplete file.
>
> > With the proposed patch, it doesn't leave the unpadded .partial files.
> > Also, the v2 patch always removes a leftover .partial.temp file before
> > it creates a new one.
> >
> > >> , and partial
> > >> segments are already a kind of temporary file.
>
> I'm not sure this is true for pg_receivewal case.  The .partial file
> is not a temporary file but the current working file for the tool.
>

Correct. The idea is to make sure the file is fully allocated before
treating it as the current working file.


>
> > > if the .partial file exists with not zero-padded up to the wal segment
> size (WalSegSz), then open_walfile fails with the below error. I have two
> options here, 1/ to continue padding the existing partial file and let it
> zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought
> the latter is safe because it can handle corrupt cases as described below.
> Thoughts?
>
> I think this patch shouldn't involve pg_basebackup.  I agree to Cary
> that deleting the erroring file should be fine.
>
> We already "skipping" (complete = non-.partial) WAL files with a wrong
> size in FindStreamingStart so we can error-out with suggesting a hint.
>
> $ pg_receivewal -D xlog -p 5432 -h /tmp
> pg_receivewal: error: segment file "0001002200F5.partial" has
> incorrect size 8404992
> hint: You can continue after removing the file.
>

The idea here is to make pg_receivewal self-sufficient and reduce
human/third-party tool intervention. The ideal case is running pg_receivewal
as a service for WAL archiving.


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-08 Thread SATYANARAYANA NARLAPURAM
On Fri, Apr 8, 2022 at 6:44 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> >  wrote:
> > >
> > > Hi,
> > >
> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> > > reality, the sync/async standbys can either be closer/farther (which
> > > means sync/async standbys can receive WAL at different times) to
> > > primary, especially in cloud HA environments with primary in one
> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > > $subject may not be possible on dev systems (say, for testing some HA
> > > features) unless we can inject a delay in WAL senders before sending
> > > WAL.
>

Such a simulation would also help end customers inject faults in production
environments during availability zone/disaster recovery drills.



> > >
> > > How about having two developer-only GUCs {async,
> > > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > > delay sending WAL by {async, sync}_wal_sender_delay
> > > milliseconds/seconds? Although, I can't think of any immediate use, it
> > > will be useful someday IMO, say for features like [1], if it gets in.
> > > With this set of GUCs, one can even add core regression tests for HA
> > > features.
>

I would suggest doing this at the slot level instead of with two GUCs that
control the behavior of all the slots (physical/logical); something like
pg_suspend_replication_slot() and pg_resume_replication_slot()?
Alternatively, a GUC on the standby side instead of the primary, so that the
WAL receiver stops responding to the WAL sender? This achieves the same as
above, but the granularity is now at the individual replica level.
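A rough, standalone sketch of the slot-level idea (all names here, including
pg_suspend_replication_slot/pg_resume_replication_slot and the struct, are
hypothetical): the walsender-style loop simply checks a per-slot suspended
flag before shipping the next chunk of WAL, which simulates a "farther"
standby:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

typedef struct ReplicationSlotSim
{
    const char *name;
    bool        suspended;      /* flipped by pg_suspend_replication_slot() */
    long        sent_lsn;       /* how far we have shipped WAL */
} ReplicationSlotSim;

static void
wal_sender_loop(ReplicationSlotSim *slot, long flush_lsn)
{
    while (slot->sent_lsn < flush_lsn)
    {
        if (slot->suspended)
        {
            /* Hold WAL back while suspended, simulating network distance. */
            printf("slot %s suspended at %ld, not sending\n",
                   slot->name, slot->sent_lsn);
            sleep(1);
            continue;
        }
        slot->sent_lsn += 8192;     /* pretend we shipped one chunk */
        printf("slot %s sent up to %ld\n", slot->name, slot->sent_lsn);
    }
}

int
main(void)
{
    ReplicationSlotSim async_slot = {"async_replica_1", false, 0};

    wal_sender_loop(&async_slot, 3 * 8192);
    return 0;
}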

Thanks,
Satya


Re: Stream Replication not working

2022-01-10 Thread SATYANARAYANA NARLAPURAM
It seems there is a problem with replay on your standby: it is either too
slow or stuck behind some locks (the replay_lag of 20:38:47.00904 indicates
this, and the flush_lsn matches the LSN on the primary). Query pg_locks on
the standby to see if replay is stuck behind a lock.



On Mon, Jan 10, 2022 at 11:53 AM Allie Crawford <
crawfor...@churchofjesuschrist.org> wrote:

> Hi All,
>
> I have implemented Stream replication in one of my environments, and for
> some reason even though all the health checks are showing that the
> replication is working, when I run manual tests to see if changes are being
> replicated, the changes are not replicated to the standby postgresql
> environment. I have been researching for two day and I cannot find any
> documentation that talks about the case I am running into. I will
> appreciate if anybody could take a look at the details I have detailed
> below and give me some guidance on where the problem might be that is
> preventing my changes for being replicated. Even though I was able to
> instantiate the standby while firewalld was enabled, I decided to disable
> it just in case that it was causing any issue to the manual changes, but
> disabling firewalld has not had any effect, I am still not able to get the
> manual changes test to be replicated to the standby site. As you will see
> in the details below, the streaming is working, both sites are in sync to
> the latest WAL but for some reasons the latest changes are not on the
> standby site. How is it possible that the standby site is completely in
> sync but yet does not contain the latest changes?
>
>
>
> Thanks in advance for any help you can give me with this problem.
>
>
>
> Regards,
>
> Allie
>
>
>
> *Details:*
>
>
>
> *Master **postgresql Environment*
>
> postgresql=# select * from pg_stat_replication;
>
> -[ RECORD 1 ]----+------------------------------
> pid              | 1979089
> usesysid         | 16404
> usename          | replacct
> application_name | walreceiver
> client_addr      |
> client_hostname  |
> client_port      | 55096
> backend_start    | 2022-01-06 17:29:51.542784-07
> backend_xmin     |
> state            | streaming
> sent_lsn         | 0/35000788
> write_lsn        | 0/35000788
> flush_lsn        | 0/35000788
> replay_lsn       | 0/31000500
> write_lag        | 00:00:00.001611
> flush_lag        | 00:00:00.001693
> replay_lag       | 20:38:47.00904
> sync_priority    | 1
> sync_state       | sync
> reply_time       | 2022-01-07 14:11:58.996277-07
>
>
>
> postgresql=#
>
>
>
> postgresql=# select * from pg_roles;
>
>           rolname           | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil | rolbypassrls | rolconfig |  oid
> ----------------------------+----------+------------+---------------+-------------+-------------+----------------+--------------+-------------+---------------+--------------+-----------+-------
>  postgresql                 | t        | t          | t             | t           | t           | t              |           -1 |             |               | t            |           |    10
>  pg_monitor                 | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  3373
>  pg_read_all_settings       | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  3374
>  pg_read_all_stats          | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  3375
>  pg_stat_scan_tables        | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  3377
>  pg_read_server_files       | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  4569
>  pg_write_server_files      | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  4570
>  pg_execute_server_program  | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  4571
>  pg_signal_backend          | f        | t          | f             | f           | f           | f              |           -1 |             |               | f            |           |  4200
>  replacct                   | t        | t          | t             | t           | t           | t              |           -1 |             |               | t            |           | 16404
> (10 rows)
>
>
>
> postgresql=#
>
>
>
> postgresql=# 

Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

2022-01-08 Thread SATYANARAYANA NARLAPURAM
On Fri, Jan 7, 2022 at 4:52 PM Jeff Davis  wrote:

> On Fri, 2022-01-07 at 14:54 -0800, Andres Freund wrote:
> > > If you only promote the furthest-ahead sync replica (which is what
> > > you
> > > should be doing if you have quorum commit), why wouldn't that work?
> >
> > Remove "sync" from the above sentence, and the sentence holds true
> > for
> > combinations of sync/async replicas as well.
>
> Technically that's true, but it seems like a bit of a strange use case.
> I would think people doing that would just include those async replicas
> in the sync quorum instead.
>
> The main case I can think of for a mix of sync and async replicas are
> if they are just managed differently. For instance, the sync replica
> quorum is managed for a core part of the system, strategically
> allocated on good hardware in different locations to minimize the
> chance of dependent failures; while the async read replicas are
> optional for taking load off the primary, and may appear/disappear in
> whatever location and on whatever hardware is most convenient.
>
> But if an async replica can get ahead of the sync rep quorum, then the
> most recent transactions can appear in query results, so that means the
> WAL shouldn't be lost, and the async read replicas become a part of the
> durability model.
>
> If the async read replica can't be promoted because it's not suitable
> (due to location, hardware, whatever), then you need to frantically
> copy the final WAL records out to an instance in the sync rep quorum.
> That requires extra ceremony for every failover, and might be dubious
> depending on how safe the WAL on your async read replicas is, and
> whether there are dependent failure risks.
>

This may not even always be possible, as described in the scenario below.


>
> Yeah, I guess there could be some use case woven amongst those caveats,
> but I'm not sure if anyone is actually doing that combination of things
> safely today. If someone is, it would be interesting to know more about
> that use case.
>
> The proposal in this thread is quite a bit simpler: manage your sync
> quorum and your async read replicas separately, and keep the sync rep
> quorum ahead.
>
> > > > To me this just sounds like trying to shoehorn something into
> > > > syncrep
> > > > that
> > > > it's not made for.
> > >
> > > What *is* sync rep made for?
>
> This was a sincere question and an answer would be helpful. I think
> many of the discussions about sync rep get derailed because people have
> different ideas about when and how it should be used, and the
> documentation is pretty light.
>
> > This is a especially relevant in cases where synchronous_commit=on vs
> > local is
> > used selectively
>
> That's an interesting point.
>
> However, it's hard for me to reason about "kinda durable" and "a little
> more durable" and I'm not sure how many people would care about that
> distinction.
>
> > I don't see that. This presumes that WAL replicated to async replicas
> > is
> > somehow bad.
>
> Simple case: primary and async read replica are in the same server
> rack. Sync replicas are geographically distributed with quorum commit.
> Read replica gets the WAL first (because it's closest), starts
> answering queries that include that WAL, and then the entire rack
> catches fire. Now you've returned results to the client, but lost the
> transactions.
>

Another similar example: in a multi-AZ HA setup, the primary and sync
replica are deployed in two different availability zones, the async read
replicas can be in any availability zone, and an async replica happens to
land in the same AZ as the primary. If the primary's availability zone goes
down, both the primary and that async replica go down at the same time. The
async replica could be ahead of the sync replica, and the WAL can't be
collected because the primary and the async replica failed together.


>
> Regards,
> Jeff Davis
>
>
>


Re: Logging replication state changes

2022-01-08 Thread SATYANARAYANA NARLAPURAM
On Sat, Jan 8, 2022 at 3:26 AM Amit Kapila  wrote:

> On Thu, Dec 30, 2021 at 4:18 AM SATYANARAYANA NARLAPURAM
>  wrote:
> >
> > On Wed, Dec 29, 2021 at 2:04 PM Tom Lane  wrote:
> >>
> >> SATYANARAYANA NARLAPURAM  writes:
> >> > I noticed that below critical replication state changes are DEBUG1
> level
> >> > logged. Any concern about changing the below two messages log level
> to LOG?
> >>
> >> Why?  These seem like perfectly routine messages.
> >
> >
> > Consider a scenario where we have a primary and two sync standby (s1 and
> s2) where s1 is a preferred failover target and s2 is next with
> synchronous_standby_names = 'First 1 ('s1','s2')'.  In an event, s1
> streaming replication is broken and reestablished because of a planned or
> an unplanned event then s2 participates in the sync commits and makes sure
> the writes are not stalled on the primary. I would like to know the time
> window where s1 is not actively acknowledging the commits and the writes
> are dependent on s2. Also if the service layer decides to failover to s2
> instead of s1 because s1 is lagging I need evidence in the log to explain
> the behavior.
> >
>
> Isn't it better to get this information via pg_stat_replication view
> (via state and sync_priority) columns?
>

We need the historical information to analyze and root-cause issues, in
addition to live debugging. It would be good to have better control over the
logging of replication state changes.


>
> --
> With Regards,
> Amit Kapila.
>


Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

2022-01-07 Thread SATYANARAYANA NARLAPURAM
On Fri, Jan 7, 2022 at 12:27 AM Kyotaro Horiguchi 
wrote:

> At Thu, 6 Jan 2022 23:55:01 -0800, SATYANARAYANA NARLAPURAM <
> satyanarlapu...@gmail.com> wrote in
> > On Thu, Jan 6, 2022 at 11:24 PM Jeff Davis  wrote:
> >
> > > On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
> > > > I would like to propose a GUC send_Wal_after_quorum_committed which
> > > > when set to ON, walsenders corresponds to async standbys and logical
> > > > replication workers wait until the LSN is quorum committed on the
> > > > primary before sending it to the standby. This not only simplifies
> > > > the post failover steps but avoids unnecessary downtime for the async
> > > > replicas. Thoughts?
> > >
> > > Do we need a GUC? Or should we just always require that sync rep is
> > > satisfied before sending to async replicas?
> > >
> >
> > I proposed a GUC to not introduce a behavior change by default. I have no
> > strong opinion on having a GUC or making the proposed behavior default,
> > would love to get others' perspectives as well.
> >
> >
> > >
> > > It feels like the sync quorum should always be ahead of the async
> > > replicas. Unless I'm missing a use case, or there is some kind of
> > > performance gotcha.
> > >
> >
> > I couldn't think of a case that can cause serious performance issues but
> > will run some experiments on this and post the numbers.
>
> I think Jeff is saying that "quorum commit" already by definition
> means that all out-of-quorum standbys are behind of the
> quorum-standbys.  I agree to that in a dictionary sense. But I can
> think of the case where the response from the top-runner standby
> vanishes or gets caught somewhere on network for some reason. In that
> case the primary happily checks quorum ignoring the top-runner.
>
> To avoid that misdecision, I can guess two possible "solutions".
>
> One is to serialize WAL sending (of course it is unacceptable at all)
> or aotehr is to send WAL to all standbys at once then make the
> decision after making sure receiving replies from all standbys (this
> is no longer quorum commit in another sense..)
>

There is no need to serialize sending the WAL among the sync standbys. The
only serialization required is to send first to all the sync replicas and
then to the async replicas, if any. Once an LSN is quorum committed, no
failover subsystem will initiate an automatic failover in a way that loses
that LSN (data loss).

>
> So I'm afraid that there's no sensible solution to avoid the
> hiding-forerunner problem on quorum commit.
>

Could you elaborate on the problem here?


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

2022-01-06 Thread SATYANARAYANA NARLAPURAM
On Thu, Jan 6, 2022 at 11:24 PM Jeff Davis  wrote:

> On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote:
> > I would like to propose a GUC send_Wal_after_quorum_committed which
> > when set to ON, walsenders corresponds to async standbys and logical
> > replication workers wait until the LSN is quorum committed on the
> > primary before sending it to the standby. This not only simplifies
> > the post failover steps but avoids unnecessary downtime for the async
> > replicas. Thoughts?
>
> Do we need a GUC? Or should we just always require that sync rep is
> satisfied before sending to async replicas?
>

I proposed a GUC so as not to introduce a behavior change by default. I have
no strong opinion on having a GUC versus making the proposed behavior the
default; I would love to get others' perspectives as well.


>
> It feels like the sync quorum should always be ahead of the async
> replicas. Unless I'm missing a use case, or there is some kind of
> performance gotcha.
>

I couldn't think of a case that can cause serious performance issues but
will run some experiments on this and post the numbers.



>
> Regards,
> Jeff Davis
>
>
>


Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers

2022-01-06 Thread SATYANARAYANA NARLAPURAM
Consider a cluster formation where we have a primary (P), a sync replica
(S1), and multiple async replicas for disaster recovery and read scaling
(within the region and outside the region). In this setup, S1 is the
preferred failover target in the event of a primary failure. When a
transaction is committed on the primary, it is not acknowledged to the
client until the primary gets an acknowledgment from the sync standby that
the WAL is flushed to disk (assume synchronous_commit is set so that commit
waits for the remote flush). However, the walsenders on the primary that
correspond to the async replicas (and to any logical replication/decoding
clients) don't wait for that flush acknowledgment and send the WAL to the
async standbys right away. So it is possible for the async replicas and
logical clients to get ahead of the sync replica. If a failover is initiated
in such a scenario, to bring the formation back into a healthy state we have
to either:

   1. run pg_rewind on the async replicas so they can reconnect with the
   new primary, or
   2. collect the latest WAL across the replicas and feed it to the standby.

Both of these operations are involved, error prone, and can cause multiple
minutes of downtime if done manually. In addition, there is a window where
the async replicas can show data that was neither acknowledged to the client
nor committed on the standby. Logical clients, if they are ahead, may need
to reseed the data since there is no easy rewind option for them.

I would like to propose a GUC, send_wal_after_quorum_committed, which when
set to ON makes the walsenders that correspond to async standbys, as well as
logical replication workers, wait until the LSN is quorum committed on the
primary before sending it to the standby. This not only simplifies the
post-failover steps but also avoids unnecessary downtime for the async
replicas. Thoughts?
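To make the proposal concrete, here is a minimal standalone sketch
(simplified types and names, not the actual backend code) of computing the
quorum-committed LSN from the sync standbys' flush positions and capping
what the async walsenders may send:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtrSim;

static int
cmp_desc(const void *a, const void *b)
{
    XLogRecPtrSim x = *(const XLogRecPtrSim *) a;
    XLogRecPtrSim y = *(const XLogRecPtrSim *) b;

    return (x < y) - (x > y);   /* sort descending */
}

/* LSN flushed by at least num_sync of the sync standbys. */
static XLogRecPtrSim
quorum_committed_lsn(XLogRecPtrSim *flush_lsns, int nstandbys, int num_sync)
{
    qsort(flush_lsns, nstandbys, sizeof(XLogRecPtrSim), cmp_desc);
    return flush_lsns[num_sync - 1];
}

int
main(void)
{
    /* Flush LSNs reported by three sync standbys; quorum is "ANY 2". */
    XLogRecPtrSim flush[] = {5000, 4200, 3900};
    XLogRecPtrSim local_flush = 6000;   /* what the primary has flushed */

    XLogRecPtrSim quorum = quorum_committed_lsn(flush, 3, 2);
    XLogRecPtrSim async_send_upto = quorum < local_flush ? quorum : local_flush;

    printf("async walsenders may send up to %llu (local flush %llu)\n",
           (unsigned long long) async_send_upto,
           (unsigned long long) local_flush);
    return 0;
}

The sync replication code already computes this kind of position when
releasing commit waiters, so the change is mainly to make the async and
logical walsenders wait for it before sending.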

Thanks,
Satya




On Sun, Dec 5, 2021 at 8:35 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> It looks like the logical replication subscribers are receiving the
> quorum uncommitted transactions even before the synchronous (sync)
> standbys. Most of the times it is okay, but it can be a problem if the
> primary goes down/crashes (while the primary is in SyncRepWaitForLSN)
> before the quorum commit is achieved (i.e. before the sync standbys
> receive the committed txns from the primary) and the failover is to
> happen on to the sync standby. The subscriber would have received the
> quorum uncommitted txns whereas the sync standbys didn't. After the
> failover, the new primary (the old sync standby) would be behind the
> subscriber i.e. the subscriber will be seeing the data that the new
> primary can't. Is there a way the subscriber can get back  to be in
> sync with the new primary? In other words, can we reverse the effects
> of the quorum uncommitted txns on the subscriber? Naive way is to do
> it manually, but it doesn't seem to be elegant.
>
> We have performed a small experiment to observe the above behaviour
> with 1 primary, 1 sync standby and 1 subscriber:
> 1) Have a wait loop in SyncRepWaitForLSN (a temporary hack to
> illustrate the standby receiving the txn a bit late or fail to
> receive)
> 2) Insert data into a table on the primary
> 3) The primary waits i.e. the insert query hangs (because of the wait
> loop hack ()) before the local txn is sent to the sync standby,
> whereas the subscriber receives the inserted data.
> 4) If the primary crashes/goes down and unable to come up, if the
> failover happens to sync standby (which didn't receive the data that
> got inserted on tbe primary), the subscriber would see the data that
> the sync standby can't.
>
> This looks to be a problem. A possible solution is to let the
> subscribers receive the txns only after the primary achieves quorum
> commit (gets out of the SyncRepWaitForLSN or after all sync standbys
> received the txns). The logical replication walsenders can wait until
> the quorum commit is obtained and then can send the WAL. A new GUC can
> be introduced to control this, default being the current behaviour.
>
> Thoughts?
>
> Thanks Satya (cc-ed) for the use-case and off-list discussion.
>
> Regards,
> Bharath Rupireddy.
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2022-01-05 Thread SATYANARAYANA NARLAPURAM
On Wed, Jan 5, 2022 at 10:05 PM Dilip Kumar  wrote:

> On Thu, Jan 6, 2022 at 11:27 AM SATYANARAYANA NARLAPURAM
>  wrote:
>
> > On Wed, Jan 5, 2022 at 9:46 AM Andres Freund  wrote:
> >>
> >> Hi,
> >>
> >> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> >> > That's pretty much the same - XLogInsert() can trigger an
> >> > XLogWrite()/Flush().
> >> >
> >> > I think it's a complete no-go to add throttling to these places. It's
> quite
> >> > possible that it'd cause new deadlocks, and it's almost guaranteed to
> have
> >> > unintended consequences (e.g. replication falling back further because
> >> > XLogFlush() is being throttled).
> >>
> >> I thought of another way to implement this feature. What if we checked
> the
> >> current distance somewhere within XLogInsert(), but only set
> >> InterruptPending=true, XLogDelayPending=true. Then in
> ProcessInterrupts() we
> >> check if XLogDelayPending is true and sleep the appropriate time.
> >>
> >> That way the sleep doesn't happen with important locks held / within a
> >> critical section, but we still delay close to where we went over the
> maximum
> >> lag. And the overhead should be fairly minimal.
> >
> >
> > +1 to the idea, this way we can fairly throttle large and smaller
> transactions the same way. I will work on this model and share the patch.
> Please note that the lock contention still exists in this case.
>
> Generally while checking for the interrupt we should not be holding
> any lwlock that means we don't have the risk of holding any buffer
> locks, so any other reader can continue to read from those buffers.
> We will only be holding some heavyweight locks like relation/tuple
> lock but that will not impact anyone except the writers trying to
> update the same tuple or the DDL who want to modify the table
> definition so I don't think we have any issue here because anyway we
> don't want other writers to continue.
>

Yes, it should be OK. I was just making it explicit, referring back to
Andres' earlier comment about holding heavyweight locks while waiting before
the commit.

>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2022-01-05 Thread SATYANARAYANA NARLAPURAM
On Wed, Jan 5, 2022 at 9:46 AM Andres Freund  wrote:

> Hi,
>
> On 2021-12-29 11:31:51 -0800, Andres Freund wrote:
> > That's pretty much the same - XLogInsert() can trigger an
> > XLogWrite()/Flush().
> >
> > I think it's a complete no-go to add throttling to these places. It's
> quite
> > possible that it'd cause new deadlocks, and it's almost guaranteed to
> have
> > unintended consequences (e.g. replication falling back further because
> > XLogFlush() is being throttled).
>
> I thought of another way to implement this feature. What if we checked the
> current distance somewhere within XLogInsert(), but only set
> InterruptPending=true, XLogDelayPending=true. Then in ProcessInterrupts()
> we
> check if XLogDelayPending is true and sleep the appropriate time.
>
> That way the sleep doesn't happen with important locks held / within a
> critical section, but we still delay close to where we went over the
> maximum
> lag. And the overhead should be fairly minimal.
>

+1 to the idea; this way we can throttle large and small transactions in the
same, fair way. I will work on this model and share the patch. Please note
that the lock contention still exists in this case.
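Here is a standalone sketch of that flow (the flag and function names mirror
the description above but everything here is simplified illustration): the
insert path only sets the pending flags, and the sleep happens later at a
safe interrupt-processing point, outside critical sections and buffer locks:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool InterruptPending = false;
static volatile bool XLogDelayPending = false;

#define MAX_REPLICA_LAG_BYTES (16 * 1024 * 1024)

/* Called from the insert path: a cheap check only, never sleeps here. */
static void
xlog_insert_check_lag(long replica_lag_bytes)
{
    if (replica_lag_bytes > MAX_REPLICA_LAG_BYTES)
    {
        XLogDelayPending = true;
        InterruptPending = true;
    }
}

/* Called at the next safe interrupt-processing point. */
static void
process_interrupts(void)
{
    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        printf("throttling: sleeping until standbys catch up\n");
        usleep(10 * 1000);      /* placeholder back-off */
    }
    InterruptPending = false;
}

int
main(void)
{
    xlog_insert_check_lag(64L * 1024 * 1024);   /* pretend 64MB of lag */
    if (InterruptPending)
        process_interrupts();                   /* the sleep happens here */
    return 0;
}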


> I'm doubtful that implementing the waits on a transactional level provides
> a
> meaningful enough amount of control - there's just too much WAL that can be
> generated within a transaction.
>


>
> Greetings,
>
> Andres Freund
>


Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-01-03 Thread SATYANARAYANA NARLAPURAM
Thanks Michael!

On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier  wrote:

> On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote:
> > I noticed that pg_receivewal fails to stream when the partial file to
> write
> > is not fully initialized and fails with the error message something like
> > below. This requires an extra step of deleting the partial file that is
> not
> > fully initialized before starting the pg_receivewal. Attaching a simple
> > patch that creates a temp file, fully initialize it and rename the file
> to
> > the desired wal segment name.
>
> Are you referring to the pre-padding when creating a new partial
> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
> the file is fully created?  What kind of error did you see?  I guess
> that a write() with ENOSPC would be more likely, but you got a
> different problem?


I see two cases: 1/ when no space is left on the device, and 2/ when the
process is taken down forcibly (a VM/container crash).


>   I don't disagree with improving such cases, but we
> should not do things so as there is a risk of leaving behind an
> infinite set of segments in case of repeated errors


Do you see a problem where the proposed patch leaves files behind? At least
in my testing I don't see any files left behind.


> , and partial
> segments are already a kind of temporary file.
>

If the .partial file exists but is not zero-padded up to the WAL segment
size (WalSegSz), then open_walfile fails with the error below. I have two
options here: 1/ continue padding the existing partial file with zeros up to
WalSegSz, or 2/ create a temp file as I did in the patch. I thought the
latter is safer because it can handle the corruption cases described below.
Thoughts?

* When streaming to files, if an existing file exists we verify that it's
* either empty (just created), or a complete WalSegSz segment (in which
* case it has been created and padded). Anything else indicates a corrupt
* file. Compressed files have no need for padding, so just ignore this
* case.


>
> -   if (dir_data->sync)
> +   if (shouldcreatetempfile)
> +   {
> +   if (durable_rename(tmpsuffixpath, targetpath) != 0)
> +   {
> +   close(fd);
> +   unlink(tmpsuffixpath);
> +   return NULL;
> +   }
> +   }
>
> durable_rename() does a set of fsync()'s, but --no-sync should not
> flush any data.
>
I need to look into this further; without it I am seeing random file close
and rename failures that disconnect the stream. Also, it appears we call
durable_rename when closing the file (dir_close) even without --no-sync.
Should we worry about the padding case?
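For reference, a standalone sketch of what a durable_rename()-style helper
boils down to (simplified; not the actual fd.c code), and where a --no-sync
mode would simply skip the fsyncs:

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static int
durable_rename_sketch(const char *oldpath, const char *newpath,
                      const char *dirpath, bool do_sync)
{
    if (do_sync)
    {
        /* Flush the file contents before the rename. */
        int fd = open(oldpath, O_RDWR);

        if (fd < 0 || fsync(fd) != 0)
        {
            if (fd >= 0)
                close(fd);
            return -1;
        }
        close(fd);
    }

    if (rename(oldpath, newpath) != 0)
        return -1;

    if (do_sync)
    {
        /* Persist the directory entry change as well. */
        int dfd = open(dirpath, O_RDONLY);

        if (dfd < 0 || fsync(dfd) != 0)
        {
            if (dfd >= 0)
                close(dfd);
            return -1;
        }
        close(dfd);
    }
    return 0;
}

int
main(void)
{
    /* e.g. rename "segment.tmp" into place as "segment.partial" */
    return durable_rename_sketch("segment.tmp", "segment.partial", ".", true);
}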

> --
> Michael
>


pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-01-02 Thread SATYANARAYANA NARLAPURAM
Hi Hackers,

I noticed that pg_receivewal fails to stream when the partial file it wants
to write is not fully initialized, reporting an error message like the one
below. This requires an extra step of deleting the partially initialized
file before restarting pg_receivewal. Attaching a simple patch that creates
a temp file, fully initializes it, and renames it to the desired WAL segment
name.

"error: write-ahead log file "00010003.partial" has 8396800
bytes, should be 0 or 16777216"
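A standalone sketch of the approach in the patch (heavily simplified; the
segment name and sizes are just examples): pre-pad a temporary file with
zeros and only rename it to the .partial name once it is complete, so a
crash or ENOSPC never leaves a half-initialized .partial file behind:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192
#define WAL_SEG_SZ  (16 * 1024 * 1024)

int
main(void)
{
    static const char zerobuf[XLOG_BLCKSZ];     /* all zeros */
    const char *tmppath = "000000010000000000000003.partial.tmp";
    const char *partialpath = "000000010000000000000003.partial";

    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 1;

    for (int written = 0; written < WAL_SEG_SZ; written += XLOG_BLCKSZ)
    {
        if (write(fd, zerobuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
        {
            /* e.g. ENOSPC: clean up, leave no bogus .partial behind */
            close(fd);
            unlink(tmppath);
            return 1;
        }
    }

    if (fsync(fd) != 0 || close(fd) != 0)
    {
        unlink(tmppath);
        return 1;
    }

    /* Only a fully padded file ever becomes the .partial segment. */
    if (rename(tmppath, partialpath) != 0)
    {
        unlink(tmppath);
        return 1;
    }
    printf("created fully pre-padded %s\n", partialpath);
    return 0;
}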

Thanks,
Satya


pg_receivewal_init_fix.patch
Description: Binary data


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-30 Thread SATYANARAYANA NARLAPURAM
On Thu, Dec 30, 2021 at 12:20 AM Dilip Kumar  wrote:

> On Thu, Dec 30, 2021 at 1:41 PM Bharath Rupireddy <
> bharath.rupireddyforpostg...@gmail.com> wrote:
>
>>
>> >
>> > Yeah, that's true, but even if we are blocking the transactions from
>> committing then also it is possible that a new connection can come and
>> generate more WAL,  yeah but I agree with the other part that if you
>> throttle after committing then the user can cancel the queries and generate
>> more WAL from those sessions as well.  But that is an extreme case where
>> application developers want to bypass the throttling and want to generate
>> more WALs.
>>
>> How about having the new hook at the start of the new txn?  If we do
>> this, when the limit for the throttling is exceeded, the current txn
>> (even if it is a long running one) continues to do the WAL insertions,
>> the next txns would get blocked. Thoughts?
>>
>
> Do you mean while StartTransactionCommand or while assigning a new
> transaction id? If it is at StartTransactionCommand then we would be
> blocking the sessions which are only performing read queries right?
>

Definitely not at StartTransactionCommand, but possibly while assigning the
transaction ID in AssignTransactionId. Blocking readers is never the intent.


> If we are doing at the transaction assignment level then we might be
> holding some of the locks so this might not be any better than throttling
> inside the commit.
>

If we define the RPO as "no transaction can commit when the WAL distance is
more than the configured number of MB", we have to throttle writes before
committing the transaction; new WAL generated by new or active connections
doesn't matter, since those transactions can't be committed and made visible
to the user. If the RPO is defined as "no new write transactions allowed
when the WAL distance exceeds the configured MB", then we can block
assignment of new transaction IDs until the RPO policy is met. IMHO,
following the sync replication semantics is easier and more explainable, as
it is already familiar to customers.
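A standalone sketch of the second definition (the names and numbers are
illustrative only): the throttle sits where a transaction first asks for an
XID, so read-only queries, which never need an XID, are unaffected:

#include <stdio.h>
#include <unistd.h>

#define MAX_RPO_LAG_BYTES (64L * 1024 * 1024)

static long current_lag = 200L * 1024 * 1024;   /* pretend standbys are behind */
static unsigned int next_xid = 1000;

/* Simulates the standbys catching up over time. */
static long
current_lag_bytes(void)
{
    current_lag -= 80L * 1024 * 1024;
    return current_lag < 0 ? 0 : current_lag;
}

/* Called only on the write path, before the first data modification. */
static unsigned int
assign_transaction_id(void)
{
    while (current_lag_bytes() > MAX_RPO_LAG_BYTES)
    {
        printf("RPO exceeded, new write transactions are held back\n");
        usleep(10 * 1000);      /* placeholder back-off */
    }
    return next_xid++;
}

int
main(void)
{
    printf("write transaction got xid %u\n", assign_transaction_id());
    return 0;
}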


>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
On Wed, Dec 29, 2021 at 10:38 PM Dilip Kumar  wrote:

> On Thu, Dec 30, 2021 at 1:09 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2021-12-29 11:34:53 -0800, SATYANARAYANA NARLAPURAM wrote:
>> > On Wed, Dec 29, 2021 at 11:31 AM Andres Freund 
>> wrote:
>> > Andres, thanks for the comments. Agreed on this based on the previous
>> > discussions on this thread. Could you please share your thoughts on
>> adding
>> > it after SyncRepWaitForLSN()?
>>
>> I don't think that's good either - you're delaying transaction commit
>> (i.e. xact becoming visible / locks being released).
>
>
> Agree with that.
>
>
>> That also has the danger
>> of increasing lock contention (albeit more likely to be heavyweight locks
>> /
>> serializable state). It'd have to be after the transaction actually
>> committed.
>>
>
> Yeah, I think that would make sense, even though we will be allowing a new
> backend to get connected insert WAL, and get committed but after that, it
> will be throttled.  However, if the number of max connections will be very
> high then even after we detected a lag there a significant amount WAL could
> be generated, even if we keep long-running transactions aside.  But I think
> still it will serve the purpose of what Satya is trying to achieve.
>

I am afraid there are problems with making the RPO check after releasing the
locks. By that time the transaction is committed and visible to other
backends (ProcArrayEndTransaction has already been called), even though the
intention is to block committing transactions that violate the defined RPO.
Even if we block existing connections from starting a new transaction, it is
still possible to do writes by opening a new connection or canceling the
query. I am not too worried about the lock contention, as the system is
already hosed because of the policy. This behavior is very similar to what
happens when the sync standby is not responding. Thoughts?




>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Logging replication state changes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
On Wed, Dec 29, 2021 at 2:04 PM Tom Lane  wrote:

> SATYANARAYANA NARLAPURAM  writes:
> > I noticed that below critical replication state changes are DEBUG1 level
> > logged. Any concern about changing the below two messages log level to
> LOG?
>
> Why?  These seem like perfectly routine messages.
>

Consider a scenario where we have a primary and two sync standbys (s1 and
s2), where s1 is the preferred failover target and s2 is next, with
synchronous_standby_names = 'FIRST 1 (s1, s2)'. If s1's streaming
replication is broken and re-established because of a planned or unplanned
event, then s2 participates in the sync commits and makes sure writes are
not stalled on the primary. I would like to know the time window in which s1
is not actively acknowledging commits and the writes depend on s2. Also, if
the service layer decides to fail over to s2 instead of s1 because s1 is
lagging, I need evidence in the log to explain the behavior.



> regards, tom lane
>


Logging replication state changes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
Hi hackers,

I noticed that below critical replication state changes are DEBUG1 level
logged. Any concern about changing the below two messages log level to LOG?
If this is too verbose, we can introduce a new GUC,
log_replication_state_changes that logs the replication state changes when
enabled irrespective of the log level.

1/

/*
 * If we're in catchup state, move to streaming.  This is an
 * important state change for users to know about, since before
 * this point data loss might occur if the primary dies and we
 * need to failover to the standby. The state change is also
 * important for synchronous replication, since commits that
 * started to wait at that point might wait for some time.
 */
if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
{
ereport(DEBUG1,
(errmsg_internal("\"%s\" has now caught up with upstream server",
application_name)));
WalSndSetState(WALSNDSTATE_STREAMING);
}

2/

ereport(DEBUG1,
(errmsg_internal("standby \"%s\" now has synchronous standby priority %u",
application_name, priority)));
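A standalone sketch of the GUC idea (the levels, the emit function, and the
log_replication_state_changes variable below are simplified stand-ins for
the ereport machinery, not actual backend code): the same messages are
promoted to LOG only when the GUC is enabled, instead of changing DEBUG1 to
LOG unconditionally:

#include <stdbool.h>
#include <stdio.h>

enum { DEBUG1 = 0, LOG = 1 };
static int  min_messages = LOG;                     /* like log_min_messages */
static bool log_replication_state_changes = true;   /* the proposed GUC */

static void
emit(int level, const char *msg)
{
    if (level >= min_messages)
        printf("%s: %s\n", level == LOG ? "LOG" : "DEBUG1", msg);
}

int
main(void)
{
    /* Promote the level only when the GUC is on. */
    int level = log_replication_state_changes ? LOG : DEBUG1;

    emit(level, "\"standby1\" has now caught up with upstream server");
    emit(level, "standby \"standby1\" now has synchronous standby priority 1");
    return 0;
}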


Thanks,
Satya


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
On Wed, Dec 29, 2021 at 11:31 AM Andres Freund  wrote:

> Hi,
>
> On 2021-12-27 16:40:28 -0800, SATYANARAYANA NARLAPURAM wrote:
> > > Yet another problem is that if we are in XlogInsert() that means we are
> > > holding the buffer locks on all the pages we have modified, so if we
> add a
> > > hook at that level which can make it wait then we would also block any
> of
> > > the read operations needed to read from those buffers.  I haven't
> thought
> > > what could be better way to do this but this is certainly not good.
> > >
> >
> > Yes, this is a problem. The other approach is adding a hook at
> > XLogWrite/XLogFlush?
>
> That's pretty much the same - XLogInsert() can trigger an
> XLogWrite()/Flush().
>
> I think it's a complete no-go to add throttling to these places. It's quite
> possible that it'd cause new deadlocks, and it's almost guaranteed to have
> unintended consequences (e.g. replication falling back further because
> XLogFlush() is being throttled).
>
> I also don't think it's a sane thing to add hooks to these places. It's
> complicated enough as-is, adding the chance for random other things to
> happen
> during such crucial operations will make it even harder to maintain.
>

Andres, thanks for the comments. Agreed on this based on the previous
discussions on this thread. Could you please share your thoughts on adding
it after SyncRepWaitForLSN()?


>
> Greetings,
>
> Andres Freund
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
On Wed, Dec 29, 2021 at 11:16 AM Stephen Frost  wrote:

> Greetings,
>
> On Wed, Dec 29, 2021 at 14:04 SATYANARAYANA NARLAPURAM <
> satyanarlapu...@gmail.com> wrote:
>
>> Stephen, thank you!
>>
>> On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost  wrote:
>>
>>> Greetings,
>>>
>>> * SATYANARAYANA NARLAPURAM (satyanarlapu...@gmail.com) wrote:
>>> > On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar 
>>> wrote:
>>> > > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
>>> > > satyanarlapu...@gmail.com> wrote:
>>> > >>> Actually all the WAL insertions are done under a critical section
>>> > >>> (except few exceptions), that means if you see all the references
>>> of
>>> > >>> XLogInsert(), it is always called under the critical section and
>>> that is my
>>> > >>> main worry about hooking at XLogInsert level.
>>> > >>>
>>> > >>
>>> > >> Got it, understood the concern. But can we document the limitations
>>> of
>>> > >> the hook and let the hook take care of it? I don't expect an error
>>> to be
>>> > >> thrown here since we are not planning to allocate memory or make
>>> file
>>> > >> system calls but instead look at the shared memory state and add
>>> delays
>>> > >> when required.
>>> > >>
>>> > >>
>>> > > Yet another problem is that if we are in XlogInsert() that means we
>>> are
>>> > > holding the buffer locks on all the pages we have modified, so if we
>>> add a
>>> > > hook at that level which can make it wait then we would also block
>>> any of
>>> > > the read operations needed to read from those buffers.  I haven't
>>> thought
>>> > > what could be better way to do this but this is certainly not good.
>>> > >
>>> >
>>> > Yes, this is a problem. The other approach is adding a hook at
>>> > XLogWrite/XLogFlush? All the other backends will be waiting behind the
>>> > WALWriteLock. The process that is performing the write enters into a
>>> busy
>>> > loop with small delays until the criteria are met. Inability to
>>> process the
>>> > interrupts inside the critical section is a challenge in both
>>> approaches.
>>> > Any other thoughts?
>>>
>>> Why not have this work the exact same way sync replicas do, except that
>>> it's based off of some byte/time lag for some set of async replicas?
>>> That is, in RecordTransactionCommit(), perhaps right after the
>>> SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
>>> seems like there's a lot of similarity.
>>>
>>
>> I was thinking of achieving log governance (throttling WAL MB/sec) and
>> also providing RPO guarantees. In this model, it is hard to throttle WAL
>> generation of a long running transaction (for example copy/select into).
>>
>
> Long running transactions have a lot of downsides and are best
> discouraged. I don’t know that we should be designing this for that case
> specifically, particularly given the complications it would introduce as
> discussed on this thread already.
>
> However, this meets my RPO needs. Are you in support of adding a hook or
>> the actual change? IMHO, the hook allows more creative options. I can go
>> ahead and make a patch accordingly.
>>
>
> I would think this would make more sense as part of core rather than a
> hook, as that then requires an extension and additional setup to get going,
> which raises the bar quite a bit when it comes to actually being used.
>

Sounds good, I will work on making the changes accordingly.

>
> Thanks,
>
> Stephen
>
>>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-29 Thread SATYANARAYANA NARLAPURAM
Stephen, thank you!

On Wed, Dec 29, 2021 at 5:46 AM Stephen Frost  wrote:

> Greetings,
>
> * SATYANARAYANA NARLAPURAM (satyanarlapu...@gmail.com) wrote:
> > On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar 
> wrote:
> > > On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> > > satyanarlapu...@gmail.com> wrote:
> > >>> Actually all the WAL insertions are done under a critical section
> > >>> (except few exceptions), that means if you see all the references of
> > >>> XLogInsert(), it is always called under the critical section and
> that is my
> > >>> main worry about hooking at XLogInsert level.
> > >>>
> > >>
> > >> Got it, understood the concern. But can we document the limitations of
> > >> the hook and let the hook take care of it? I don't expect an error to
> be
> > >> thrown here since we are not planning to allocate memory or make file
> > >> system calls but instead look at the shared memory state and add
> delays
> > >> when required.
> > >>
> > >>
> > > Yet another problem is that if we are in XlogInsert() that means we are
> > > holding the buffer locks on all the pages we have modified, so if we
> add a
> > > hook at that level which can make it wait then we would also block any
> of
> > > the read operations needed to read from those buffers.  I haven't
> thought
> > > what could be better way to do this but this is certainly not good.
> > >
> >
> > Yes, this is a problem. The other approach is adding a hook at
> > XLogWrite/XLogFlush? All the other backends will be waiting behind the
> > WALWriteLock. The process that is performing the write enters into a busy
> > loop with small delays until the criteria are met. Inability to process
> the
> > interrupts inside the critical section is a challenge in both approaches.
> > Any other thoughts?
>
> Why not have this work the exact same way sync replicas do, except that
> it's based off of some byte/time lag for some set of async replicas?
> That is, in RecordTransactionCommit(), perhaps right after the
> SyncRepWaitForLSN() call, or maybe even add this to that function?  Sure
> seems like there's a lot of similarity.
>

I was thinking of achieving log governance (throttling WAL MB/sec) and also
providing RPO guarantees. In this model, it is hard to throttle WAL
generation of a long running transaction (for example copy/select into).
However, this meets my RPO needs. Are you in support of adding a hook or
the actual change? IMHO, the hook allows more creative options. I can go
ahead and make a patch accordingly.




> Thanks,
>
> Stephen
>


Re: Report checkpoint progress in server logs

2021-12-29 Thread SATYANARAYANA NARLAPURAM
Coincidentally, I was thinking about the same thing yesterday, after getting
tired of waiting for a checkpoint to complete on a server.

On Wed, Dec 29, 2021 at 7:41 AM Tom Lane  wrote:

> Magnus Hagander  writes:
> >> Therefore, reporting the checkpoint progress in the server logs, much
> >> like [1], seems to be the best way IMO.
>
> > I find progress reporting in the logfile to generally be a terrible
> > way of doing things, and the fact that we do it for the startup
> > process is/should be only because we have no other choice, not because
> > it's the right choice.
>
> I'm already pretty seriously unhappy about the log-spamming effects of
> 64da07c41 (default to log_checkpoints=on), and am willing to lay a side
> bet that that gets reverted after we have some field experience with it.
> This proposal seems far worse from that standpoint.  Keep in mind that
> our out-of-the-box logging configuration still doesn't have any log
> rotation ability, which means that the noisier the server is in normal
> operation, the sooner you fill your disk.
>

The server is not open for queries while running the end-of-recovery
checkpoint, so a catalog view may not help there, but a process title
change or logging would be helpful in such cases. When the server is
running recovery, anxious customers repeatedly ask for the ETA of
recovery completion, and not having visibility into these operations
makes life difficult for the DBA/operations teams.


>
> > I think the right choice to solve the *general* problem is the
> > mentioned pg_stat_progress_checkpoints.
>
> +1
>

+1 to this. We need at least a trace of the number of buffers to sync
(num_to_scan) before the checkpoint starts, instead of just emitting the
stats at the end.


Bharat, it would be good to show the buffers-synced counter and the total
buffers to sync, the checkpointer pid, the substep it is running, whether it
is on target for completion, and the checkpoint reason (manual/timed/forced).
BufferSync has several variables tracking the sync progress locally, and we
may need some refactoring here.
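
To illustrate the kind of trace I have in mind, a minimal sketch (the helper
name, the 10000-buffer interval, and the message wording are made up here; it
assumes BufferSync's existing local counters num_processed and num_to_scan are
passed in):

    /* Hypothetical helper, called from BufferSync()'s write loop. */
    static void
    report_checkpoint_sync_progress(int num_processed, int num_to_scan)
    {
        /* Emit a progress line every 10000 buffers; the interval is arbitrary. */
        if (num_to_scan > 0 && num_processed % 10000 == 0)
            ereport(LOG,
                    (errmsg("checkpoint sync progress: %d of %d buffers processed (%.1f%%)",
                            num_processed, num_to_scan,
                            (double) num_processed * 100.0 / num_to_scan)));
    }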


>
> regards, tom lane
>
>
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-27 Thread SATYANARAYANA NARLAPURAM
On Sat, Dec 25, 2021 at 9:25 PM Dilip Kumar  wrote:

> On Sun, Dec 26, 2021 at 10:36 AM SATYANARAYANA NARLAPURAM <
> satyanarlapu...@gmail.com> wrote:
>
>>
>>> Actually all the WAL insertions are done under a critical section
>>> (except few exceptions), that means if you see all the references of
>>> XLogInsert(), it is always called under the critical section and that is my
>>> main worry about hooking at XLogInsert level.
>>>
>>
>> Got it, understood the concern. But can we document the limitations of
>> the hook and let the hook take care of it? I don't expect an error to be
>> thrown here since we are not planning to allocate memory or make file
>> system calls but instead look at the shared memory state and add delays
>> when required.
>>
>>
> Yet another problem is that if we are in XlogInsert() that means we are
> holding the buffer locks on all the pages we have modified, so if we add a
> hook at that level which can make it wait then we would also block any of
> the read operations needed to read from those buffers.  I haven't thought
> what could be better way to do this but this is certainly not good.
>

Yes, this is a problem. The other approach is adding a hook at
XLogWrite/XLogFlush. All the other backends would be waiting behind the
WALWriteLock, while the process that is performing the write enters a busy
loop with small delays until the criteria are met. The inability to process
interrupts inside a critical section is a challenge in both approaches.
Any other thoughts?
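
For what it's worth, the wait loop itself would be tiny. A rough sketch,
assuming an imaginary GUC max_replica_lag_bytes and an imaginary helper
GetMaxReplicaLagBytes() that scans the replication slots:

    /* Hypothetical throttle, e.g. invoked from XLogFlush() before the write.
     * Note: we may be inside a critical section holding WALWriteLock, so we
     * cannot ereport(ERROR) or service interrupts here; we can only sleep. */
    static void
    ThrottleWalWrites(void)
    {
        while (max_replica_lag_bytes >= 0 &&
               GetMaxReplicaLagBytes() > (uint64) max_replica_lag_bytes)
            pg_usleep(10000L);      /* re-check every 10ms */
    }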


>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-25 Thread SATYANARAYANA NARLAPURAM
On Sat, Dec 25, 2021 at 6:01 PM Dilip Kumar  wrote:

> On Sun, Dec 26, 2021 at 3:52 AM SATYANARAYANA NARLAPURAM <
> satyanarlapu...@gmail.com> wrote:
>
>>
>>
>> On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar 
>> wrote:
>>
>>> On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
>>> satyanarlapu...@gmail.com> wrote:
>>>
>>>>
>>>>>
>>>> XLogInsert in my opinion is the best place to call it and the hook can
>>>> be something like this "void xlog_insert_hook(NULL)" as all the throttling
>>>> logic required is the current flush position which can be obtained
>>>> from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
>>>>
>>>
>>> IMHO, it is not a good idea to call an external hook function inside a
>>> critical section.  Generally, we ensure that we do not call any code path
>>> within a critical section which can throw an error and if we start calling
>>> the external hook then we lose that control.
>>>
>>
>> Thank you for the comment. XLogInsertRecord is inside a critical section
>> but not XLogInsert. Am I missing something?
>>
>
> Actually all the WAL insertions are done under a critical section (except
> few exceptions), that means if you see all the references of XLogInsert(),
> it is always called under the critical section and that is my main worry
> about hooking at XLogInsert level.
>

Got it, understood the concern. But can we document the limitations of the
hook and let the hook take care of it? I don't expect an error to be thrown
here since we are not planning to allocate memory or make file system calls
but instead look at the shared memory state and add delays when required.



>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-25 Thread SATYANARAYANA NARLAPURAM
On Fri, Dec 24, 2021 at 3:13 AM Dilip Kumar  wrote:

> On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM <
> satyanarlapu...@gmail.com> wrote:
>
>>
>>>
>> XLogInsert in my opinion is the best place to call it and the hook can be
>> something like this "void xlog_insert_hook(NULL)" as all the throttling
>> logic required is the current flush position which can be obtained
>> from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch.
>>
>
> IMHO, it is not a good idea to call an external hook function inside a
> critical section.  Generally, we ensure that we do not call any code path
> within a critical section which can throw an error and if we start calling
> the external hook then we lose that control.
>

Thank you for the comment. XLogInsertRecord is inside a critical section
but not XLogInsert. Am I missing something?


> It should be blocked at the operation level itself e.g. ALTER TABLE READ
> ONLY, or by some other hook at a little higher level.
>

There is a lot of maintenance overhead with a custom implementation at the
individual database and table level. This doesn't provide the necessary
control that I am looking for.




>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Fwd: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-23 Thread SATYANARAYANA NARLAPURAM
Please find the attached draft patch.

On Thu, Dec 23, 2021 at 2:47 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
>  wrote:
> >
> > Hi Hackers,
> >
> > I am considering implementing RPO (recovery point objective) enforcement
> feature for Postgres where the WAL writes on the primary are stalled when
> the WAL distance between the primary and standby exceeds the configured
> (replica_lag_in_bytes) threshold. This feature is useful particularly in
> the disaster recovery setups where primary and standby are in different
> regions and synchronous replication can't be set up for latency and
> performance reasons yet requires some level of RPO enforcement.
>
> +1 for the idea in general. However, blocking writes on primary seems
> an extremely radical idea. The replicas can fall behind transiently at
> times and blocking writes on the primary may stop applications failing
> for these transient times. This is not a problem if the applications
> have retry logic for the writes. How about blocking writes on primary
> if the replicas fall behind the primary for a certain period of time?
>

My proposal is to block the caller from writing until the lag situation
improves. I don't want to throw any errors and fail the transaction. I think
we are aligned?


>
> > The idea here is to calculate the lag between the primary and the
> standby (Async?) server during XLogInsert and block the caller until the
> lag is less than the threshold value. We can calculate the max lag by
> iterating over ReplicationSlotCtl->replication_slots.
>
> The "falling behind" can also be quantified by the number of
> write-transactions on the primary. I think it's good to have the users
> choose what the "falling behind" means for them. We can have something
> like the "recovery_target" param with different options name, xid,
> time, lsn.
>

Transactions can be of arbitrary size and duration, so these options may
not provide the desired results. Time is a worthy option to add.


>
> > If this is not something we don't want to do in the core, at least
> adding a hook for XlogInsert is of great value.
>
> IMHO, this feature may not be needed by everyone, the hook-way seems
> reasonable so that the postgres vendors can provide different
> implementations (for instance they can write an extension that
> implements this hook which can block writes on primary, write some log
> messages, inform some service layer of the replicas falling behind the
> primary etc.). If we were to have the hook in XLogInsert which gets
> called so frequently or XLogInsert is a hot-path, the hook really
> should do as little work as possible, otherwise the write operations
> latency may increase.
>

A hook is a good start. If there is enough interest, an extension can be
added to the contrib module.


> > A few other scenarios I can think of with the hook are:
> >
> > Enforcing RPO as described above
> > Enforcing rate limit and slow throttling when sync standby is falling
> behind (could be flush lag or replay lag)
> > Transactional log rate governance - useful for cloud providers to
> provide SKU sizes based on allowed WAL writes.
> >
> > Thoughts?
>
> The hook can help to achieve the above objectives but where to place
> it and what parameters it should take as input (or what info it should
> emit out of the server via the hook) are important too.
>

In my opinion, XLogInsert is the best place to call it, and the hook can be
something like "void xlog_insert_hook(void)", since all the throttling logic
needs is the current flush position, which can be obtained from
GetFlushRecPtr(), and the ReplicationSlotCtl. Attached is a draft patch.
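
For reference, the shape of the plumbing is roughly as follows (a sketch of
the idea only, not the exact contents of the attached patch):

    /* In xlog.h: a hook that an extension can install from _PG_init(). */
    typedef void (*xlog_insert_hook_type) (void);
    extern PGDLLIMPORT xlog_insert_hook_type xlog_insert_hook;

    /* In XLogInsert(), before inserting the record: give the plugin a
     * chance to delay the caller, e.g. based on replica lag. */
    if (xlog_insert_hook)
        (*xlog_insert_hook) ();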


>
> Having said all, the RPO feature can also be implemented outside of
> the postgres, a simple implementation could be - get the primary
> current wal lsn using pg_current_wal_lsn and all the replicas
> restart_lsn using pg_replication_slot, if they differ by certain
> amount, then issue ALTER SYSTEM SET READ ONLY command [1] on the
> primary, this requires the connections to the server and proper access
> rights. This feature can also be implemented as an extension (without
> the hook) which doesn't require any connections to the server yet can
> access the required info primary current_wal_lsn, restart_lsn of the
> replication slots etc, but the RPO enforcement may not be immediate as
> the server doesn't have any hooks in XLogInsert or some other area.
>

READ ONLY is a decent choice, but it can fail the writes, or may not take
effect until the end of the transaction?


> [1] -
> https://www.postgresql.org/message-id/CAAJ_b967uKBiW6gbHr5aPzweURYjEGv333FHVHxvJmMhanwHXA%40mail.gmail.com
>
> Regards,
> Bharath Rupireddy.
>


0001-Add-xlog_insert_hook-to-give-control-to-the-plugins.patch
Description: Binary data


Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-23 Thread SATYANARAYANA NARLAPURAM
On Thu, Dec 23, 2021 at 5:18 AM Ashutosh Bapat 
wrote:

> On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
>  wrote:
> >
> > Hi Hackers,
> >
> > I am considering implementing RPO (recovery point objective) enforcement
> feature for Postgres where the WAL writes on the primary are stalled when
> the WAL distance between the primary and standby exceeds the configured
> (replica_lag_in_bytes) threshold. This feature is useful particularly in
> the disaster recovery setups where primary and standby are in different
> regions and synchronous replication can't be set up for latency and
> performance reasons yet requires some level of RPO enforcement.
>
> Limiting transaction rate when the standby fails behind is a good feature
> ...
>
> >
> > The idea here is to calculate the lag between the primary and the
> standby (Async?) server during XLogInsert and block the caller until the
> lag is less than the threshold value. We can calculate the max lag by
> iterating over ReplicationSlotCtl->replication_slots. If this is not
> something we don't want to do in the core, at least adding a hook for
> XlogInsert is of great value.
>
> but doing it in XLogInsert does not seem to be a good idea.


Isn't XLogInsert the best place to throttle/govern in a simple and fair
way, particularly for long-running transactions on the server?


> It's a
> common point for all kinds of logging including VACUUM. We could
> accidently stall a critical VACUUM operation because of that.
>

Agreed, but again this is a policy decision that the DBA can relax/enforce.
I expect the RPO to be in the range of a few hundred MBs to GBs, and on a
healthy system the lag typically never comes close to this value. The hook
implementation can take care of the nitty-gritty details of policy
enforcement based on the needs: for example, not throttling some backend
processes like vacuum and the checkpointer; throttling based on roles, for
example not throttling superuser connections; and throttling based on replay
lag, write lag, a checkpoint taking longer than expected, or the disk getting
close to full. Each of these can easily be translated into GUCs. Depending on
the direction of the thread on the hook vs. a feature in core, I can add more
implementation details.
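
To make the policy side concrete, a sketch of what the checks inside such a
hook implementation could look like (all names here are illustrative;
am_superuser_at_connect stands for a flag the extension would cache at
connection start, since a catalog lookup may not be safe at the point the
hook runs):

    /* Hypothetical policy filter inside a throttling hook implementation. */
    static bool
    ShouldThrottleThisProcess(void)
    {
        /* Never throttle maintenance/auxiliary processes. */
        if (AmCheckpointerProcess() || AmWalWriterProcess() ||
            IsAutoVacuumWorkerProcess())
            return false;

        /* Never throttle superuser sessions (flag cached at connect time). */
        if (am_superuser_at_connect)
            return false;

        return true;
    }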



> As Bharath described, it better be handled at the application level
> monitoring.
>

Both RPO-based WAL throttling and application-level monitoring can co-exist,
as each one has its own merits and challenges. Every application developer
has to implement their own throttling logic, and oftentimes it is hard to
get it right.


> --
> Best Wishes,
> Ashutosh Bapat
>


Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

2021-12-22 Thread SATYANARAYANA NARLAPURAM
Hi Hackers,

I am considering implementing an RPO (recovery point objective) enforcement
feature for Postgres where WAL writes on the primary are stalled when
the WAL distance between the primary and the standby exceeds the configured
(replica_lag_in_bytes) threshold. This feature is particularly useful in
disaster recovery setups where the primary and standby are in different
regions and synchronous replication can't be set up for latency and
performance reasons, yet some level of RPO enforcement is required.

The idea here is to calculate the lag between the primary and the standby
(async?) server during XLogInsert and block the caller until the lag is
less than the threshold value. We can calculate the max lag by iterating
over ReplicationSlotCtl->replication_slots. If this is not something we
want to do in core, at least adding a hook for XLogInsert is of
great value.
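
A bare-bones sketch of that lag calculation (a sketch only, assuming today's
slot and xlog APIs; slot-type filtering and invalid restart_lsn handling are
glossed over, and restart_lsn is only a conservative proxy for the standby's
flush position):

    /* Sketch: maximum replica lag in bytes, derived from replication slots. */
    static uint64
    GetMaxReplicaLagBytes(void)
    {
        XLogRecPtr  flushPtr = GetFlushRecPtr();
        uint64      maxlag = 0;
        int         i;

        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            XLogRecPtr  restart_lsn;

            if (!s->in_use)
                continue;

            SpinLockAcquire(&s->mutex);
            restart_lsn = s->data.restart_lsn;
            SpinLockRelease(&s->mutex);

            if (!XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < flushPtr)
                maxlag = Max(maxlag, (uint64) (flushPtr - restart_lsn));
        }
        LWLockRelease(ReplicationSlotControlLock);

        return maxlag;
    }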

A few other scenarios I can think of with the hook are:

   1. Enforcing RPO as described above
   2. Enforcing rate limit and slow throttling when sync standby is falling
   behind (could be flush lag or replay lag)
   3. Transactional log rate governance - useful for cloud providers to
   provide SKU sizes based on allowed WAL writes.

Thoughts?

Thanks,
Satya


Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-06 Thread SATYANARAYANA NARLAPURAM
If the segment size is 16MB it shouldn't take much time, but with higher
segment sizes this can be a problem. Then again, the current segment has to
be 75% full before a new one is pre-created. I am not sure how much we gain.
Do you have some numbers with different segment sizes?

On Mon, Dec 6, 2021 at 4:51 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> The function PreallocXlogFiles doesn't get called during
> end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> becomes operational after the end-of-recovery checkpoint and may need
> WAL files. However, I'm not sure how beneficial it is going to be if
> the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> extra WAL file).
>
> Thoughts?
>
> [1]
> /*
>  * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>  * issued at a different time.
>  */
> if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
> shutdown = true;
> else
> shutdown = false;
>
> /*
>  * Make more log segments if needed.  (Do this after recycling old log
>  * segments, since that may supply some of the needed files.)
>  */
> if (!shutdown)
> PreallocXlogFiles(recptr, checkPoint.ThisTimeLineID);
>
> Regards,
> Bharath Rupireddy.
>
>
>


Re: O(n) tasks cause lengthy startups and checkpoints

2021-12-01 Thread SATYANARAYANA NARLAPURAM
+1 to the idea. I don't see a reason why the checkpointer has to do all of
that. Keeping the checkpoint to the minimal essential work helps servers
recover faster in the event of a crash.

RemoveOldXlogFiles is also an O(N) operation that could at least be avoided
during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY) checkpoint. When a
sufficient number of WAL files have accumulated and the previous checkpoint
did not get a chance to clean them up, this can increase the unavailability
of the server.

RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
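
Something as small as guarding that call in CreateCheckPoint() would do it
(a sketch; whether deferring the cleanup is safe with respect to WAL
recycling would need more thought):

    /* Sketch: defer old-WAL cleanup at the end-of-recovery checkpoint so the
     * server can open for business sooner; a later checkpoint catches up. */
    if (!(flags & CHECKPOINT_END_OF_RECOVERY))
        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);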



On Wed, Dec 1, 2021 at 12:24 PM Bossart, Nathan  wrote:

> Hi hackers,
>
> Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
> avoid individually syncing all database files after a crash.  However,
> as noted earlier this year [0], there are still a number of O(n) tasks
> that affect startup and checkpointing that I'd like to improve.
> Below, I've attempted to summarize each task and to offer ideas for
> improving matters.  I'll likely split each of these into its own
> thread, given there is community interest for such changes.
>
> 1) CheckPointSnapBuild(): This function loops through
>pg_logical/snapshots to remove all snapshots that are no longer
>needed.  If there are many entries in this directory, this can take
>a long time.  The note above this function indicates that this is
>done during checkpoints simply because it is convenient.  IIUC
>there is no requirement that this function actually completes for a
>given checkpoint.  My current idea is to move this to a new
>maintenance worker.
> 2) CheckPointLogicalRewriteHeap(): This function loops through
>pg_logical/mappings to remove old mappings and flush all remaining
>ones.  IIUC there is no requirement that the "remove old mappings"
>part must complete for a given checkpoint, but the "flush all
>remaining" portion allows replay after a checkpoint to only "deal
>with the parts of a mapping that have been written out after the
>checkpoint started."  Therefore, I think we should move the "remove
>old mappings" part to a new maintenance worker (probably the same
>one as for 1), and we should consider using syncfs() for the "flush
>all remaining" part.  (I suspect the main argument against the
>latter will be that it could cause IO spikes.)
> 3) RemovePgTempFiles(): This step can delay startup if there are many
>temporary files to individually remove.  This step is already
>optionally done after a crash via the remove_temp_files_after_crash
>GUC.  I propose that we have startup move the temporary file
>directories aside and create new ones, and then a separate worker
>(probably the same one from 1 and 2) could clean up the old files.
> 4) StartupReorderBuffer(): This step deletes logical slot data that
>has been spilled to disk.  This code appears to be written to avoid
>deleting different types of files in these directories, but AFAICT
>there shouldn't be any other files.  Therefore, I think we could do
>something similar to 3 (i.e., move the directories aside during
>startup and clean them up via a new maintenance worker).
>
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing.  I imagine such a
> process would come in handy down the road, too.  WDYT?
>
> Nathan
>
> [0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com
>
>


Re: pg_replslotdata - a tool for displaying replication slot information

2021-11-30 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 30, 2021 at 9:47 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Dec 1, 2021 at 12:13 AM Bossart, Nathan 
> wrote:
> >
> > On 11/30/21, 6:14 AM, "Peter Eisentraut" <
> peter.eisentr...@enterprisedb.com> wrote:
> > > On 23.11.21 06:09, Bharath Rupireddy wrote:
> > >> The replication slots data is stored in binary format on the disk
> under
> > >> the pg_replslot/<> directory which isn't human readable. If
> > >> the server is crashed/down (for whatever reasons) and unable to come
> up,
> > >> currently there's no way for the user/admin/developer to know what
> were
> > >> all the replication slots available at the time of server crash/down
> to
> > >> figure out what's the restart lsn, xid, two phase info or types of
> slots
> > >> etc.
> > >
> > > What do you need that for?  You can't do anything with a replication
> > > slot while the server is down.
> >
> > One use-case might be to discover the value you need to set for
> > max_replication_slots, although it's pretty trivial to discover the
> > number of replication slots by looking at the folder directly.
>
> Apart from the above use-case, one can do some exploratory analysis on
> the replication slot information after the server crash, this may be
> useful for RCA or debugging purposes, for instance:
> 1) to look at the restart_lsn of the slots to get to know why there
> were many WAL files filled up on the disk (because of the restart_lsn
> being low)
>

In a disk-full scenario caused by WAL, this tool comes in handy for
identifying which WAL files can be deleted to free up space, and it also
helps assess accidental deletion of WAL files. I am not sure if there is a
tool to help clean up WAL (maybe invoking the archive_command too?) without
impacting physical/logical slots and while respecting the last checkpoint
location, but if one exists it would be handy.


Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-30 Thread SATYANARAYANA NARLAPURAM
On Tue, Nov 30, 2021 at 4:54 PM Michael Paquier  wrote:

> On Tue, Nov 30, 2021 at 05:58:15PM -0500, David Steele wrote:
> > The main objections as I recall are that it is much harder for simple
> backup
> > scripts and commercial backup integrations to hold a connection to
> postgres
> > open and write the backup label separately into the backup.
>
> I don't quite understand why this argument would not hold even today,
> even if I'd like to think that more people are using pg_basebackup.
>
> > I did figure out how to keep the safe part of exclusive backup (not
> having
> > to maintain a connection) while removing the dangerous part (writing
> > backup_label into PGDATA), but it was a substantial amount of work and I
> > felt that it had little chance of being committed.
>
> Which was, I guess, done by storing the backup_label contents within a
> file different than backup_label, still maintained in the main data
> folder to ensure that it gets included in the backup?
>

Non-exclusive backup has significant advantages over exclusive backups, but I
would like to add a few comments on the simplicity of exclusive backups:

1/ It is not uncommon nowadays to take a snapshot-based backup. Exclusive
backup simplifies this story, as the backup_label file is part of the
snapshot. Otherwise, one needs to store it somewhere outside as snapshot
metadata and copy this file over to the data directory during restore (after
creating a disk from the snapshot). The typical steps are: start the backup
with pg_start_backup(), take the disk snapshot, run pg_stop_backup(), and
then mark the snapshot as consistent and add some creation-time metadata.

2/ Control plane code responsible for taking backups is simpler with
exclusive backups than non-exclusive ones, as it doesn't have to maintain a
connection to the server, particularly when that orchestration runs outside
the machine the Postgres server is running on.

IMHO, we should either remove the support for it or improve it, but not
leave it hanging there.


Re: Synchronizing slots from primary to standby

2021-11-28 Thread SATYANARAYANA NARLAPURAM
> 3) Instead of the subscriber pulling the slot info, why can't the
> publisher (via the walsender or a new bg worker maybe?) push the
> latest slot info? I'm not sure we want to add more functionality to
> the walsender, if yes, isn't it going to be much simpler?
>

The standby pulling the information, or at least making the first attempt to
connect to the primary, is a better design, as the primary doesn't need to
spend its cycles repeatedly connecting to an unreachable standby. In fact,
the primary wouldn't even need to know about its followers, for example
log-shipping standbys.


Switching XLog source from archive to streaming when primary available

2021-11-28 Thread SATYANARAYANA NARLAPURAM
Hi Hackers,

When the standby can't connect to the primary, it switches the XLog
source from streaming to archive and continues in that state as long as it
can get WAL from the archive location. On a server with high WAL activity,
getting the WAL from the archive is typically slower than streaming it from
the primary, and the standby may never exit that state. This not only
increases the lag on the standby but also adversely impacts the primary, as
WAL accumulates and vacuum is not able to remove the dead tuples. As a
mitigation, DBAs can remove/advance the slot or remove the restore_command
on the standby, but this is manual work I am trying to avoid. I would like
to propose the following (a rough sketch of the mechanism follows the list);
please let me know your thoughts.

   - Automatically attempt to switch the source from archive to streaming
   when primary_conninfo is set, after replaying 'N' WAL segments, governed
   by a GUC retry_primary_conn_after_wal_segments
   - When retry_primary_conn_after_wal_segments is set to -1, the feature
   is disabled
   - When the retry attempt fails, switch back to the archive
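
Roughly, the change would sit in the source-selection logic of
WaitForWALToBecomeAvailable(); a sketch only, where the GUC and the
per-source segment counter are illustrative names rather than existing code:

    /* Sketch: after replaying N segments from the archive, force one attempt
     * to stream from the primary again before falling back to the archive. */
    if (currentSource == XLOG_FROM_ARCHIVE &&
        retry_primary_conn_after_wal_segments >= 0 &&
        segments_replayed_from_archive >= retry_primary_conn_after_wal_segments &&
        PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
    {
        segments_replayed_from_archive = 0;
        currentSource = XLOG_FROM_STREAM;
    }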

Thanks,
Satya


Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-25 Thread SATYANARAYANA NARLAPURAM
Thanks Michael!

This is a known issue with exclusive backups, which is a reason why
> non-exclusive backups have been implemented.  pg_basebackup does that,
> and using "false" as the third argument of pg_start_backup() would
> have the same effect.  So I would recommend to switch to that.
>

Is there a plan in place to remove the exclusive backup option from the
core in PG 15/16? If we are keeping it then why not make it better?


Postgres restart in the middle of exclusive backup and the presence of backup_label file

2021-11-24 Thread SATYANARAYANA NARLAPURAM
Hi Hackers,

If Postgres restarts while an exclusive backup is in progress, it runs
recovery from the checkpoint identified by the backup_label file instead
of the one in the control file. This can cause a long recovery, or recovery
can even fail if the WAL records corresponding to that checkpoint location
have been removed. I can write a layer in my control plane to remove the
backup_label file when I know the server is not being restored from a base
backup, but I don't see a reason why everyone has to repeat this step. Am I
missing something?

If there is no standby.signal or recovery.signal, what is the use case for
honoring the backup_label file? Even when they exist, for a long-running
recovery, should we honor the backup_label file when the majority of the WAL
has already been applied? It does slow down the recovery on restart, right,
as it has to start all the way from the beginning?

Thanks,
Satya


Re: Synchronous commit behavior during network outage

2021-04-21 Thread SATYANARAYANA NARLAPURAM
>
> This can be an option for us in our case. But there also needs to be a
> process how to detect these "stuck commits" and how to invalidate/remove
> them, because in reality, if the app/user would not see the change in the
> database, it/he/she will try to insert/delete it again. If it just stuck
> without management, it will create a queue which can cause, that in the
> queue there will be 2 similar inserts/deletes which can again cause issues
> (like with the primary key I mentioned before).
>

This shouldn't be a problem, as the previous transaction is still holding
the locks and the new transaction is blocked behind it. Outside of sync
replication, this can happen today too, with glitches/timeouts/retries
between the client and the server. Am I missing something?


So the process should be in this case:
>
> - DBA receives information, that write operations stuck (DBA in
> coordination with the infrastructure team disconnects all clients and
> prevent new ones to create a new connection).
>
You can monitor pg_stat_activity for the SYNC_REP_WAIT_FLUSH wait type to
detect this.


> - DBA will recognize, that there is an issue in communication between the
> primary and the sync replica (caused the issue with the propagation of
> commits)
> - DBA will see that there are some commits that are in the "stuck state"
> - DBA removes these stuck commits. Note: Because the client never received
> a confirmation about the successful commit -> changes in the DB client
> tried to perform can't be considered as successful.
>

You should consider these as in-doubt transactions, and the client should
retry. Again, this can happen in a normal server crash case too; for example,
a transaction commits on the server and the server crashes before sending
the acknowledgement. The client should know how to handle these cases.

- DBA and infrastructure team restore the communication between server
> nodes to be able to propagate commits from the primary node to sync replica.
> - DBA and infrastructure team allows new connections to the database
>
> This approach would require external monitoring and alerting, but I would
> say, that this is an acceptable solution. Would your patch be able to
> perform that?
>
My patch handles ignoring the cancel events. I ended up keeping the other
logic (blocking superuser connections in the client_authentication_hook).

There is a third problem that I didn't talk about in this thread, where the
async clients (including logical decoding and replication clients) can get
ahead of the new primary and there is no easy way to undo those changes.
For this problem, we need to implement some protocol in the WAL sender
where it sends the log to the consumer only up to the flush LSN of the
standby/quorum replicas. This is something I am working on right now.
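
The rough shape of that change in the physical walsender could be something
like the following (a sketch only; GetQuorumFlushRecPtr() is an imaginary
helper returning the lowest flush LSN acknowledged by the sync quorum, and
cap_async_at_quorum_flush an imaginary GUC):

    /* Sketch, in XLogSendPhysical(): never stream WAL to consumers beyond
     * what the synchronous quorum has already flushed. */
    SendRqstPtr = GetFlushRecPtr();
    if (cap_async_at_quorum_flush)
        SendRqstPtr = Min(SendRqstPtr, GetQuorumFlushRecPtr());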


Re: Synchronous commit behavior during network outage

2021-04-20 Thread SATYANARAYANA NARLAPURAM
One idea here is to make the backend ignore query cancellation/backend
termination while waiting for the synchronous commit ACK. That way the client
never reads data that was never flushed remotely. The problem with this
approach is that your backends get stuck until the commit log record is
flushed on the remote side. Also, a client can see data not flushed
remotely if the server crashes and comes back online. You can prevent the
latter case by performing a SyncRepWaitForLSN before opening up connections
to non-superusers. I have a working prototype of this logic; if there
is enough interest I can post the patch.
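
Conceptually, the prototype's change is confined to the cancellation handling
in SyncRepWaitForLSN()'s wait loop. A simplified sketch (ignore_sync_rep_cancels
is an imaginary GUC, and the existing die/cancel branches are collapsed into
one here):

    /* Sketch, inside SyncRepWaitForLSN()'s wait loop. */
    if (ProcDiePending || QueryCancelPending)
    {
        if (!ignore_sync_rep_cancels)
        {
            /* Existing behavior: warn, stop waiting, drop the connection. */
            ereport(WARNING,
                    (errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command")));
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            break;
        }

        /*
         * Proposed behavior: keep waiting, so the client can never observe
         * a commit that has not been flushed on the synchronous standby.
         */
    }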





On Tue, Apr 20, 2021 at 11:25 AM Ondřej Žižka 
wrote:

> I am sorry, I forgot mentioned, that in the second situation I added a
> primary key to the table.
>
> Ondrej
>
>
> On 20/04/2021 18:49, Ondřej Žižka wrote:
> > Hello Aleksander,
> >
> > Thank you for the reaction. This was tested on version 13.2.
> >
> > There are also other possible situations with the same setup and
> > similar issue:
> >
> > -
> > When the background process on server fails
> >
> > On postgresql1:
> > tecmint=# select * from a; --> LAN on sync replica is OK
> >  id
> > 
> >   1
> > (1 row)
> >
> > tecmint=# insert into a values (2); ---> LAN on sync replica is DOWN
> > and insert is waiting. During this time kill the background process on
> > the PostgreSQL server for this session
> > WARNING:  canceling the wait for synchronous replication and
> > terminating connection due to administrator command
> > DETAIL:  The transaction has already committed locally, but might not
> > have been replicated to the standby.
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Succeeded.
> > tecmint=# select * from a;
> >  id
> > 
> >   1
> >   2
> > (2 rows)
> >
> > tecmint=# ---> LAN on sync replica is still DOWN
> >
> > The potgres session will restore after the background process failed.
> > When you run select on master, it still looks OK. But data is still
> > not replicated on the sync replica. If we lost the master now, we
> > would lost this data as well.
> >
> > **
> > Another case
> > **
> >
> > Kill the client process.
> >
> > tecmint=# select * from a;
> >  id
> > 
> >   1
> >   2
> >   3
> > (3 rows)
> > tecmint=#--> Disconnect the sync replica now. LAN on
> > replica is DOWN
> > tecmint=# insert into a values (4); --> Kill the client process
> > Terminated
> > xzizka@service-vm:~$ psql -U postgres -h 192.168.122.6 -p 5432 -d
> tecmint
> > Password for user postgres:
> > psql (13.2 (Debian 13.2-1.pgdg100+1))
> > Type "help" for help.
> >
> > tecmint=# select * from a;
> >  id
> > 
> >   1
> >   2
> >   3
> > (3 rows)
> >
> > tecmint=# --> Number 4 is not there. Now switch the LAN on sync
> > replica ON.
> >
> > --
> >
> > Result from sync replica after the LAN is again UP:
> > tecmint=# select * from a;
> >  id
> > 
> >   1
> >   2
> >   3
> >   4
> > (4 rows)
> >
> >
> > In this situation, try to insert the number 4 again to the table.
> >
> > tecmint=# select * from a;
> >  id
> > 
> >   1
> >   2
> >   3
> > (3 rows)
> >
> > tecmint=# insert into a values (4);
> > ERROR:  duplicate key value violates unique constraint "a_pkey"
> > DETAIL:  Key (id)=(4) already exists.
> > tecmint=#
> >
> > This is really strange... Application can be confused, It is not
> > possible to insert record, which is not there, but some systems which
> > use the sync node as a read replica maybe already read that record
> > from the sync replica database and done some steps which can cause
> > issues and can be hard to track.
> >
> > If I say, that it would be hard to send the CTRL+C to the database
> > from the client, I need to say, that the 2 situations I described here
> > can happen in real.
> >
> > What do you think?
> >
> > Thank you and regards
> > Ondrej
> >
> > On 20/04/2021 17:23, Aleksander Alekseev wrote:
> >> Hi Ondřej,
> >>
> >> Thanks for the report. It seems to be a clear violation of what is
> >> promised in the docs. Although it's unlikely that someone implemented
> >> an application which deals with important data and "pressed Ctr+C" as
> >> it's done in psql. So this might be not such a critical issue after
> >> all. BTW what version of PostgreSQL are you using?
> >>
> >>
> >> On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka
> >>  wrote:
> >>> Hello all,
> >>> I would like to know your opinion on the following behaviour I see
> >>> for PostgreSQL setup with synchronous replication.
> >>>
> >>> This behaviour happens in a special use case. In this use case,
> >>> there are 2 synchronous replicas with the following config (truncated):
> >>>
> >>> - 2 nodes
> >>> - synchronous_standby_names='*'
> >>> - synchronous_commit=remote_apply
> >>>
> >>>
> >>> With 

Re: when the startup process doesn't

2021-04-20 Thread SATYANARAYANA NARLAPURAM
+1 for both log messages and allowing connections. I believe these two
complement each other.

In the cloud world, we oftentimes want to monitor the progress of the
recovery without connecting to the server, as the operators don't
necessarily have the required permissions to connect and query. Secondly,
having this information in the log helps in going back in time and
understanding where Postgres spent time during recovery.

The ability to query the server provides real-time information and comes in
handy.

Thanks,
Satya



On Mon, Apr 19, 2021 at 10:55 AM Robert Haas  wrote:

> Hi,
>
> I've noticed that customers not infrequently complain that they start
> postgres and then the system doesn't come up for a while and they have
> no idea what's going on and are (understandably) worried. There are
> probably a number of reasons why this can happen, but the ones that
> seem to come up most often in my experience are (1) SyncDataDirectory
> takes a long time, (b) ResetUnloggedRelations takes a long time, and
> (c) there's a lot of WAL to apply so that takes a long time. It's
> possible to distinguish this last case from the other two by looking
> at the output of 'ps', but that's not super-convenient if your normal
> method of access to the server is via libpq, and it only works if you
> are monitoring it as it's happening rather than looking at the logs
> after-the-fact. I am not sure there's any real way to distinguish the
> other two cases without using strace or gdb or similar.
>
> It seems to me that we could do better. One approach would be to try
> to issue a log message periodically - maybe once per minute, or some
> configurable interval, e.g. perhaps add messages something like this:
>
> LOG:  still syncing data directory, elapsed time %ld.%03d ms, current path
> %s
> LOG:  data directory sync complete after %ld.%03d ms
> LOG:  still resetting unlogged relations, elapsed time %ld.%03d ms,
> current path %s
> LOG:  unlogged relations reset after %ld.%03d ms
> LOG:  still performing crash recovery, elapsed time %ld.%03d ms,
> current LSN %08X/%08X
>
> We already have a message when redo is complete, so there's no need
> for another one. The implementation here doesn't seem too hard either:
> the startup process would set a timer, when the timer expires the
> signal handler sets a flag, at a convenient point we notice the flag
> is set and responding by printing a message and clearing the flag.
>
> Another possible approach would be to accept connections for
> monitoring purposes even during crash recovery. We can't allow access
> to any database at that point, since the system might not be
> consistent, but we could allow something like a replication connection
> (the non-database-associated variant). Maybe it would be precisely a
> replication connection and we'd just refuse all but a subset of
> commands, or maybe it would be some other kinds of thing. But either
> way you'd be able to issue a command in some mini-language saying "so,
> tell me how startup is going" and it would reply with a result set of
> some kind.
>
> If I had to pick one of these two ideas, I'd pick the one the
> log-based solution, since it seems easier to access and simplifies
> retrospective analysis, but I suspect SQL access would be quite useful
> for some users too, especially in cloud environments where "just log
> into the machine and have a look" is not an option.
>
> Thoughts?
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>
>


why pg_walfile_name() cannot be executed during recovery?

2021-04-02 Thread SATYANARAYANA NARLAPURAM
Hello Hackers,

Why can't pg_walfile_name() be executed during recovery? What is the best
way for me to get the current timeline and/or the file being recovered on
the standby using a Postgres query? I know I can get it via the process
title but don't want to go that route.

Thanks,
Satya


Re: Reduce/eliminate the impact of FPW

2020-08-03 Thread SATYANARAYANA NARLAPURAM
Increasing checkpoint_timeout helps reduce the amount of WAL written to
disk. This has several benefits, like a reduced amount of WAL I/O, less
archival load on the system, and less network traffic to the standby
replicas. However, it increases the crash recovery time and impacts server
availability. Investing in parallel recovery for Postgres would help reduce
the crash recovery time and allow us to raise the checkpoint interval to a
much higher value. This idea is orthogonal to the double-write improvements
mentioned in the thread. Thomas Munro has a patch for page prefetching
during recovery, which speeds up recovery if the working set doesn't fit in
memory; we also need parallel recovery to replay huge amounts of WAL when
the working set is in memory.

Thanks,
Satya

On Mon, Aug 3, 2020 at 11:14 AM Daniel Wood  wrote:

>
> > On 08/03/2020 8:26 AM Robert Haas  wrote:
> ...
> > I think this is what's called a double-write buffer, or what was tried
> > some years ago under that name.  A significant problem is that you
> > have to fsync() the double-write buffer before you can write the WAL.
>
> I don't think it does need to be fsync'ed before the WAL.  If the
> log record has a FPW reference beyond the physical log EOF then we
> don't need to restore the before image because we haven't yet did
> the dirty page write from the cache.  The before image only needs
> to be flushed before the dirty page write.  Usually this will have
> already done.
>
> > ... But for short transactions, such as those
> > performed by pgbench, you'd probably end up with a lot of cases where
> > you had to write 3 pages instead of 2, and not only that, but the
> > writes have to be consecutive rather than simultaneous, and to
> > different parts of the disk rather than sequential. That would likely
> > suck a lot.
>
> Wherever you write the before images, in the WAL or into a separate
> file you would write the same number of pages.  I don't understand
> the 3 pages vs 2 pages comment.
>
> And, "different parts of the disk"???  I wouldn't enable the feature
> on spinning media unless I had a dedicated disk for it.
>
> NOTE:
> If the 90's Informix called this the physical log.  Restoring at
> crash time restored physical consistency after which redo/undo
> recovery achieved logical consistency.  From their doc's:
> "If the before-image of a modified page is stored in the physical-log
> buffer, it is eventually flushed from the physical-log buffer to the
> physical log on disk. The before-image of the page plays a critical role in
> restoring data and fast recovery. For more details, see Physical-Log
> Buffer."
>
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
>
>
>


Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-22 Thread SATYANARAYANA NARLAPURAM
+1 to this feature; I have been thinking about it for some time. There
are several use cases for marking a database read-only (no transaction log
generation). Some examples in a hosted service scenario are: 1/ the customer
runs out of storage space, 2/ upgrading the server to a different major
version (the current server can be set to read-only, the new one built, and
then DNS switched over), 3/ the user wants to force a database to be
read-only and not accept writes, maybe to import/export a database.

Thanks,
Satya

On Wed, Jul 22, 2020 at 3:04 PM Soumyadeep Chakraborty <
soumyadeep2...@gmail.com> wrote:

> Hello,
>
> I think we should really term this feature, as it stands, as a means to
> solely stop WAL writes from happening.
>
> The feature doesn't truly make the system read-only (e.g. dirty buffer
> flushes may succeed the system being put into a read-only state), which
> does make it confusing to a degree.
>
> Ideally, if we were to have a read-only system, we should be able to run
> pg_checksums on it, or take file-system snapshots etc, without the need
> to shut down the cluster. It would also enable an interesting use case:
> we should also be able to do a live upgrade on any running cluster and
> entertain read-only queries at the same time, given that all the
> cluster's files will be immutable?
>
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;
>
> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.
>
> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.
>
> Regards,
>
> Soumyadeep (VMware)
>
>
>


RE: [HACKERS] Client Connection redirection support for PostgreSQL

2018-03-05 Thread Satyanarayana Narlapuram
Please see the attached patch with the comments.

Changes in the patch:
A client-side PGREDIRECTLIMIT parameter has been introduced to control 
the maximum number of retries. 
BE_v3.1 sends a ProtocolNegotiation message. FE_v3.1 downgrades to v3.0 
upon receipt of this message.
FE falls back to v3.0 if 3.1 is not supported by the server.


>> I hadn't really thought deeply about whether redirection should 
happen before or after authentication.  For the most part, before seems better, 
because it seems a bit silly to force people to authenticate just so that you 
can tell them to go someplace else.  Also, that would lead to double 
authentication,which might for example result in multiple password 
prompts, which users might either dislike or find confusing.  

Yes, redirection before authentication would avoid multiple password 
prompts.

Thanks,
Satya


redirection_v2.patch
Description: redirection_v2.patch


Re: zheap: a new storage format for PostgreSQL

2018-03-01 Thread Satyanarayana Narlapuram

>> Cons

>> ---
>> 1. Deletes can be somewhat expensive.
>> 2. Transaction aborts will be expensive.
>> 3. Updates that update most of the indexed columns can be somewhat expensive.

Given transaction aborts are expensive, is there any impact on the crash 
recovery? Did you perform any tests on the recovery duration?

Thanks,
Satya





From: Amit Kapila 
Sent: Thursday, March 1, 2018 7:05:12 AM
To: PostgreSQL Hackers
Subject: Re: zheap: a new storage format for PostgreSQL

On Thu, Mar 1, 2018 at 7:39 PM, Amit Kapila  wrote:
>
> Preliminary performance results
> ---
>

I have not used plain text mode in my previous email due to which
performance results might not be clear in some email clients, so
attaching it again in the form of pdf.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


RE: [HACKERS] Client Connection redirection support for PostgreSQL

2018-02-12 Thread Satyanarayana Narlapuram
I simplified the patch and for now just allowed one server. Please find the 
attached patches, and the commit message.

Thanks,
Satya

-Original Message-
From: Robert Haas <robertmh...@gmail.com> 
Sent: Monday, November 6, 2017 5:56 AM
To: Craig Ringer <cr...@2ndquadrant.com>
Cc: Satyanarayana Narlapuram <satyanarayana.narlapu...@microsoft.com>; 
PostgreSQL-development <pgsql-hack...@postgresql.org>
Subject: Re: [HACKERS] Client Connection redirection support for PostgreSQL

On Thu, Nov 2, 2017 at 4:33 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> Add the ability to the PostgreSQL server instance to route the 
>> traffic to a different server instance based on the rules defined in 
>> server’s pg_bha.conf configuration file. At a high level this enables 
>> offloading the user requests to a different server instance based on 
>> the rules defined in the pg_hba.conf configuration file.
>
> pg_hba.conf is "host based access [control]" . I'm not sure it's 
> really the right place.

Well, we could invent someplace else, but I'm not sure I quite see the point 
(full disclosure: I suggested the idea of doing this via pg_hba.conf in an 
off-list discussion).

I do think the functionality is useful, for the same reasons that HTTP 
redirects are useful.  For example, let's say you have all of your databases 
for various clients on a single instance.  Then, one client starts using a lot 
more resources, so you want to move that client to a separate instance on 
another VM.  You can set up logical replication to replicate all of the data to 
the new instance, and then add a pg_hba.conf entry to redirect connections to 
that database to the new master (this would be even smoother if we had 
multi-master replication in core).  So now that client is moved off to another 
machine in a completely client-transparent way.  I think that's pretty cool.

> When this has come up before, one of the issues has been determining 
> what exactly should constitute "read only" vs "read write" for the 
> purposes of redirecting work.

Yes, that needs some thought.

> Backends used just for a redirect would be pretty expensive though.

Not as expensive as proxying the whole connection, as pgpool and other systems 
do today.  I think the in-core use of this redirect functionality is useful, 
but I think the real win would be optionally using it in pgpool and pgbouncer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


redirection_be.patch
Description: redirection_be.patch


redirection_fe_doc.patch
Description: redirection_fe_doc.patch
Add the ability for a PostgreSQL server instance to route traffic to a 
different server instance based on the rules defined in the server's pg_hba.conf 
configuration file. At a high level this enables offloading user requests 
to a different server instance based on the rules defined in the pg_hba.conf 
configuration file. Some of the interesting scenarios this enables include, but 
are not limited to: rerouting traffic based on the client hosts, users, database, 
etc. specified; redirecting read-only query traffic to the hot standby 
replicas; and multi-master scenarios.

The rules to route the traffic will be provided in the pg_hba.conf file. A new 
optional authorization option called "redirect" was added to the host-based 
authorization (HBA) entry as part of this change. If "redirect" is specified as 
the authorization method, the server will continue to parse the HBA entry to 
obtain the target server name and port. The following is an example of an HBA 
entry with this change:

host    all             all             0.0.0.0/32            redirect        <target_host>, <target_port>

If any other authorization method is specified, the server will work as before 
this change.

MESSAGE FLOW

1. Client connects to the server, and server accepts the connections
2. Client sends the startup message
3. Server looks at the rules configured in the pg_hba.conf file and
* If the rule matches redirection
i. Send a special message with the <target_host>, <target_port> described above
ii. Server disconnects
* If the rule doesn't have "redirect" specified as the 
authorization mechanism
i. Server proceeds in the existing code path and sends 
auth request

4. Client obtains the <target_host>, <target_port> tuple and attempts to 
connect to it. If the client could not connect to the server instance, it 
reports the login failure message.

BACKWARD COMPATIBILITY

The pgwire protocol