RE: Multi-Master Logical Replication

2022-06-14 Thread r.takahash...@fujitsu.com
Hi Kuroda san,


> I think even if LRG is implemented as a contrib module or an extension,
> it will depend deeply on the subscription option "origin" proposed in [1].
> So LRG cannot be used with older versions, only PG16 or later.

Sorry, I misunderstood.
I understand now.

Regards,
Ryohei Takahashi


RE: Multi-Master Logical Replication

2022-06-14 Thread kuroda.hay...@fujitsu.com
Dear Takahashi-san,

Thanks for the feedback!

> > I don't know if it requires the kind of code you are thinking but I
> > agree that it is worth considering implementing it as an extension.
> 
> I think another advantage of implementing it as an extension is that users
> could install the extension on older Postgres.
> 
> As mentioned in a previous email, one use case of n-way replication is
> migration from older Postgres to newer Postgres.
> 
> If we implement it as an extension, users could use n-way replication for
> migration from PG10 to PG16.
>

I think even if LRG is implemented as a contrib module or an extension,
it will depend deeply on the subscription option "origin" proposed in [1].
So LRG cannot be used with older versions, only PG16 or later.

[1]: 
https://www.postgresql.org/message-id/caldanm3pt1cpeb3y9pe7ff91gzvpnxr91y4ztwiw6h+gayg...@mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Multi-Master Logical Replication

2022-06-13 Thread r.takahash...@fujitsu.com
Hi,


In addition to the use cases mentioned above, some users want to use n-way
replication of a partial database.

The following is a typical use case.

* There are several data centers.
  (ex. Japan and India)
* The database in each data center has its own unique data.
  (ex. the database in Japan has the data related to Japan)
* There is some common data.
  (ex. the shipment data from Japan to India should be stored in both databases)
* To replicate common data, users want to use n-way replication.


The current POC patch seems to support only n-way replication of the entire
database, but I think we should support n-way replication of a partial
database to achieve the above use case.
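
For illustration, a minimal sketch of how this might look with the proposed
interfaces (the local_only subscription option is from Vignesh's WIP patch
mentioned later in this thread; the table and host names are hypothetical):

-- On the Japan node: publish only the shared table, not the Japan-only ones.
CREATE PUBLICATION pub_jp_common FOR TABLE shipment;

-- On the India node: subscribe, replicating only locally-originated changes
-- to avoid the circular replication problem.
CREATE SUBSCRIPTION sub_from_jp
    CONNECTION 'host=jp-dc dbname=app'
    PUBLICATION pub_jp_common
    WITH (local_only = on);

-- The mirror-image publication/subscription pair would be created in the
-- other direction (India publishes, Japan subscribes).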


> I don't know if it requires the kind of code you are thinking but I
> agree that it is worth considering implementing it as an extension.

I think another advantage of implementing it as an extension is that users
could install the extension on older Postgres.

As mentioned in a previous email, one use case of n-way replication is
migration from older Postgres to newer Postgres.

If we implement it as an extension, users could use n-way replication for
migration from PG10 to PG16.


Regards,
Ryohei Takahashi


Re: Multi-Master Logical Replication

2022-06-10 Thread Amit Kapila
On Fri, Jun 10, 2022 at 12:40 PM Bharath Rupireddy wrote:
>
> On Fri, Jun 10, 2022 at 9:54 AM Amit Kapila  wrote:
> > >
> > > 1. Are you proposing to use logical replication subscribers to be in
> > > a sync quorum? In other words, in an N-master, M-node (M >= N)
> > > configuration, will each master be part of the sync quorum in the
> > > other master?
> > >
> >
> > What exactly do you mean by sync quorum here? If you mean to say that
> > each master node will be allowed to wait till the commit happens on
> > all other nodes similar to how our current synchronous_commit and
> > synchronous_standby_names work, then yes, it could be achieved. I
> > think the patch currently doesn't support this but it could be
> > extended to support the same. Basically, one can be allowed to set up
> > async and sync nodes in combination depending on its use case.
>
> Yes, I meant each master node will be in synchronous_commit with the
> others. In this setup, do you see any problems such as deadlocks if
> write-txns on the same table occur on all the masters at the same time?
>

I have not tried, but I don't see in theory why this should happen
unless someone tries to update a similar set of rows in conflicting
order, similar to how it can happen on a single node. If so, it will
error out and one of the conflicting transactions needs to be retried.
IOW, I think the behavior should be the same as on a single node. Do
you have any particular examples in mind?
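
(For reference, the single-node conflicting-order case alluded to here
might look like the following, with a hypothetical table t; the same
retry logic would apply on a multi-master node:)

-- Session 1:
BEGIN;
UPDATE t SET v = v + 1 WHERE id = 1;

-- Session 2:
BEGIN;
UPDATE t SET v = v + 1 WHERE id = 2;

-- Session 1 (blocks, waiting for session 2's row lock):
UPDATE t SET v = v + 1 WHERE id = 2;

-- Session 2 (the lock cycle is detected; this transaction fails with
-- "ERROR: deadlock detected" and must be retried):
UPDATE t SET v = v + 1 WHERE id = 1;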

> If the master nodes are not in synchronous_commit, i.e., connected in
> asynchronous mode, don't we have data synchronization problems because of
> logical decoding and replication latencies? Say I do a bulk insert into
> a table foo on master 1. Imagine there's a latency with which the
> inserted rows get replicated to master 2, and meanwhile I do an update on
> the same table foo on master 2 based on the rows inserted on master 1
> - master 2 doesn't have all the rows inserted on master 1 - how does
> the solution proposed here address this problem?
>

I don't think that is possible even in theory, and none of the other
n-way replication solutions I have read about seem to claim
something like that. It is quite possible that I am missing something
here, but why do we want such a requirement from asynchronous
replication? I think in such cases, even for load balancing, we can
distribute reads where eventually consistent data is acceptable, and
writes on separate tables/partitions can be distributed.

I haven't responded to some of your other points as they are
associated with the above theory.

>
> > > 4. Can the design proposed here be implemented as an extension instead
> > > of a core postgres solution?
> > >
> >
> > Yes, I think it could be. I think this proposal introduces some system
> > tables, so we need to analyze what to do about that. BTW, do you see any
> > advantages to doing so?
>
> IMO, yes, doing it the extension way has many advantages - it doesn't
> have to touch the core part of postgres, usability will be good -
> whoever requires this solution will use it, and we can avoid code chunks
> within the core such as if (feature_enabled) { do foo } else { do bar }.
> Since this feature is based on core postgres logical replication
> infrastructure, I think it's worth implementing it as an extension
> first, maybe the extension as a PoC?
>

I don't know if it requires the kind of code you are thinking but I
agree that it is worth considering implementing it as an extension.

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-06-10 Thread Bharath Rupireddy
On Fri, Jun 10, 2022 at 9:54 AM Amit Kapila  wrote:
>
> On Thu, Jun 9, 2022 at 6:04 PM Bharath Rupireddy wrote:
> >
> > On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
> > >
> > > MULTI-MASTER LOGICAL REPLICATION
> > >
> > > 1.0 BACKGROUND
> > >
> > > Let’s assume that a user wishes to set up a multi-master environment
> > > so that a set of PostgreSQL instances (nodes) use logical replication
> > > to share tables with every other node in the set.
> > >
> > > We define this as a multi-master logical replication (MMLR) node-set.
> > >
> > > 
> > >
> > > 1.1 ADVANTAGES OF MMLR
> > >
> > > - Increases write scalability (e.g., all nodes can write arbitrary data).
> > > - Allows load balancing
> > > - Allows rolling updates of nodes (e.g., logical replication works
> > > between different major versions of PostgreSQL).
> > > - Improves the availability of the system (e.g., no single point of 
> > > failure)
> > > - Improves performance (e.g., lower latencies for geographically local 
> > > nodes)
> >
> > Thanks for working on this proposal. I have a few high-level thoughts,
> > please bear with me if I repeat any of them:
> >
> > 1. Are you proposing to use logical replication subscribers to be in
> > a sync quorum? In other words, in an N-master, M-node (M >= N)
> > configuration, will each master be part of the sync quorum in the
> > other master?
> >
>
> What exactly do you mean by sync quorum here? If you mean to say that
> each master node will be allowed to wait till the commit happens on
> all other nodes similar to how our current synchronous_commit and
> synchronous_standby_names work, then yes, it could be achieved. I
> think the patch currently doesn't support this but it could be
> extended to support the same. Basically, one can be allowed to set up
> async and sync nodes in combination depending on its use case.

Yes, I meant each master node will be in synchronous_commit with the
others. In this setup, do you see any problems such as deadlocks if
write-txns on the same table occur on all the masters at the same time?

If the master nodes are not in synchronous_commit, i.e., connected in
asynchronous mode, don't we have data synchronization problems because of
logical decoding and replication latencies? Say I do a bulk insert into
a table foo on master 1. Imagine there's a latency with which the
inserted rows get replicated to master 2, and meanwhile I do an update on
the same table foo on master 2 based on the rows inserted on master 1
- master 2 doesn't have all the rows inserted on master 1 - how does
the solution proposed here address this problem?

> > 3. What if "some" postgres provider assures an SLA of very few seconds
> > for failovers in a typical HA setup with a primary and multiple sync and
> > async standbys? In this context, where does the multi-master
> > architecture sit in the broad range of postgres use-cases?
> >
>
> I think this is one of the primary use cases of the n-way logical
> replication solution, wherein there shouldn't be any noticeable wait
> time when one or more of the nodes go down. All nodes have the
> capability to allow writes, so the app just needs to connect to another
> node. I feel some analysis is required to find out and state exactly
> how users can achieve this, but it seems doable. The other use cases
> are discussed in this thread and are summarized in emails [1][2].

IIUC, the main goals of this feature are zero failover time and
lower write latencies, right? How is it going to solve the data
synchronization problem (stated above) with the master nodes connected
to each other in asynchronous mode?

> > 4. Can the design proposed here be implemented as an extension instead
> > of a core postgres solution?
> >
>
> Yes, I think it could be. I think this proposal introduces some system
> tables, so we need to analyze what to do about that. BTW, do you see any
> advantages to doing so?

IMO, yes, doing it the extension way has many advantages - it doesn't
have to touch the core part of postgres, usability will be good -
whoever requires this solution will use it, and we can avoid code chunks
within the core such as if (feature_enabled) { do foo } else { do bar }.
Since this feature is based on core postgres logical replication
infrastructure, I think it's worth implementing it as an extension
first, maybe the extension as a PoC?

> > 5. Why should one use logical replication for multi master
> > replication? If logical replication is used, isn't it going to be
> > something like logically decode and replicate every WAL record from
> > one master to all other masters? Instead, can't it be achieved via
> > streaming/physical replication?
> >
>
> The failover/downtime will be much lower in a solution based on
> logical replication because all nodes are master nodes and users will
> be allowed to write on other nodes instead of waiting for the physical
> standby to become writeable.

I don't think that's a correct statement unless the design proposed
here addresses the data synchronization problem stated above.

Re: Multi-Master Logical Replication

2022-06-09 Thread Amit Kapila
On Thu, Jun 9, 2022 at 6:04 PM Bharath Rupireddy wrote:
>
> On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
> >
> > MULTI-MASTER LOGICAL REPLICATION
> >
> > 1.0 BACKGROUND
> >
> > Let’s assume that a user wishes to set up a multi-master environment
> > so that a set of PostgreSQL instances (nodes) use logical replication
> > to share tables with every other node in the set.
> >
> > We define this as a multi-master logical replication (MMLR) node-set.
> >
> > 
> >
> > 1.1 ADVANTAGES OF MMLR
> >
> > - Increases write scalability (e.g., all nodes can write arbitrary data).
> > - Allows load balancing
> > - Allows rolling updates of nodes (e.g., logical replication works
> > between different major versions of PostgreSQL).
> > - Improves the availability of the system (e.g., no single point of failure)
> > - Improves performance (e.g., lower latencies for geographically local 
> > nodes)
>
> Thanks for working on this proposal. I have a few high-level thoughts,
> please bear with me if I repeat any of them:
>
> 1. Are you proposing to use logical replication subscribers to be in
> a sync quorum? In other words, in an N-master, M-node (M >= N)
> configuration, will each master be part of the sync quorum in the
> other master?
>

What exactly do you mean by sync quorum here? If you mean to say that
each master node will be allowed to wait till the commit happens on
all other nodes similar to how our current synchronous_commit and
synchronous_standby_names work, then yes, it could be achieved. I
think the patch currently doesn't support this but it could be
extended to support the same. Basically, one can be allowed to set up
async and sync nodes in combination depending on its use case.
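
As a sketch, the synchronous variant might be configured like this on
one node of a three-node group (the subscription names are hypothetical;
a logical replication apply worker reports its subscription name as
application_name by default, so it can be listed in
synchronous_standby_names):

# postgresql.conf on node 1
synchronous_commit = on
# Wait for the subscriptions of both other masters before commit returns:
synchronous_standby_names = 'FIRST 2 (sub_node2, sub_node3)'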

> 2. Is there any mention of reducing the latencies that logical
> replication will have generally (initial table sync and
> after-caught-up decoding and replication latencies)?
>

No, this won't change the underlying replication mechanism.

> 3. What if "some" postgres provider assures an SLA of very few seconds
> for failovers in a typical HA setup with a primary and multiple sync and
> async standbys? In this context, where does the multi-master
> architecture sit in the broad range of postgres use-cases?
>

I think this is one of the primary use cases of the n-way logical
replication solution, wherein there shouldn't be any noticeable wait
time when one or more of the nodes go down. All nodes have the
capability to allow writes, so the app just needs to connect to another
node. I feel some analysis is required to find out and state exactly
how users can achieve this, but it seems doable. The other use cases
are discussed in this thread and are summarized in emails [1][2].

> 4. Can the design proposed here be implemented as an extension instead
> of a core postgres solution?
>

Yes, I think it could be. I think this proposal introduces some system
tables, so we need to analyze what to do about that. BTW, do you see any
advantages to doing so?

> 5. Why should one use logical replication for multi master
> replication? If logical replication is used, isn't it going to be
> something like logically decode and replicate every WAL record from
> one master to all other masters? Instead, can't it be achieved via
> streaming/physical replication?
>

The failover/downtime will be much lower in a solution based on
logical replication because all nodes are master nodes and users will
be allowed to write on other nodes instead of waiting for the physical
standby to become writeable. It will also allow more localized
database access for geographically distributed databases; see the
email for further details on this [3]. Also, the benefiting scenarios
are the same as the usual quoted benefits of logical replication, e.g.,
version independence, getting selective/required data, etc.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BZP9c6q1BQWSQC__w09WQ-qGt22dTmajDmTxR_CAUyJQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/TYAPR01MB58660FCFEC7633E15106C94BF5A29%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] - 
https://www.postgresql.org/message-id/CAA4eK1%2BDRHCNLongM0stsVBY01S-s%3DEa_yjBFnv_Uz3m3Hky-w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-06-09 Thread Bharath Rupireddy
On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
>
> MULTI-MASTER LOGICAL REPLICATION
>
> 1.0 BACKGROUND
>
> Let’s assume that a user wishes to set up a multi-master environment
> so that a set of PostgreSQL instances (nodes) use logical replication
> to share tables with every other node in the set.
>
> We define this as a multi-master logical replication (MMLR) node-set.
>
> 
>
> 1.1 ADVANTAGES OF MMLR
>
> - Increases write scalability (e.g., all nodes can write arbitrary data).
> - Allows load balancing
> - Allows rolling updates of nodes (e.g., logical replication works
> between different major versions of PostgreSQL).
> - Improves the availability of the system (e.g., no single point of failure)
> - Improves performance (e.g., lower latencies for geographically local nodes)

Thanks for working on this proposal. I have a few high-level thoughts,
please bear with me if I repeat any of them:

1. Are you proposing to use logical replication subscribers to be in
a sync quorum? In other words, in an N-master, M-node (M >= N)
configuration, will each master be part of the sync quorum in the
other master?
2. Is there any mention of reducing the latencies that logical
replication will have generally (initial table sync and
after-caught-up decoding and replication latencies)?
3. What if "some" postgres provider assures an SLA of very few seconds
for failovers in a typical HA setup with a primary and multiple sync and
async standbys? In this context, where does the multi-master
architecture sit in the broad range of postgres use-cases?
4. Can the design proposed here be implemented as an extension instead
of a core postgres solution?
5. Why should one use logical replication for multi master
replication? If logical replication is used, isn't it going to be
something like logically decode and replicate every WAL record from
one master to all other masters? Instead, can't it be achieved via
streaming/physical replication?

Regards,
Bharath Rupireddy.




RE: Multi-Master Logical Replication

2022-06-06 Thread kuroda.hay...@fujitsu.com
Dear hackers,

I found another use-case for LRG. It might be helpful for migration.


LRG for migration
--
LRG may be helpful for machine migration, OS upgrades,
or upgrading PostgreSQL itself.

Assume that users want to migrate a database to another environment,
e.g., PG16 on RHEL7 to PG18 on RHEL8.
Users must copy all data to the new server and catch up on all changes.
In this case streaming replication cannot be used
because it requires the same OS and the same PostgreSQL major version.
Moreover, it is desirable to be able to return to the original
environment at any time
in case of application or other environmental deficiencies.


Operation steps with LRG
--

LRG is appropriate for this situation. The following is the workflow that
users must follow:

1. Copy the table definitions to the newer node (PG18) via pg_dump/pg_restore
2. Execute lrg_create() in the older node (PG16)
3. Execute lrg_node_attach() in PG18

=== data will be shared here ===

4. Change the connection of the user application to PG18
5. Check whether any ERRORs are raised. If so,
  users can switch the connection back to PG16.
6. Remove the created node group if the application works well.

These operations may reduce system downtime
due to incompatibilities associated with version upgrades.
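
For illustration, steps 1-3 might look as follows (the lrg_* function
names are from this proposal, but the argument lists shown here are only
assumptions, and the host names are hypothetical):

-- Step 1, from a shell: copy only the table definitions to the new node.
--   pg_dump --schema-only -h pg16-host app | psql -h pg18-host app

-- Step 2, on the older node (PG16):
SELECT lrg_create('migration_group');

-- Step 3, on the newer node (PG18):
SELECT lrg_node_attach('migration_group', 'host=pg16-host dbname=app');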


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Multi-Master Logical Replication

2022-06-04 Thread Amit Kapila
On Fri, Jun 3, 2022 at 7:12 AM Bruce Momjian  wrote:
>
> On Thu, Jun  2, 2022 at 05:12:49PM +1000, Peter Smith wrote:
> > On Thu, Jun 2, 2022 at 12:03 AM Bruce Momjian  wrote:
> > >
> > > On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
> > ...
> >
> > > My big point is that you should not be showing up with a patch but
> > > rather have these discussions to get agreement that this is the
> > > direction the community wants to go.
> >
> > The purpose of posting the POC patch was certainly not to present a
> > fait accompli design/implementation.
> >
> > We wanted to solicit some community feedback about the desirability of
> > the feature, but because LRG is complicated to describe we felt that
> > having a basic functional POC might help to better understand the
> > proposal. Also, we thought the ability to experiment with the proposed
> > API could help people to decide whether LRG is something worth
> > pursuing or not.
>
> I don't think the POC is helping, and I am not sure we really want to
> support this style of architecture due to its complexity vs other
> options.
>

None of the other options discussed on this thread appear to be
better or to serve the intent. What other options do you have in mind,
and how are they simpler than this? As far as I can understand, this
provides a simple way to set up n-way replication among nodes.

I see that other databases provide similar ways to set up n-way
replication. See [1]; in particular, [2][3][4] provide ways to set
up n-way replication via APIs. Yet another way is via configuration,
as seems to be provided by MySQL [5] (Group Replication Settings).
Most of the advantages have already been shared but let me summarize
again the benefits it brings: (a) more localized database access for
geographically distributed databases, (b) ensuring continuous
availability in case the primary site becomes unavailable due to a
system or network outage or a natural disaster at the site, (c)
environments that require a fluid replication infrastructure, where
the number of servers has to grow or shrink dynamically and with as
few side-effects as possible, for instance, database services for the
cloud, and (d) load balancing. Some of these can probably be served in
other ways but not everything.

I see your point about the POC not helping here, and it can also sometimes
discourage the OP if we decide not to do this feature or do it in an
entirely different way. But OTOH, I don't see it stopping us from
discussing the desirability or design of this feature.

[1] - https://docs.oracle.com/cd/E18283_01/server.112/e10707/rarrcatpac.htm
[2] - 
https://docs.oracle.com/cd/E18283_01/server.112/e10707/rarrcatpac.htm#i96251
[3] - 
https://docs.oracle.com/cd/E18283_01/server.112/e10707/rarrcatpac.htm#i94500
[4] - 
https://docs.oracle.com/cd/E18283_01/server.112/e10707/rarrcatpac.htm#i97185
[5] - 
https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-instances.html

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-06-02 Thread Bruce Momjian
On Thu, Jun  2, 2022 at 05:12:49PM +1000, Peter Smith wrote:
> On Thu, Jun 2, 2022 at 12:03 AM Bruce Momjian  wrote:
> >
> > On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
> ...
> 
> > My big point is that you should not be showing up with a patch but
> > rather have these discussions to get agreement that this is the
> > direction the community wants to go.
> 
> The purpose of posting the POC patch was certainly not to present a
> fait accompli design/implementation.
> 
> We wanted to solicit some community feedback about the desirability of
> the feature, but because LRG is complicated to describe we felt that
> having a basic functional POC might help to better understand the
> proposal. Also, we thought the ability to experiment with the proposed
> API could help people to decide whether LRG is something worth
> pursuing or not.

I don't think the POC is helping, and I am not sure we really want to
support this style of architecture due to its complexity vs other
options.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Multi-Master Logical Replication

2022-06-02 Thread Peter Smith
On Thu, Jun 2, 2022 at 12:03 AM Bruce Momjian  wrote:
>
> On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
...

> My big point is that you should not be showing up with a patch but
> rather have these discussions to get agreement that this is the
> direction the community wants to go.

The purpose of posting the POC patch was certainly not to present a
fait accompli design/implementation.

We wanted to solicit some community feedback about the desirability of
the feature, but because LRG is complicated to describe we felt that
having a basic functional POC might help to better understand the
proposal. Also, we thought the ability to experiment with the proposed
API could help people to decide whether LRG is something worth
pursuing or not.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Multi-Master Logical Replication

2022-06-02 Thread Amit Kapila
On Wed, Jun 1, 2022 at 7:33 PM Bruce Momjian  wrote:
>
> On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
> > On Tue, May 31, 2022 at 7:36 PM Bruce Momjian  wrote:
> > > Uh, thinking some more, why would anyone set things up this way ---
> > > having part of a table being primary on one server and a different part
> > > of the table be a subscriber.  Seems it would be simpler and safer to
> > > create two child tables and have one be primary on only one server.
> > > Users can access both tables using the parent.
> >
> > Yes, users can choose to do that way but still, to keep the nodes in
> > sync and continuity of operations, it will be very difficult to manage
> > the operations without the LRG APIs. Let us consider a simple two-node
> > example where on each node there is Table T that has partitions P1 and
> > P2. As far as I can understand, one needs to have the below kind of
> > set-up to allow local operations on geographically distributed nodes.
> >
> > Node-1:
> > node1 writes to P1
> > node1 publishes P1
> > node2 subscribes to P1 of node1
> >
> > Node-2:
> > node2 writes to P2
> > node2 publishes P2
> > node1 subscribes to P2 on node2
>
> Yes, that is how you would set it up.
>
> > In this setup, we need to publish individual partitions, otherwise, we
> > will face the loop problem where the data sent by node-1 to node-2 via
> > logical replication will again come back to it causing problems like
> > constraints violations, duplicate data, etc. There could be other ways
> > to do this set up with current logical replication commands (for ex.
> > publishing via root table) but that would require ways to avoid loops
> > and could have other challenges.
>
> Right, individual partitions.
>
> > Now, in such a setup/scheme, consider a scenario (scenario-1), where
> > node-2 went off (either it crashes, went out of network, just died,
> > etc.) and comes up after some time. Now, one can either make the
> > node-2 available by fixing the problem it has or can promote standby
> > in that location (if any) to become master, both might require some
> > time. In the meantime to continue the operations (which provides a
> > seamless experience to users), users will be connected to node-1 to
> > perform the required write operations. Now, to achieve this without
> > LRG APIs, it will be quite complex for users to keep the data in sync.
> > One needs to perform various steps to get the partition P2 data that
> > went to node-1 till the time node-2 was not available. On node-1, it
> > has to publish P2 changes for the time node-2 becomes available with
> > the help of Create/Drop Publication APIs. And when node-2 comes back,
> > it has to create a subscription for the above publication pub-2 to get
> > that data, ensure both the nodes are in sync, and then allow
> > operations on node-2.
>
> Well, you are going to need to modify the app so it knows it can write
> to both partitions on failover anyway.
>

I am not sure if this point is clear to me. From what I can understand
there are two possibilities for the app in this case and both seem to
be problematic.

(a) The app can be taught to write to the P2 partition in node-1 till
the time node-2 is not available. If so, how will we get the partition
P2 data that went to node-1 till the time node-2 was unavailable? If
we don't get the data to node-2 then the operations on node-2 (once it
comes back) can return incorrect results. Also, we need to ensure all
the data for P2 that went to node-1 should be replicated to all other
nodes in the system and for that also we need to create new
subscriptions pointing to node-1. It is easier to think of doing this
for physical replication where after failover the old master node can
start following the new node and the app just needs to be taught to
write to the new master node. I can't see how we can achieve that with
current logical replication APIs (apart from doing the complex steps
shared by me). One of the purposes of these new LRG APIs is to ensure
that users don't need to follow those complex steps after failover.

(b) The other possibility is that the app is responsible to ensure
that the same data is written on both node-1 and node-2 for the time
one of those is not available. For that app needs to store the data at
someplace for the time one of the nodes is unavailable and then write
it once the other node becomes available? Also, it won't be practical
when there are more partitions (say 10 or more) as all the partitions
data needs to be present on each node. I think it is the
responsibility of the database to keep the data in sync among nodes
when one or more of the nodes are not available.

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-06-01 Thread Bruce Momjian
On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
> On Tue, May 31, 2022 at 7:36 PM Bruce Momjian  wrote:
> > Uh, thinking some more, why would anyone set things up this way ---
> > having part of a table being primary on one server and a different part
> > of the table be a subscriber.  Seems it would be simpler and safer to
> > create two child tables and have one be primary on only one server.
> > Users can access both tables using the parent.
> 
> Yes, users can choose to do that way but still, to keep the nodes in
> sync and continuity of operations, it will be very difficult to manage
> the operations without the LRG APIs. Let us consider a simple two-node
> example where on each node there is Table T that has partitions P1 and
> P2. As far as I can understand, one needs to have the below kind of
> set-up to allow local operations on geographically distributed nodes.
> 
> Node-1:
> node1 writes to P1
> node1 publishes P1
> node2 subscribes to P1 of node1
> 
> Node-2:
> node2 writes to P2
> node2 publishes P2
> node1 subscribes to P2 on node2

Yes, that is how you would set it up.

> In this setup, we need to publish individual partitions, otherwise, we
> will face the loop problem where the data sent by node-1 to node-2 via
> logical replication will again come back to it causing problems like
> constraints violations, duplicate data, etc. There could be other ways
> to do this set up with current logical replication commands (for ex.
> publishing via root table) but that would require ways to avoid loops
> and could have other challenges.

Right, individual partitions.

> Now, in such a setup/scheme, consider a scenario (scenario-1), where
> node-2 went off (either it crashes, went out of network, just died,
> etc.) and comes up after some time. Now, one can either make the
> node-2 available by fixing the problem it has or can promote standby
> in that location (if any) to become master, both might require some
> time. In the meantime to continue the operations (which provides a
> seamless experience to users), users will be connected to node-1 to
> perform the required write operations. Now, to achieve this without
> LRG APIs, it will be quite complex for users to keep the data in sync.
> One needs to perform various steps to get the partition P2 data that
> went to node-1 till the time node-2 was not available. On node-1, it
> has to publish P2 changes for the time node-2 becomes available with
> the help of Create/Drop Publication APIs. And when node-2 comes back,
> it has to create a subscription for the above publication pub-2 to get
> that data, ensure both the nodes are in sync, and then allow
> operations on node-2.

Well, you are going to need to modify the app so it knows it can write
to both partitions on failover anyway.  I just don't see how adding this
complexity is wise.

My big point is that you should not be showing up with a patch but
rather have these discussions to get agreement that this is the
direction the community wants to go.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Multi-Master Logical Replication

2022-05-31 Thread Amit Kapila
On Tue, May 31, 2022 at 7:36 PM Bruce Momjian  wrote:
>
> On Wed, May 25, 2022 at 10:32:50PM -0400, Bruce Momjian wrote:
> > On Wed, May 25, 2022 at 12:13:17PM +0530, Amit Kapila wrote:
> > >
> > > It helps with setting up logical replication among two or more nodes
> > > (data flows both ways) which is important for use cases where
> > > applications are data-aware. For such apps, it will be beneficial to
> >
> > That does make sense, thanks.
>
> Uh, thinking some more, why would anyone set things up this way ---
> having part of a table being primary on one server and a different part
> of the table be a subscriber.  Seems it would be simpler and safer to
> create two child tables and have one be primary on only one server.
> Users can access both tables using the parent.
>

Yes, users can choose to do that way but still, to keep the nodes in
sync and continuity of operations, it will be very difficult to manage
the operations without the LRG APIs. Let us consider a simple two-node
example where on each node there is Table T that has partitions P1 and
P2. As far as I can understand, one needs to have the below kind of
set-up to allow local operations on geographically distributed nodes.

Node-1:
node1 writes to P1
node1 publishes P1
node2 subscribes to P1 of node1

Node-2:
node2 writes to P2
node2 publishes P2
node1 subscribes to P2 on node2
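
In plain SQL, that setup would look roughly like this (publication,
subscription, and connection names are illustrative):

-- Create the publications on both nodes first, then the subscriptions.

-- On node1:
CREATE PUBLICATION pub_p1 FOR TABLE p1;
-- On node2:
CREATE PUBLICATION pub_p2 FOR TABLE p2;

-- On node1:
CREATE SUBSCRIPTION sub_p2 CONNECTION 'host=node2 dbname=app'
    PUBLICATION pub_p2;
-- On node2:
CREATE SUBSCRIPTION sub_p1 CONNECTION 'host=node1 dbname=app'
    PUBLICATION pub_p1;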

In this setup, we need to publish individual partitions, otherwise, we
will face the loop problem where the data sent by node-1 to node-2 via
logical replication will again come back to it causing problems like
constraints violations, duplicate data, etc. There could be other ways
to do this set up with current logical replication commands (for ex.
publishing via root table) but that would require ways to avoid loops
and could have other challenges.

Now, in such a setup/scheme, consider a scenario (scenario-1), where
node-2 went off (either it crashes, went out of network, just died,
etc.) and comes up after some time. Now, one can either make the
node-2 available by fixing the problem it has or can promote standby
in that location (if any) to become master, both might require some
time. In the meantime to continue the operations (which provides a
seamless experience to users), users will be connected to node-1 to
perform the required write operations. Now, to achieve this without
LRG APIs, it will be quite complex for users to keep the data in sync.
One needs to perform various steps to get the partition P2 data that
went to node-1 till the time node-2 was not available. On node-1, it
has to publish P2 changes for the time node-2 becomes available with
the help of Create/Drop Publication APIs. And when node-2 comes back,
it has to create a subscription for the above publication pub-2 to get
that data, ensure both the nodes are in sync, and then allow
operations on node-2.
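
A rough sketch of the manual catch-up being described (names are
illustrative, and details such as copy_data handling are glossed over):

-- On node-1, while node-2 is down (P2 writes temporarily land here):
CREATE PUBLICATION pub_2 FOR TABLE p2;

-- On node-2, once it is back, to receive the P2 data it missed:
CREATE SUBSCRIPTION sub_p2_catchup
    CONNECTION 'host=node1 dbname=app'
    PUBLICATION pub_2;
-- ... then verify both nodes are in sync before resuming writes on node-2.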

Not only this, but if there are more nodes in this set-up (say, 10), it
has to change (drop/create) subscriptions corresponding to partition
P2 on all other nodes as each individual node is the owner of some
partition.

Another possibility is that the entire data center where node-2 was
present was gone due to some unfortunate incident in which case they
need to set up a new data center and hence a new node. Now, in such a
case, the user needs to do all the steps mentioned in the previous
scenario and additionally, it needs to ensure that it set up the node
to sync all the existing data (of all partitions) before this node
again starts receiving write changes for partition P2.

I think all this should be relatively simple with the LRG APIs, wherein
for the second scenario the user ideally just needs to use the lrg_attach*
API, and in the first scenario, it should automatically sync the
missing data once node-2 comes back.

Now, the other important point that we should also consider for these
LRG APIs is the ease of setup even in the normal case where we are
just adding a new node as mentioned by Peter Smith in his email [1]
(LRG makes setup easier), e.g., even if there are many nodes, we only
need a single lrg_attach by the joining node instead of needing N-1
subscriptions on all the existing nodes.

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPsvvfTWWwE8vkgUg4q%2BQLyoCyNE7NU%3DmEiYHcMcXciXdg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-05-31 Thread Bruce Momjian
On Wed, May 25, 2022 at 10:32:50PM -0400, Bruce Momjian wrote:
> On Wed, May 25, 2022 at 12:13:17PM +0530, Amit Kapila wrote:
> > > You still have not answered my question above.  "Without these features,
> > > what workload would this help with?"  You have only explained how the
> > > patch would fix one of the many larger problems.
> > >
> > 
> > It helps with setting up logical replication among two or more nodes
> > (data flows both ways) which is important for use cases where
> > applications are data-aware. For such apps, it will be beneficial to
> 
> That does make sense, thanks.

Uh, thinking some more, why would anyone set things up this way ---
having part of a table being primary on one server and a different part
of the table be a subscriber.  Seems it would be simpler and safer to
create two child tables and have one be primary on only one server. 
Users can access both tables using the parent.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Multi-Master Logical Replication

2022-05-25 Thread Bruce Momjian
On Wed, May 25, 2022 at 12:13:17PM +0530, Amit Kapila wrote:
> > You still have not answered my question above.  "Without these features,
> > what workload would this help with?"  You have only explained how the
> > patch would fix one of the many larger problems.
> >
> 
> It helps with setting up logical replication among two or more nodes
> (data flows both ways) which is important for use cases where
> applications are data-aware. For such apps, it will be beneficial to

That does make sense, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Multi-Master Logical Replication

2022-05-25 Thread Peter Smith
On Wed, May 25, 2022 at 4:43 PM Amit Kapila  wrote:
>
> On Tue, May 24, 2022 at 5:57 PM Bruce Momjian  wrote:
> >
> > On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:
> > > On Sat, May 14, 2022 at 12:33 AM Bruce Momjian  wrote:
> > > >
> > > > Uh, without these features, what workload would this help with?
> > > >
> > >
> > > To allow replication among multiple nodes when some of the nodes may
> > > have pre-existing data. This work plans to provide simple APIs to
> > > achieve that. Now, let me try to explain the difficulties users can
> > > face with the existing interface. It is simple to set up replication
> > > among various nodes when they don't have any pre-existing data but
> > > even in that case if the user operates on the same table at multiple
> > > nodes, the replication will lead to an infinite loop and won't
> > > proceed. The example in email [1] demonstrates that and the patch in
> > > that thread attempts to solve it. I have mentioned that problem
> > > because this work will need that patch.
> > ...
> > > This will become more complicated when more than two nodes are
> > > involved; see the example provided for the three-node case [2]. Can
> > > you think of some other simpler way to achieve the same? If not, I
> > > don't think the current way is ideal and even users won't prefer that.
> > > I am not saying that the APIs proposed in this thread are the only or
> > > best way to achieve the desired purpose, but I think we should do
> > > something to allow users to easily set up replication among multiple
> > > nodes.
> >
> > You still have not answered my question above.  "Without these features,
> > what workload would this help with?"  You have only explained how the
> > patch would fix one of the many larger problems.
> >
>
> It helps with setting up logical replication among two or more nodes
> (data flows both ways) which is important for use cases where
> applications are data-aware. For such apps, it will be beneficial to
> always send and retrieve data to local nodes in a geographically
> distributed database. Now, for such apps, to get 100% consistent data
> among nodes, one needs to enable synchronous_mode (aka set
> synchronous_standby_names) but if that hurts performance and the data
> is for analytical purposes then one can use it in asynchronous mode.
> Now, for such cases, if the local node goes down, the other master
> node can be immediately available to use; sure, it may slow down the
> operations for some time till the local node comes up. For such apps,
> it will also be easier to perform online upgrades later.
>
> Without this, if the user tries to achieve the same via physical
> replication by having two local nodes, it can take quite long before
> the standby can be promoted to master and local reads/writes will be
> much costlier.
>

As mentioned above, the LRG idea might be a useful addition to logical
replication for configuring certain types of "data-aware"
applications.

LRG for data-aware apps (e.g. sensor data)
--
Consider an example where there are multiple weather stations for a
country. Each weather station is associated with a PostgreSQL node and
inserts the local sensor data (e.g., wind/rain/sunshine, etc.) once a
minute into some local table. The row data is identified by some station
ID.

- Perhaps there are many nodes.

- Loss of a single row of replicated sensor data if some node goes
down is not a major problem for this sort of application.

- Benefits of processing data locally can be realised.

- Using LRG simplifies the setup/sharing of the data across all group
nodes via a common table.
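
For instance, the common table might look like this (a hypothetical
schema):

CREATE TABLE sensor_data (
    station_id  int,
    recorded_at timestamptz,
    wind_kph    numeric,
    rain_mm     numeric,
    sunshine_hr numeric,
    PRIMARY KEY (station_id, recorded_at)
);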

~~

LRG makes setup easier
--
Although it is possible already (using Vignesh's "infinite recursion"
WIP patch [1]) to set up this kind of environment using logical
replication, as the number of nodes grows it becomes more and more
difficult to do it. For each new node, there needs to be N-1 x CREATE
SUBSCRIPTION for the other group nodes, meaning the connection details
for every other node also must be known up-front for the script.

OTOH, the LRG API can simplify all this, removing the user's burden
and risk of mistakes. Also, LRG only needs to know how to reach just 1
other node in the group (the implementation will discover all the
other node connection details internally).
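
For example (the LRG function name follows this thread's PoC; the
argument list is an illustrative assumption):

-- Manual setup: the joining node needs one of these per existing node
-- (N-1 in total), each with its own connection details:
CREATE SUBSCRIPTION sub_from_node1
    CONNECTION 'host=node1 dbname=weather'
    PUBLICATION pub_node1
    WITH (local_only = on);
-- ... and likewise for node2, node3, ...

-- LRG setup: a single call, naming just one reachable group member:
SELECT lrg_node_attach('weather_group', 'host=node1 dbname=weather');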

~~

LRG can handle initial table data

If the joining node (e.g. a new weather station) already has some
initial local sensor data then sharing that initial data manually with
all the other nodes requires some tricky steps. LRG can hide all this
complexity behind the API, so it is not a user problem anymore.

--
[1] 
https://www.postgresql.org/message-id/flat/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Multi-Master Logical Replication

2022-05-25 Thread Amit Kapila
On Tue, May 24, 2022 at 5:57 PM Bruce Momjian  wrote:
>
> On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:
> > On Sat, May 14, 2022 at 12:33 AM Bruce Momjian  wrote:
> > >
> > > Uh, without these features, what workload would this help with?
> > >
> >
> > To allow replication among multiple nodes when some of the nodes may
> > have pre-existing data. This work plans to provide simple APIs to
> > achieve that. Now, let me try to explain the difficulties users can
> > face with the existing interface. It is simple to set up replication
> > among various nodes when they don't have any pre-existing data but
> > even in that case if the user operates on the same table at multiple
> > nodes, the replication will lead to an infinite loop and won't
> > proceed. The example in email [1] demonstrates that and the patch in
> > that thread attempts to solve it. I have mentioned that problem
> > because this work will need that patch.
> ...
> > This will become more complicated when more than two nodes are
> > involved; see the example provided for the three-node case [2]. Can
> > you think of some other simpler way to achieve the same? If not, I
> > don't think the current way is ideal and even users won't prefer that.
> > I am not saying that the APIs proposed in this thread are the only or
> > best way to achieve the desired purpose, but I think we should do
> > something to allow users to easily set up replication among multiple
> > nodes.
>
> You still have not answered my question above.  "Without these features,
> what workload would this help with?"  You have only explained how the
> patch would fix one of the many larger problems.
>

It helps with setting up logical replication among two or more nodes
(data flows both ways) which is important for use cases where
applications are data-aware. For such apps, it will be beneficial to
always send and retrieve data to local nodes in a geographically
distributed database. Now, for such apps, to get 100% consistent data
among nodes, one needs to enable synchronous_mode (aka set
synchronous_standby_names) but if that hurts performance and the data
is for analytical purposes then one can use it in asynchronous mode.
Now, for such cases, if the local node goes down, the other master
node can be immediately available to use; sure, it may slow down the
operations for some time till the local node comes up. For such apps,
it will also be easier to perform online upgrades later.

Without this, if the user tries to achieve the same via physical
replication by having two local nodes, it can take quite long before
the standby can be promoted to master and local reads/writes will be
much costlier.

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-05-24 Thread Bruce Momjian
On Sat, May 14, 2022 at 12:20:05PM +0530, Amit Kapila wrote:
> On Sat, May 14, 2022 at 12:33 AM Bruce Momjian  wrote:
> >
> > Uh, without these features, what workload would this help with?
> >
> 
> To allow replication among multiple nodes when some of the nodes may
> have pre-existing data. This work plans to provide simple APIs to
> achieve that. Now, let me try to explain the difficulties users can
> face with the existing interface. It is simple to set up replication
> among various nodes when they don't have any pre-existing data but
> even in that case if the user operates on the same table at multiple
> nodes, the replication will lead to an infinite loop and won't
> proceed. The example in email [1] demonstrates that and the patch in
> that thread attempts to solve it. I have mentioned that problem
> because this work will need that patch.
...
> This will become more complicated when more than two nodes are
> involved; see the example provided for the three-node case [2]. Can
> you think of some other simpler way to achieve the same? If not, I
> don't think the current way is ideal and even users won't prefer that.
> I am not saying that the APIs proposed in this thread are the only or
> best way to achieve the desired purpose, but I think we should do
> something to allow users to easily set up replication among multiple
> nodes.

You still have not answered my question above.  "Without these features,
what workload would this help with?"  You have only explained how the
patch would fix one of the many larger problems.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





RE: Multi-Master Logical Replication

2022-05-18 Thread kuroda.hay...@fujitsu.com
Hi hackers,

I created a small PoC. Please see the attached patches.

REQUIREMENT

Before patching them, patches in [1] must also be applied.


DIFFERENCES FROM PREVIOUS DESCRIPTIONS

* LRG is now implemented as SQL functions, not as a contrib module.
* New tables are added as system catalogs. Therefore, the added tables have
an oid column.
* The node_id is the concatenation of the system identifier and the dbid.
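
(For reference, the two components can be inspected on a node like
this:)

SELECT system_identifier FROM pg_control_system();
SELECT oid AS dbid FROM pg_database WHERE datname = current_database();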


HOW TO USE

In the document patch, a subsection 'Example' was added to help in
understanding LRG. In short, we can do:

1. lrg_create on one node
2. lrg_node_attach on another node

Also attached is a test script that constructs a three-node system.


LIMITATIONS

This feature is under development, so there are many limitations on its use.

* The function for detaching a node from a group is not implemented.
* The function for removing a group is not implemented.
* LRG does not lock system catalogs and databases. Concurrent operations may
cause an inconsistent state.
* LRG does not wait until the upstream node reaches the latest lsn of the
remaining nodes.
* LRG does not support initial data sync. That is, it works correctly only
when no node has initial data.


[1]: https://commitfest.postgresql.org/38/3610/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v1-0001-PoC-implement-LRG.patch
Description: v1-0001-PoC-implement-LRG.patch


v1-0002-add-doc.patch
Description: v1-0002-add-doc.patch


test.sh
Description: test.sh


Re: Multi-Master Logical Replication

2022-05-14 Thread Amit Kapila
On Sat, May 14, 2022 at 12:33 AM Bruce Momjian  wrote:
>
> Uh, without these features, what workload would this help with?
>

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

Now, let's take a simple case where two nodes have the same table
which has some pre-existing data:

Node-1:
Table t1 (c1 int) has data
1, 2, 3, 4

Node-2:
Table t1 (c1 int) has data
5, 6, 7, 8

If we have to set up replication among the above two nodes using
existing interfaces, it could be very tricky. Say user performs
operations like below:

Node-1
#Publication for t1
Create Publication pub1 For Table t1;

Node-2
#Publication for t1,
Create Publication pub1_2 For Table t1;

Node-1:
Create Subscription sub1 Connection '' Publication pub1_2;

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;

After this the data will be something like this:
Node-1:
1, 2, 3, 4, 5, 6, 7, 8

Node-2:
1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8

So, you can see that the data on Node-2 (5, 6, 7, 8) is duplicated. In
case table t1 has a unique key, it will lead to a unique key
violation and replication won't proceed. Here, I have assumed that we
already have the functionality of the patch in email [1]; otherwise,
replication will be an infinite loop, replicating the above data again
and again. Now, one way to achieve this could be to ask users
to stop all operations on both nodes before starting replication
between them, take data dumps of the tables from each node they want
to replicate, and restore them to the other nodes. Then use the above
commands to set up replication and allow operations to start on those
nodes. The other possibility for users could be as below. Assume we
have already created publications as in the above example, and then:

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;

#Wait for the initial sync of table t1 to finish. Users can ensure
that by checking 'srsubstate' in pg_subscription_rel.

Node-1:
Begin;
# Disallow truncates to be published and then truncate the table
Alter Publication pub1 Set (publish = 'insert, update, delete');
Truncate t1;
Create Subscription sub1 Connection '' Publication pub1_2;
Alter Publication pub1 Set (publish = 'insert, update, delete, truncate');
Commit;

This will become more complicated when more than two nodes are
involved; see the example provided for the three-node case [2]. Can
you think of some other simpler way to achieve the same? If not, I
don't think the current way is ideal and even users won't prefer that.
I am not saying that the APIs proposed in this thread are the only or
best way to achieve the desired purpose, but I think we should do
something to allow users to easily set up replication among multiple
nodes.

[1] - 
https://www.postgresql.org/message-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CALDaNm3aD3nZ0HWXA8V435AGMvORyR5-mq2FzqQdKQ8CPomB5Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Multi-Master Logical Replication

2022-05-13 Thread Bruce Momjian
On Fri, Apr 29, 2022 at 07:05:11PM +1000, Peter Smith wrote:
> This MMLR proposal was mostly just to create an interface making it
> easier to use PostgreSQL core logical replication CREATE
> PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
> Otherwise, this is difficult for a user to do manually. (e.g.
> difficulties as mentioned in section 2.2 of the original post [1] -
> dealing with initial table data, coordinating the timing/locking to
> avoid concurrent updates, getting the SUBSCRIPTION options for
> copy_data exactly right etc)
> 
> At this time we have no provision for HA, nor for transaction
> consistency awareness, conflict resolutions, node failure detections,
> DDL replication etc. Some of the features like DDL replication are
> currently being implemented [2], so when committed it will become
> available in the core, and can then be integrated into this module.

Uh, without these features, what workload would this help with?  I think
you made the mistake of jumping too far into implementation without
explaining the problem you are trying to solve.  The TODO list has this
ordering:

https://wiki.postgresql.org/wiki/Todo
Desirability -> Design -> Implement -> Test -> Review -> Commit

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Multi-Master Logical Replication

2022-05-11 Thread vignesh C
On Thu, Apr 28, 2022 at 5:20 AM Peter Smith  wrote:
>
> MULTI-MASTER LOGICAL REPLICATION
>
> 1.0 BACKGROUND
>
> Let’s assume that a user wishes to set up a multi-master environment
> so that a set of PostgreSQL instances (nodes) use logical replication
> to share tables with every other node in the set.
>
> We define this as a multi-master logical replication (MMLR) node-set.
>
> 
>
> 1.1 ADVANTAGES OF MMLR
>
> - Increases write scalability (e.g., all nodes can write arbitrary data).
> - Allows load balancing
> - Allows rolling updates of nodes (e.g., logical replication works
> between different major versions of PostgreSQL).
> - Improves the availability of the system (e.g., no single point of failure)
> - Improves performance (e.g., lower latencies for geographically local nodes)
>
> 2.0 MMLR AND POSTGRESQL
>
> It is already possible to configure a kind of MMLR set in PostgreSQL
> 15 using PUB/SUB, but it is very restrictive because it can only work
> when no two nodes operate on the same table. This is because when two
> nodes try to share the same table, there arises a circular
> recursive problem where Node1 replicates data to Node2, which is then
> replicated back to Node1, and so on.
>
> To prevent the circular recursive problem Vignesh is developing a
> patch [1] that introduces new SUBSCRIPTION options "local_only" (for
> publishing only data originating at the publisher node) and
> "copy_data=force". Using this patch, we have created a script [2]
> demonstrating how to set up all the above multi-node examples. An
> overview of the necessary steps is given in the next section.
>
> 2.1 STEPS – Adding a new node N to an existing node-set
>
> step 1. Prerequisites – Apply Vignesh’s patch [1]. All nodes in the
> set must be visible to each other by a known CONNECTION. All shared
> tables must already be defined on all nodes.
>
> step 2. On node N do CREATE PUBLICATION pub_N FOR ALL TABLES
>
> step 3. All other nodes then CREATE SUBSCRIPTION to PUBLICATION pub_N
> with "local_only=on, copy_data=on" (this will replicate initial data
> from the node N tables to every other node).
>
> step 4. On node N, temporarily ALTER PUBLICATION pub_N to prevent
> replication of 'truncate', then TRUNCATE all tables of node N, then
> re-allow replication of 'truncate'.
>
> step 5. On node N do CREATE SUBSCRIPTION to the publications of all
> other nodes in the set
> 5a. Specify "local_only=on, copy_data=force" for exactly one of the
> subscriptions  (this will make the node N tables now have the same
> data as the other nodes)
> 5b. Specify "local_only=on, copy_data=off" for all other subscriptions.
>
> step 6. Result - Now changes to any table on any node should be
> replicated to every other node in the set.
>
> Note: Steps 4 and 5 need to be done within the same transaction to
> avoid loss of data in case of some command failure. (Because we can't
> perform create subscription in a transaction, we need to create the
> subscription in a disabled mode first and then enable it in the
> transaction).
>
> 2.2 DIFFICULTIES
>
> Notice that it becomes increasingly complex to configure MMLR manually
> as the number of nodes in the set increases. There are also some
> difficulties such as
> - dealing with initial table data
> - coordinating the timing to avoid concurrent updates
> - getting the SUBSCRIPTION options for copy_data exactly right.
>
> 3.0 PROPOSAL
>
> To make the MMLR setup simpler, we propose to create a new API that
> will hide all the step details and remove the burden on the user to
> get it right without mistakes.
>
> 3.1 MOTIVATION
> - MMLR (sharing the same tables) is not currently possible
> - Vignesh's patch [1] makes MMLR possible, but the manual setup is
> still quite difficult
> - An MMLR implementation can solve the timing problems (e.g., using
> Database Locking)
>
> 3.2 API
>
> Preferably the API would be implemented as new SQL functions in
> PostgreSQL core, however, implementation using a contrib module or
> some new SQL syntax may also be possible.
>
> SQL functions will be like below:
> - pg_mmlr_set_create = create a new set, and give it a name
> - pg_mmlr_node_attach = attach the current node to a specified set
> - pg_mmlr_node_detach = detach a specified node from a specified set
> - pg_mmlr_set_delete = delete a specified set
>
> For example, internally the pg_mmlr_node_attach API function would
> execute the equivalent of all the CREATE PUBLICATION, CREATE
> SUBSCRIPTION, and TRUNCATE steps described above.
>
> Notice this proposal has some external API similarities with the BDR
> extension [3] (which also provides multi-master logical replication),
> although we plan to implement it entirely using PostgreSQL’s PUB/SUB.
>
> 4.0 ACKNOWLEDGEMENTS
>
> The following people have contributed to this proposal – Hayato
> Kuroda, Vignesh C, Peter Smith, Amit Kapila.
>
> 5.0 REFERENCES
>
> [1] 
> 

Re: Multi-Master Logical Replication

2022-04-29 Thread vignesh C
On Fri, Apr 29, 2022 at 2:35 PM Peter Smith  wrote:
>
> On Fri, Apr 29, 2022 at 2:16 PM Yura Sokolov  wrote:
> >
> > > On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> > > On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  
> > > wrote:
> > > > On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
> > > >
> > > > > 1.1 ADVANTAGES OF MMLR
> > > > >
> > > > > - Increases write scalability (e.g., all nodes can write arbitrary 
> > > > > data).
> > > >
> > > > I've never heard how transaction-aware multimaster increases
> > > > write scalability. Moreover, usually even non-transactional
> > > > multimaster doesn't increase write scalability. At best it
> > > > doesn't decrease.
> > > >
> > > > That is because all hosts have to write all changes anyway. But
> > > > side costs increase due to increased network interchange and
> > > > interlocking (for transaction-aware MM) and increased latency.
> > >
> > > I agree it won't increase in all cases, but it will be better in a few
> > > cases, such as when users in different geographical regions operate
> > > on independent schemas in asynchronous mode. Since the write node is
> > > closer to its geographical zone, the performance will be better in
> > > those cases.
> >
> > From the EnterpriseDB BDR page [1]:
> >
> > > Adding more master nodes to a BDR Group does not result in
> > > significant write throughput increase when most tables are
> > > replicated because BDR has to replay all the writes on all nodes.
> > > Because BDR writes are in general more effective than writes coming
> > > from Postgres clients via SQL, some performance increase can be
> > > achieved. Read throughput generally scales linearly with the number
> > > of nodes.
> >
> > And I'm sure EnterpriseDB does this about as well as it can be done.
> >
> > > > On Thu, 28/04/2022 at 08:34, kuroda.hay...@fujitsu.com wrote:
> > > > > Dear Laurenz,
> > > > >
> > > > > Thank you for your interest in our works!
> > > > >
> > > > > > I am missing a discussion how replication conflicts are handled to
> > > > > > prevent replication from breaking
> > > > >
> > > > > Actually we don't have plans for developing a feature that avoids
> > > > > conflicts.
> > > > > We think that it should be done as a core PUB/SUB feature, and
> > > > > this module will just use that.
> > > >
> > > > If you really want to have some proper isolation levels
> > > > (Read Committed? Repeatable Read?) and/or want to have the
> > > > same data on each "master", there is no easy way. If you
> > > > think it will be "easy", you are already wrong.
> > >
> > > The synchronous_commit and synchronous_standby_names configuration
> > > parameters will help in getting the same data across the nodes. Can
> > > you give an example for the scenario where it will be difficult?
> >
> > So, synchronous or asynchronous?
> > Synchronous commit on every master, on every alive master, or on a
> > quorum of masters?
> >
> > And it is not about synchronicity. It is about determinism in
> > conflict resolution.
> >
> > If you have fully deterministic conflict resolution that works
> > exactly the same way on each host, then it is possible to have the
> > same data on each host. (But it will not be transactional.) And it
> > seems EDB BDR achieved this.
> >
> > Or if you have fully and correctly implemented one of the
> > distributed transaction protocols.
> >
> > [1]  
> > https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance
> >
> > regards
> >
> > --
> >
> > Yura Sokolov
>
> Thanks for your feedback.
>
> This MMLR proposal was mostly just to create an interface making it
> easier to use PostgreSQL core logical replication CREATE
> PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
> Otherwise, this is difficult for a user to do manually. (e.g.
> difficulties as mentioned in section 2.2 of the original post [1] -
> dealing with initial table data, coordinating the timing/locking to
> avoid concurrent updates, getting the SUBSCRIPTION options for
> copy_data exactly right etc)

Different problems and how to solve each scenario are described in
detail in [1]. It gets even more complex as more nodes are involved;
let's consider the 3-node case:
Adding a new node node3 to the existing nodes node1 and node2, when
data is already present in node1 and node2, requires the following
steps:
Create a publication in node3:
CREATE PUBLICATION pub_node3 FOR ALL TABLES;

Create a subscription in node1 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node1_node3 CONNECTION 'dbname=foo host=node3
user=repuser' PUBLICATION pub_node3 WITH (copy_data = off, local_only
= on);

Create a subscription in node2 to subscribe the changes from node3:
CREATE SUBSCRIPTION sub_node2_node3 CONNECTION 'dbname=foo host=node3
user=repuser' PUBLICATION pub_node3 WITH (copy_data = off, local_only
= on);

Lock the database at node2 and wait till the walsender sends WAL to
node1 (up to the current lsn) to avoid any data loss because of
node2's WAL not being sent to node1. This lock needs 
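
To complete the picture, node3 then subscribes to node1 and node2,
with copy_data = force for exactly one of the subscriptions so that
node3 picks up the existing data (a sketch following the copy_data
pattern of section 2.1 in the original post; the publication and
subscription names are illustrative, and node3's tables must be
truncated first as described there):

CREATE SUBSCRIPTION sub_node3_node1 CONNECTION 'dbname=foo host=node1
user=repuser' PUBLICATION pub_node1 WITH (copy_data = force,
local_only = on);

CREATE SUBSCRIPTION sub_node3_node2 CONNECTION 'dbname=foo host=node2
user=repuser' PUBLICATION pub_node2 WITH (copy_data = off, local_only
= on);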

Re: Multi-Master Logical Replication

2022-04-29 Thread Peter Smith
On Fri, Apr 29, 2022 at 2:16 PM Yura Sokolov  wrote:
>
> On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> > On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  
> > wrote:
> > > On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
> > >
> > > > 1.1 ADVANTAGES OF MMLR
> > > >
> > > > - Increases write scalability (e.g., all nodes can write arbitrary 
> > > > data).
> > >
> > > I've never heard how transaction-aware multimaster increases
> > > write scalability. Moreover, usually even non-transactional
> > > multimaster doesn't increase write scalability. At best it
> > > doesn't decrease.
> > >
> > > That is because all hosts have to write all changes anyway. But
> > > side costs increase due to increased network interchange and
> > > interlocking (for transaction-aware MM) and increased latency.
> >
> > I agree it won't increase in all cases, but it will be better in a few
> > cases, such as when users in different geographical regions operate
> > on independent schemas in asynchronous mode. Since the write node is
> > closer to its geographical zone, the performance will be better in
> > those cases.
>
> From the EnterpriseDB BDR page [1]:
>
> > Adding more master nodes to a BDR Group does not result in
> > significant write throughput increase when most tables are
> > replicated because BDR has to replay all the writes on all nodes.
> > Because BDR writes are in general more effective than writes coming
> > from Postgres clients via SQL, some performance increase can be
> > achieved. Read throughput generally scales linearly with the number
> > of nodes.
>
> And I'm sure EnterpriseDB does this about as well as it can be done.
>
> > > On Thu, 28/04/2022 at 08:34, kuroda.hay...@fujitsu.com wrote:
> > > > Dear Laurenz,
> > > >
> > > > Thank you for your interest in our works!
> > > >
> > > > > I am missing a discussion how replication conflicts are handled to
> > > > > prevent replication from breaking
> > > >
> > > > Actually we don't have plans for developing a feature that avoids
> > > > conflicts.
> > > > We think that it should be done as a core PUB/SUB feature, and
> > > > this module will just use that.
> > >
> > > If you really want to have some proper isolation levels
> > > (Read Committed? Repeatable Read?) and/or want to have the
> > > same data on each "master", there is no easy way. If you
> > > think it will be "easy", you are already wrong.
> >
> > The synchronous_commit and synchronous_standby_names configuration
> > parameters will help in getting the same data across the nodes. Can
> > you give an example for the scenario where it will be difficult?
>
> So, synchronous or asynchronous?
> Synchronous commit on every master, on every alive master, or on a
> quorum of masters?
>
> And it is not about synchronicity. It is about determinism in
> conflict resolution.
>
> If you have fully deterministic conflict resolution that works
> exactly the same way on each host, then it is possible to have the
> same data on each host. (But it will not be transactional.) And it
> seems EDB BDR achieved this.
>
> Or if you have fully and correctly implemented one of the
> distributed transaction protocols.
>
> [1]  
> https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance
>
> regards
>
> --
>
> Yura Sokolov

Thanks for your feedback.

This MMLR proposal was mostly just to create an interface making it
easier to use PostgreSQL core logical replication CREATE
PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
Otherwise, this is difficult for a user to do manually. (e.g.
difficulties as mentioned in section 2.2 of the original post [1] -
dealing with initial table data, coordinating the timing/locking to
avoid concurrent updates, getting the SUBSCRIPTION options for
copy_data exactly right etc)
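
To give a feel for the intended usage, attaching a node would reduce
to something like this (the function names are from section 3.2 of
the original post [1]; the exact signatures are not settled yet, so
the arguments shown here are only illustrative):

-- on the first node, create and name the set
SELECT pg_mmlr_set_create('mmlr_set_1');

-- on each node that should join the set (internally this would run
-- the CREATE PUBLICATION, CREATE SUBSCRIPTION, and TRUNCATE steps
-- of section 2.1)
SELECT pg_mmlr_node_attach('mmlr_set_1');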

At this time we have no provision for HA, nor for transaction
consistency awareness, conflict resolutions, node failure detections,
DDL replication etc. Some of the features like DDL replication are
currently being implemented [2], so when committed it will become
available in the core, and can then be integrated into this module.

Once the base feature of the current MMLR proposal is done, perhaps it
can be extended in subsequent versions.

Probably our calling this “Multi-Master” has been
misleading/confusing, because that term implies much more to other
readers. We really only intended it to mean the ability to set up
logical replication across a set of nodes. Of course, we can rename
the proposal (and API) to something different if there are better
suggestions.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuwRAoWY9pz%3DEubps3ooQCOBFiYPU9Yi%3DVB-U%2ByORU7OA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/45d0d97c-3322-4054-b94f-3c08774bbd90%40www.fastmail.com#db6e810fc93f17b0a5585bac25fb3d4b

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Multi-Master Logical Replication

2022-04-28 Thread Yura Sokolov
On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  wrote:
> > On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
> > 
> > > 1.1 ADVANTAGES OF MMLR
> > > 
> > > - Increases write scalability (e.g., all nodes can write arbitrary data).
> > 
> > I've never heard how transaction-aware multimaster increases
> > write scalability. Moreover, usually even non-transactional
> > multimaster doesn't increase write scalability. At best it
> > doesn't decrease.
> >
> > That is because all hosts have to write all changes anyway. But
> > side costs increase due to increased network interchange and
> > interlocking (for transaction-aware MM) and increased latency.
> 
> I agree it won't increase in all cases, but it will be better in a few
> cases, such as when users in different geographical regions operate
> on independent schemas in asynchronous mode. Since the write node is
> closer to its geographical zone, the performance will be better in
> those cases.

From the EnterpriseDB BDR page [1]:

> Adding more master nodes to a BDR Group does not result in
> significant write throughput increase when most tables are
> replicated because BDR has to replay all the writes on all nodes.
> Because BDR writes are in general more effective than writes coming
> from Postgres clients via SQL, some performance increase can be
> achieved. Read throughput generally scales linearly with the number
> of nodes.

And I'm sure EnterpriseDB does this about as well as it can be done.

> > On Thu, 28/04/2022 at 08:34, kuroda.hay...@fujitsu.com wrote:
> > > Dear Laurenz,
> > > 
> > > Thank you for your interest in our works!
> > > 
> > > > I am missing a discussion how replication conflicts are handled to
> > > > prevent replication from breaking
> > > 
> > > Actually we don't have plans for developing a feature that avoids
> > > conflicts.
> > > We think that it should be done as a core PUB/SUB feature, and
> > > this module will just use that.
> > 
> > If you really want to have some proper isolation levels
> > (Read Committed? Repeatable Read?) and/or want to have the
> > same data on each "master", there is no easy way. If you
> > think it will be "easy", you are already wrong.
> 
> The synchronous_commit and synchronous_standby_names configuration
> parameters will help in getting the same data across the nodes. Can
> you give an example for the scenario where it will be difficult?

So, synchronous or asynchronous?
Synchronous commit on every master, on every alive master, or on a
quorum of masters?

And it is not about synchronicity. It is about determinism in
conflict resolution.

If you have fully deterministic conflict resolution that works
exactly the same way on each host, then it is possible to have the
same data on each host. (But it will not be transactional.) And it
seems EDB BDR achieved this.

Or if you have fully and correctly implemented one of the
distributed transaction protocols.

[1]  
https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance
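
To illustrate what "deterministic" means here: given the same pair of
conflicting row versions, every host must pick the same winner. A
minimal sketch of such a rule, last-writer-wins with a node-id
tie-break (the mtime/origin columns are assumptions of this example,
not something BDR or mmts actually does):

CREATE FUNCTION pick_winner(local_mtime timestamptz,  local_origin int,
                            remote_mtime timestamptz, remote_origin int)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE
    WHEN remote_mtime > local_mtime THEN 'remote'
    WHEN remote_mtime < local_mtime THEN 'local'
    -- equal timestamps: break the tie by origin node id, identically
    -- on every host
    WHEN remote_origin > local_origin THEN 'remote'
    ELSE 'local'
  END;
$$;

Because the decision depends only on the two row versions themselves,
every node seeing the same pair converges to the same result,
regardless of the order in which the changes arrive.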

regards

--

Yura Sokolov





Re: Multi-Master Logical Replication

2022-04-28 Thread vignesh C
On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  wrote:
>
> On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:
>
> > 1.1 ADVANTAGES OF MMLR
> >
> > - Increases write scalability (e.g., all nodes can write arbitrary data).
>
> I've never heard how transaction-aware multimaster increases
> write scalability. Moreover, usually even non-transactional
> multimaster doesn't increase write scalability. At best it
> doesn't decrease.
>
> That is because all hosts have to write all changes anyway. But
> side costs increase due to increased network interchange and
> interlocking (for transaction-aware MM) and increased latency.

I agree it won't increase in all cases, but it will be better in a few
cases, such as when users in different geographical regions operate
on independent schemas in asynchronous mode. Since the write node is
closer to its geographical zone, the performance will be better in
those cases.

> On Thu, 28/04/2022 at 08:34, kuroda.hay...@fujitsu.com wrote:
> > Dear Laurenz,
> >
> > Thank you for your interest in our works!
> >
> > > I am missing a discussion how replication conflicts are handled to
> > > prevent replication from breaking
> >
> > Actually we don't have plans for developing a feature that avoids
> > conflicts.
> > We think that it should be done as a core PUB/SUB feature, and
> > this module will just use that.
>
> If you really want to have some proper isolation levels
> (Read Committed? Repeatable Read?) and/or want to have the
> same data on each "master", there is no easy way. If you
> think it will be "easy", you are already wrong.

The synchronous_commit and synchronous_standby_names configuration
parameters will help in getting the same data across the nodes. Can
you give an example for the scenario where it will be difficult?
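
For instance, a commit on node1 can be made to wait until both of its
subscribers have confirmed it (a sketch; the subscription names are
illustrative, and this relies on a logical walsender reporting the
subscription name as its application_name by default):

-- on node1, where node2 and node3 subscribed as sub_node2_node1
-- and sub_node3_node1
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names =
    'FIRST 2 (sub_node2_node1, sub_node3_node1)';
SELECT pg_reload_conf();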

Regards,
Vignesh




Re: Multi-Master Logical Replication

2022-04-28 Thread Yura Sokolov
On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:

> 1.1 ADVANTAGES OF MMLR
> 
> - Increases write scalability (e.g., all nodes can write arbitrary data).

I've never heard how transaction-aware multimaster increases
write scalability. Moreover, usually even non-transactional
multimaster doesn't increase write scalability. At best it
doesn't decrease.

That is because all hosts have to write all changes anyway. But
side costs increase due to increased network interchange and
interlocking (for transaction-aware MM) and increased latency.
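
For a back-of-the-envelope illustration (the numbers are invented):
with 3 masters each accepting 1000 local writes/s, every node has to
apply its own 1000 writes/s plus the 2000 writes/s replicated from
the other two. Each node therefore performs 3000 writes/s of work,
the same as a single node handling all of the traffic, so the
per-node write capacity still caps the total.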

On Thu, 28/04/2022 at 08:34, kuroda.hay...@fujitsu.com wrote:
> Dear Laurenz,
> 
> Thank you for your interest in our works!
> 
> > I am missing a discussion how replication conflicts are handled to
> > prevent replication from breaking
> 
> Actually we don't have plans for developing a feature that avoids conflicts.
> We think that it should be done as a core PUB/SUB feature, and
> this module will just use that.

If you really want to have some proper isolation levels
(Read Committed? Repeatable Read?) and/or want to have the
same data on each "master", there is no easy way. If you
think it will be "easy", you are already wrong.

Our company has a MultiMaster product which is built on top of
logical replication. It is even partially open source
( https://github.com/postgrespro/mmts ), although some of the
core patches it requires are not up to date.

And it is the second iteration of MM. The first iteration was
not "simple" or "easy" either. But even that version had
a hidden bug: a rare but accumulating data difference
between nodes. The attempt to fix this bug led to an almost
full rewrite of the multi-master.

(Disclaimer: I had no involvement in either MM version;
I just work at the same company).


regards

-

Yura Sokolov





RE: Multi-Master Logical Replication

2022-04-28 Thread kuroda.hay...@fujitsu.com
Dear Laurenz,

Thank you for your interest in our works!

> I am missing a discussion how replication conflicts are handled to
> prevent replication from breaking or the databases from drifting apart.

Actually we don't have plans for developing a feature that avoids conflicts.
We think that it should be done as a core PUB/SUB feature, and
this module will just use that.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Multi-Master Logical Replication

2022-04-28 Thread Laurenz Albe
On Thu, 2022-04-28 at 09:49 +1000, Peter Smith wrote:
> To prevent the circular recursive problem Vignesh is developing a
> patch [1] that introduces new SUBSCRIPTION options "local_only" (for
> publishing only data originating at the publisher node) and
> "copy_data=force". Using this patch, we have created a script [2]
> demonstrating how to set up all the above multi-node examples. An
> overview of the necessary steps is given in the next section.

I am missing a discussion how replication conflicts are handled to
prevent replication from breaking or the databases from drifting apart.

Yours,
Laurenz Albe