Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Friday, June 29, 2012 02:43:49 PM Boszormenyi Zoltan wrote:
> On 2012-06-19 09:24, Andres Freund wrote:
> > On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> > > On 12-06-18 07:30 AM, Andres Freund wrote:
> > > > Hrmpf #666. I will go through the series commit-by-commit again
> > > > to make sure everything compiles again. Reordering this late
> > > > definitely wasn't a good idea...
> > > >
> > > > I pushed a rebased version with all those fixups (and removal of the
> > > > zeroRecPtr patch).
> > >
> > > Where did you push that rebased version to? I don't see an attachment,
> > > or an updated patch in the commitfest app, and your repo at
> > > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> > > hasn't been updated in 5 days.
> >
> > To the 2ndquadrant internal repo. Which strangely doesn't help you.
> > *Headdesk*. Pushed to the correct repo and manually verified.
>
> Which repository is the correct one?
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
> was refreshed 11 days ago. The patch taken from there fails with a reject
> in src/include/access/xlog.h.

That's the right repository, but Heikki's recent changes
(dfda6ebaec6763090fb78b458a979b558c50b39b and several following) changed quite
a bit around on master, and I am currently rebasing and addressing review
comments.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 2012-06-19 09:24, Andres Freund wrote:
> On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> > On 12-06-18 07:30 AM, Andres Freund wrote:
> > > Hrmpf #666. I will go through the series commit-by-commit again to make
> > > sure everything compiles again. Reordering this late definitely wasn't
> > > a good idea...
> > >
> > > I pushed a rebased version with all those fixups (and removal of the
> > > zeroRecPtr patch).
> >
> > Where did you push that rebased version to? I don't see an attachment,
> > or an updated patch in the commitfest app, and your repo at
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> > hasn't been updated in 5 days.
>
> To the 2ndquadrant internal repo. Which strangely doesn't help you.
> *Headdesk*. Pushed to the correct repo and manually verified.

Which repository is the correct one?
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
was refreshed 11 days ago. The patch taken from there fails with a reject
in src/include/access/xlog.h.

> Andres

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 10:12:46 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund wrote:
> > On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> > > On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > > > OK, so in this case, I still don't see how the "origin_id" is even
> > > > > enough.
> > > > >
> > > > > C applies the change originally from A (routed through B, because
> > > > > it's faster). But when it gets the change directly from A, how
> > > > > does it know to *not* apply it again?
> > > >
> > > > The lsn of the change.
> > >
> > > So why isn't the LSN good enough for when C propagates the change back
> > > to A?
> >
> > Because every node has individual progress in the wal, so the lsn
> > doesn't mean anything unless you know which node it originally is from.
> >
> > > Why does A need more information than C?
> >
> > Didn't I explain that two mails up?
>
> Probably, but that didn't mean I understood it... I'm trying to keep up
> here ;-)

Heh. Yes. This developed into a huge mess already ;)

> So the origin_id isn't strictly for the origin node to know to filter an
> LCR it has applied already; it is also there to correlate the LSNs,
> because the LSNs of the re-generated LCRs are meant to contain the
> originating node's LSN, and every node applying LCRs needs to be able to
> know where it is in every other node's LSN progress.

There are multiple use-cases for it; this is one of them.

> I had assumed any LCRs generated on a node would be relative to the
> LSN sequencing of that node.

We have the original lsn in the commit record. That is needed to support
crash recovery, because you need to know where to restart applying changes
from.

> > Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> > north america.
> > #1 wants changes from #3 and #4 when talking to #3 but not those from #2
> > and itself (because those would be cheaper to get locally).
>
> Right, but if the link between #1 and #2 ever "slows down", changes
> from #3 and #4 may very well already have #2's changes, and even
> require them.
>
> #1 has to apply them, or is it going to stop applying LCRs from #3
> when it sees LCRs from #3 coming in that originate on #2 and have
> LSNs greater than what it has so far received from #2?

We will see ;). There are several solutions possible, possibly even depending
on the use-case. This patch just wants to provide the very basic information
required to implement such solutions...

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
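To make the "individual progress in the wal" point concrete, here is a minimal
standalone C sketch of how an apply process might track, per origin node, the
highest origin LSN it has already applied, and skip a change that arrives
again over a different route. The type and function names are invented for
illustration and are not taken from the patch series.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;   /* illustrative 16-bit origin identifier */
    typedef uint64_t RepLsn;        /* illustrative LSN as a plain 64-bit value */

    #define MAX_ORIGINS 65536

    /* Highest origin LSN already applied, indexed by origin id. */
    static RepLsn applied_up_to[MAX_ORIGINS];

    /*
     * A change is identified by the node it originated on plus its LSN on
     * that node, so the same change arriving via two different peers is
     * applied exactly once.
     */
    static bool
    change_needs_apply(RepOriginId origin, RepLsn origin_lsn)
    {
        return origin_lsn > applied_up_to[origin];
    }

    /* Remember progress after a change (or commit) has been applied. */
    static void
    advance_origin_progress(RepOriginId origin, RepLsn origin_lsn)
    {
        if (origin_lsn > applied_up_to[origin])
            applied_up_to[origin] = origin_lsn;
    }

This is also why the LSN alone is not enough: the same numeric LSN means
different things on different nodes, so the pair (origin id, origin LSN) is
the minimal key for deduplication.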
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund wrote:
> On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> > On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > > OK, so in this case, I still don't see how the "origin_id" is even
> > > > enough.
> > > >
> > > > C applies the change originally from A (routed through B, because
> > > > it's faster). But when it gets the change directly from A, how does
> > > > it know to *not* apply it again?
> > >
> > > The lsn of the change.
> >
> > So why isn't the LSN good enough for when C propagates the change back
> > to A?
>
> Because every node has individual progress in the wal, so the lsn doesn't
> mean anything unless you know which node it originally is from.
>
> > Why does A need more information than C?
>
> Didn't I explain that two mails up?

Probably, but that didn't mean I understood it... I'm trying to keep up
here ;-)

So the origin_id isn't strictly for the origin node to know to filter an LCR
it has applied already; it is also there to correlate the LSNs, because the
LSNs of the re-generated LCRs are meant to contain the originating node's
LSN, and every node applying LCRs needs to be able to know where it is in
every other node's LSN progress.

I had assumed any LCRs generated on a node would be relative to the LSN
sequencing of that node.

> Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> north america.
> #1 wants changes from #3 and #4 when talking to #3 but not those from #2
> and itself (because those would be cheaper to get locally).

Right, but if the link between #1 and #2 ever "slows down", changes from #3
and #4 may very well already have #2's changes, and even require them.

#1 has to apply them, or is it going to stop applying LCRs from #3 when it
sees LCRs from #3 coming in that originate on #2 and have LSNs greater than
what it has so far received from #2?

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > OK, so in this case, I still don't see how the "origin_id" is even
> > > enough.
> > >
> > > C applies the change originally from A (routed through B, because it's
> > > faster). But when it gets the change directly from A, how does it
> > > know to *not* apply it again?
> >
> > The lsn of the change.
>
> So why isn't the LSN good enough for when C propagates the change back to
> A?

Because every node has individual progress in the wal, so the lsn doesn't
mean anything unless you know which node it originally is from.

> Why does A need more information than C?

Didn't I explain that two mails up? Perhaps Chris' phrasing explains the
basic idea better:

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> The case where it would be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored downstream
> - That INSERT was replicated from node #2, so should be ignored downstream
> - This UPDATE came from the local node, so needs to be passed to
>   downstream users

Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north
america.
#1 wants changes from #3 and #4 when talking to #3 but not those from #2 and
itself (because those would be cheaper to get locally).

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
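The filtering Chris describes amounts to a small decision function per
connection. A hypothetical sketch (the node numbering follows the #1..#4
example above; nothing here is actual patch code): when node #1 talks to
node #3 it asks for everything except changes that originated on #1 or #2,
because those are cheaper to get locally.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;

    #define LOCAL_NODE_ID 1        /* illustrative id for node #1 */

    /*
     * Per-connection filter: when streaming to a peer, forward a change only
     * if its origin is in the set the peer asked for.  'wanted' is a bitmap
     * of origin ids the peer wants over this connection.
     */
    static bool
    forward_to_peer(const uint8_t *wanted, RepOriginId origin)
    {
        return (wanted[origin / 8] >> (origin % 8)) & 1;
    }

    /*
     * Apply-side loop prevention: never re-apply (or send back) a change
     * whose origin is the local node itself.
     */
    static bool
    accept_from_peer(RepOriginId origin)
    {
        return origin != LOCAL_NODE_ID;
    }

With the origin id available on every record, both checks are a single
comparison against a small integer, which is the efficiency argument made
throughout this thread.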
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > OK, so in this case, I still don't see how the "origin_id" is even
> > enough.
> >
> > C applies the change originally from A (routed through B, because it's
> > faster). But when it gets the change directly from A, how does it
> > know to *not* apply it again?
>
> The lsn of the change.

So why isn't the LSN good enough for when C propagates the change back to A?

Why does A need more information than C?

a.

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:23:34 PM Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
>
> I reserve the right to object to that, too :-). Others raised the
> concern that a 16-bit integer is not a very intuitive identifier. Also,
> as discussed, for more complex scenarios just the originid is not
> sufficient. ISTM that we need more flexibility.

I don't think the "16bit integer is unintuitive" argument is that
interesting. As pointed out by multiple people in the thread, the origin_id
can be local and mapped to something more complex in the communication
between the different nodes and in the configuration. Before applying changes
from another node you look up their "complex id" and map it to the locally
assigned 16bit origin_id, which then gets written into the wal stream. When
decoding the wal stream into the LCR stream it is mapped the other way.

We might need more information than that at a later point, but that probably
won't be needed during low-level filtering of wal before reassembling it into
transactions...

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
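A rough illustration of the mapping described above, with invented names:
externally a node might be identified by something verbose (a UUID or a
connection string) kept in configuration, while each node assigns its own
small 16-bit id, and only that compact id is written to WAL. This is a
standalone sketch, not code from the patch.

    #include <stdint.h>
    #include <string.h>

    typedef uint16_t RepOriginId;

    #define InvalidRepOriginId  0
    #define MAX_MAPPED_ORIGINS  1024

    /* Hypothetical local mapping table: external name -> local 16-bit id. */
    static char origin_names[MAX_MAPPED_ORIGINS][64];
    static int  n_origins = 0;

    /* Look up (or assign) the local 16-bit id for an external identifier. */
    static RepOriginId
    origin_id_for(const char *external_name)
    {
        int i;

        for (i = 0; i < n_origins; i++)
        {
            if (strcmp(origin_names[i], external_name) == 0)
                return (RepOriginId) (i + 1);   /* 0 is reserved as invalid */
        }
        if (n_origins >= MAX_MAPPED_ORIGINS)
            return InvalidRepOriginId;          /* caller handles exhaustion */
        strncpy(origin_names[n_origins], external_name,
                sizeof(origin_names[0]) - 1);
        return (RepOriginId) (++n_origins);
    }

Decoding does the reverse lookup, so the inter-node protocol and the
configuration can use whatever richer identifier they like while the WAL
itself only carries two bytes.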
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:24:29 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund wrote:
> > To recap why we think origin_id is a sensible design choice:
> >
> > There are many sensible replication topologies where it does make sense
> > that you want to receive changes (on node C) from one node (say B) that
> > originated from some other node (say A).
> > Reasons include:
> > * the order of applying changes should be as similar as possible on all
> >   nodes. That means when applying a change on C that originated on B,
> >   and if changes replicated faster from A->B than from A->C, you want to
> >   be at least as far with the replication from A as B was. Otherwise the
> >   conflict ratio will increase. If you can recreate the stream from the
> >   wal of every node and still detect where an individual change
> >   originated, that's easy.
>
> OK, so in this case, I still don't see how the "origin_id" is even enough.
>
> C applies the change originally from A (routed through B, because it's
> faster). But when it gets the change directly from A, how does it
> know to *not* apply it again?

The lsn of the change.

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 03:23, Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
>
> I reserve the right to object to that, too :-).

OK. But that would be only for MMR, using special record types.

> Others raised the concern that a 16-bit integer is not a very intuitive
> identifier.

Of course.

> Also, as discussed, for more complex scenarios just the originid is not
> sufficient. ISTM that we need more flexibility.

Of course.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 03:13, Heikki Linnakangas wrote:
> On 20.06.2012 21:51, Simon Riggs wrote:
> > On 21 June 2012 02:32, Heikki Linnakangas wrote:
> > > I'm not saying that we need to implement all possible conflict
> > > resolution algorithms right now - on the contrary I think conflict
> > > resolution belongs outside core
> >
> > It's a pretty standard requirement to have user defined conflict
> > handling, if that's what you mean.
> >
> > I'm OK with conflict handling being outside of core as a module, if
> > that's what people think. We just need a commit hook. That is more
> > likely to allow a workable solution with 9.3 as well, ISTM. It's
> > likely to be contentious...
>
> Hmm, what do you need to happen at commit time?
>
> We already have RegisterXactCallback(), if that's enough...

Let me refer back to my notes and discuss this later; I'm just going from
vague memory here, which isn't helpful.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
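For reference, the hook Heikki mentions already exists in the backend. Below
is a minimal sketch of how a loadable module could use it; the registration
call and the XactEvent values are the real API from access/xact.h, while the
callback body and its name are invented placeholders for whatever a conflict
handler would actually do.

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/xact.h"

    PG_MODULE_MAGIC;

    void _PG_init(void);

    /* Invoked by the transaction machinery at end of transaction. */
    static void
    conflict_hook_xact_callback(XactEvent event, void *arg)
    {
        switch (event)
        {
            case XACT_EVENT_COMMIT:
                /* e.g. finalize conflict-resolution state for this xact */
                elog(DEBUG1, "conflict hook: transaction committed");
                break;
            case XACT_EVENT_ABORT:
                /* discard any per-transaction conflict state */
                break;
            default:
                break;
        }
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(conflict_hook_xact_callback, NULL);
    }

Whether a callback at this point is early enough (for example, to veto or
rewrite a commit) is exactly the open question in the exchange above.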
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund wrote:
> To recap why we think origin_id is a sensible design choice:
>
> There are many sensible replication topologies where it does make sense
> that you want to receive changes (on node C) from one node (say B) that
> originated from some other node (say A).
> Reasons include:
> * the order of applying changes should be as similar as possible on all
>   nodes. That means when applying a change on C that originated on B, and
>   if changes replicated faster from A->B than from A->C, you want to be at
>   least as far with the replication from A as B was. Otherwise the conflict
>   ratio will increase. If you can recreate the stream from the wal of every
>   node and still detect where an individual change originated, that's easy.

OK, so in this case, I still don't see how the "origin_id" is even enough.

C applies the change originally from A (routed through B, because it's
faster). But when it gets the change directly from A, how does it know to
*not* apply it again?

> * the interconnects between some nodes may be more expensive than others
> * an interconnect between two nodes may fail while others don't
>
> Because of that we think it is sensible to be able to generate the full
> LCR stream, with all changes, local and remote ones, on each individual
> node. If you can then filter on individual origin_ids you can build
> complex replication topologies without much additional complexity.
>
> > I'm not saying that we need to implement all possible conflict
> > resolution algorithms right now - on the contrary I think conflict
> > resolution belongs outside core - but if we're going to change the WAL
> > record format to support such conflict resolution, we better make sure
> > the foundation we provide for it is solid.
>
> I think this already provides a lot. At some point we probably want to
> have support for looking up on which node a certain local xid originated
> and when it was originally executed. While querying that efficiently
> requires additional support, we already have all the information for that.
>
> There are some more complexities with consistently determining conflicts
> on changes that happened in a very small time window on different nodes,
> but that's something for another day.
>
> > BTW, one way to work around the lack of origin id in the WAL record
> > header is to just add an origin-id column to the table, indicating the
> > last node that updated the row. That would be a kludge, but I thought
> > I'd mention it..
>
> Yuck. The aim is to improve on what's done today ;)

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 22:11, Simon Riggs wrote:
> On 21 June 2012 02:56, Simon Riggs wrote:
> > I think allowing rmgrs to redefine the wasted bytes in the header is
> > the best idea.
>
> Hmm, I think the best idea is to save 2 bytes off the WAL header for all
> records, so there are no wasted bytes on 64bit or 32bit. That way the
> potential for use goes away and there's benefit for all, plus no argument
> about how to use those bytes in rarer cases. I'll work on that.

I don't think that's actually necessary; the WAL bloat isn't *that* bad that
we need to start shaving bytes from there. I was just trying to make a point.

> And then we just put the originid on each heap record for MMR, in some
> manner, discussed later.

I reserve the right to object to that, too :-). Others raised the concern
that a 16-bit integer is not a very intuitive identifier. Also, as discussed,
for more complex scenarios just the originid is not sufficient. ISTM that we
need more flexibility.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > > On 20.06.2012 11:17, Simon Riggs wrote:
> > > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > > So, if the origin id is not sufficient for some conflict resolution
> > > > > mechanisms, what extra information do you need for those, and where
> > > > > do you put it?
> > > >
> > > > As explained elsewhere, wal_level = logical (or similar) would be
> > > > used to provide any additional logical information required.
> > > >
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> > > >
> > > > In the case of reflexive updates you raised, a typical response in
> > > > other DBMS would be to represent the query
> > > >
> > > >    UPDATE SET counter = counter + 1
> > > >
> > > > by sending just the "+1" part, not the current value of counter, as
> > > > would be the case with the non-reflexive update
> > > >
> > > >    UPDATE SET counter = 1
> > > >
> > > > Handling such things in Postgres would require some subtlety, which
> > > > would not be resolved in first release but is pretty certain not to
> > > > require any changes to the WAL record header as a way of resolving
> > > > it. Having already thought about it, I'd estimate that is a very long
> > > > discussion and not relevant to the OT, but if you wish to have it
> > > > here, I won't stop you.
> > >
> > > Yeah, I'd like to hear briefly how you would handle that without any
> > > further changes to the WAL record header.
> >
> > I already did:
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> >
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> >
> > That has nothing to do with the WAL record header.
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update
> and delete records.

Imo the whole "+1" stuff doesn't have anything to do with the origin_id
proposal and should be ignored for quite a while. We might go to something
like it sometime in the future, but it's nothing we are working on (as far as
I know ;)).

wal_level=logical (in patch 07) currently only changes the following things
about the wal stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE):
* prevent full page writes from removing the row data (could be optimized at
  some point to just store the tuple slot)

For HEAP_DELETE:
* add the primary key of the changed row

HEAP_MULTI_INSERT obviously needs to get the same treatment in the future.

The only real addition that I foresee in the near future is logging the old
primary key when the primary key changes in HEAP_UPDATE. Kevin wants an
option for full pre-images of rows in HEAP_(UPDATE|DELETE).

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header
> alone? Do you ever need extra information on other record types?

It's needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT,
HEAP_NEWPAGE, HEAP_XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT,
ABORT, ABORT_PREPARED), and probably some I don't remember right now.

Sure, we can add it to all of those, but then you need individual knowledge
about *all* of them, because the location where the origin is stored will be
different for each one.

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that
you want to receive changes (on node C) from one node (say B) that originated
from some other node (say A).

Reasons include:
* the order of applying changes should be as similar as possible on all
  nodes. That means when applying a change on C that originated on B, and if
  changes replicated faster from A->B than from A->C, you want to be at least
  as far with the replication from A as B was. Otherwise the conflict ratio
  will increase. If you can recreate the stream from the wal of every node
  and still detect where an individual change originated, that's easy.
* the interconnects between some nodes may be more expensive than others
* an interconnect between two nodes may fail while others don't

Because of that we think it is sensible to be able to generate the full LCR
stream, with all changes, local and remote ones, on each individual node. If
you can then filter on individual origin_ids you can build complex
replication topologies without much additional complexity.

> I'm not saying that we need to implement all possible conflict
> resolution algorithms right now - on the contrary I think conflict
> resolution belongs outside core - but if we're going to change the WAL
> record format to support such conflict resolution, we better make sure
> the foundation we provide for it is solid.

I think this already provides a lot. At some point we probably want to have
support for looking up on which node a certain local xid originated and when
it was originally executed. While querying that efficiently requires
additional support, we already have all the information for that.

There are some more complexities with consistently determining conflicts on
changes that happened in a very small time window on different nodes, but
that's something for another day.

> BTW, one way to work around the lack of origin id in the WAL record
> header is to just add an origin-id column to the table, indicating the
> last node that updated the row. That would be a kludge, but I thought
> I'd mention it..

Yuck. The aim is to improve on what's done today ;)

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
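The low-level filtering argument is easiest to see in code. Below is a
simplified, hypothetical record header (not the actual PostgreSQL XLogRecord;
the struct and field names are invented) with the origin in a fixed place, so
a filter can run with no rmgr-specific knowledge, contrasted with the
per-record-type alternative that needs one accessor per record type.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;

    /* Hypothetical, simplified record header -- not the real XLogRecord. */
    typedef struct SimpleRecordHeader
    {
        uint32_t     rec_len;      /* total length of the record */
        uint8_t      rmid;         /* resource manager id */
        uint8_t      info;         /* rmgr-specific flag bits */
        RepOriginId  origin_id;    /* node the change originated on */
    } SimpleRecordHeader;

    /*
     * With the origin in the header, filtering is rmgr-agnostic: one
     * comparison, before the record is decoded or the transaction
     * reassembled.
     */
    static bool
    keep_record(const SimpleRecordHeader *rec, RepOriginId unwanted_origin)
    {
        return rec->origin_id != unwanted_origin;
    }

    /*
     * The alternative discussed above: the origin lives somewhere inside the
     * rmgr-specific payload, so each record type (heap insert, update,
     * delete, multi-insert, the commit variants, ...) needs its own code to
     * answer the same question.
     */
    static RepOriginId
    origin_from_payload(uint8_t rmid, const char *payload)
    {
        /* one case per record type would go here */
        (void) rmid;
        (void) payload;
        return 0;   /* placeholder: origin not recoverable generically */
    }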
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 21:51, Simon Riggs wrote:
> On 21 June 2012 02:32, Heikki Linnakangas wrote:
> > I'm not saying that we need to implement all possible conflict
> > resolution algorithms right now - on the contrary I think conflict
> > resolution belongs outside core
>
> It's a pretty standard requirement to have user defined conflict
> handling, if that's what you mean.
>
> I'm OK with conflict handling being outside of core as a module, if
> that's what people think. We just need a commit hook. That is more
> likely to allow a workable solution with 9.3 as well, ISTM. It's
> likely to be contentious...

Hmm, what do you need to happen at commit time?

We already have RegisterXactCallback(), if that's enough...

> > BTW, one way to work around the lack of origin id in the WAL record
> > header is to just add an origin-id column to the table, indicating the
> > last node that updated the row. That would be a kludge, but I thought
> > I'd mention it..
>
> err, I hope you mean that to be funny. (It wouldn't actually work either)

No, I wasn't serious that we should implement it that way. But now you made
me curious; why would it not work? If there's an origin-id column in a table,
it's included in every heap insert/delete/update WAL record. Just set it to
the current node's id on a local modification, and to the origin's id when
replaying changes from another node, and you have the exact same information
as you would with the extra field in the WAL record header.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:56, Simon Riggs wrote:
> I think allowing rmgrs to redefine the wasted bytes in the header is
> the best idea.

Hmm, I think the best idea is to save 2 bytes off the WAL header for all
records, so there are no wasted bytes on 64bit or 32bit. That way the
potential for use goes away and there's benefit for all, plus no argument
about how to use those bytes in rarer cases. I'll work on that.

And then we just put the originid on each heap record for MMR, in some
manner, discussed later.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:45, Heikki Linnakangas wrote:
> On 20.06.2012 16:46, Simon Riggs wrote:
> > The proposal now includes flag bits that would allow the addition of a
> > variable length header, should that ever become necessary. So the
> > unused space in the fixed header is not being "used up" as you say. In
> > any case, the fixed header still has 4 wasted bytes on 64bit systems
> > even after the patch is applied. So this claim of short sightedness is
> > just plain wrong.
> > ...
> > We need to add information to every WAL record that is used as the
> > source for generating LCRs. It is also possible to add this to HEAP
> > and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
> > using the *currently wasted* bytes on a WAL record header does *not*
> > bloat the WAL stream.

Wonderful ideas; these look good.

> Or, we could provide a mechanism for resource managers to use those
> padding bytes for whatever data they wish to use.

Sounds better to me.

> Or modify the record format so that the last 4 bytes of the data in the
> WAL record are always automatically stored in those padding bytes, thus
> making all WAL records 4 bytes shorter. That would make the WAL even more
> compact, with only a couple of extra CPU instructions in the critical
> path.

Sounds cool, but a little weird, even for me.

> My point is that it's wrong to think that it's free to use those bytes,
> just because they're currently unused. If we use them for one thing, we
> can't use them for other things anymore. If we're so concerned about WAL
> bloat that we can't afford to add any more bytes to the WAL record header
> or heap WAL records, then it would be equally fruitful to look at ways to
> use those padding bytes to save that precious WAL space.

Agreed. Thanks for sharing those ideas. Exactly why I like the list
(really...)

> I don't think we're *that* concerned about the WAL bloat, however. So
> let's see what is the most sensible place to add whatever extra
> information we need in the WAL, from the point of view of
> maintainability, flexibility, readability etc. Then we can decide where
> to put it.

Removing FPWs is still the most important aspect there.

I think allowing rmgrs to redefine the wasted bytes in the header is the
best idea.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:32, Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > > On 20.06.2012 11:17, Simon Riggs wrote:
> > > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > > So, if the origin id is not sufficient for some conflict resolution
> > > > > mechanisms, what extra information do you need for those, and where
> > > > > do you put it?
> > > >
> > > > As explained elsewhere, wal_level = logical (or similar) would be
> > > > used to provide any additional logical information required.
> > > >
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> > > >
> > > > In the case of reflexive updates you raised, a typical response in
> > > > other DBMS would be to represent the query
> > > >
> > > >    UPDATE SET counter = counter + 1
> > > >
> > > > by sending just the "+1" part, not the current value of counter, as
> > > > would be the case with the non-reflexive update
> > > >
> > > >    UPDATE SET counter = 1
> > > >
> > > > Handling such things in Postgres would require some subtlety, which
> > > > would not be resolved in first release but is pretty certain not to
> > > > require any changes to the WAL record header as a way of resolving
> > > > it. Having already thought about it, I'd estimate that is a very long
> > > > discussion and not relevant to the OT, but if you wish to have it
> > > > here, I won't stop you.
> > >
> > > Yeah, I'd like to hear briefly how you would handle that without any
> > > further changes to the WAL record header.
> >
> > I already did:
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> >
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> >
> > That has nothing to do with the WAL record header.
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update and
> delete records.

Yes.

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header alone?
> Do you ever need extra information on other record types?

No extra info on other record types, in general at least. Doing it that way
is the most logical place, just not the most efficient.

> The other question is, *what* information do you need to put there? We
> don't necessarily need to have a detailed design of that right now, but it
> seems quite clear to me that we need more flexibility there than this
> patch provides, to support more complicated conflict resolution.

Another patch already covers that for the common non-reflexive case. More on
conflict resolution soon, I guess. I was going to begin working on that in
July, starting with the design discussed on list, of course. This patch has
thrown up stuff I thought was compartmentalised, but I was wrong.

> I'm not saying that we need to implement all possible conflict resolution
> algorithms right now - on the contrary I think conflict resolution belongs
> outside core

It's a pretty standard requirement to have user defined conflict handling,
if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if that's
what people think. We just need a commit hook. That is more likely to allow
a workable solution with 9.3 as well, ISTM. It's likely to be contentious...

> - but if we're going to change the WAL record format to support such
> conflict resolution, we better make sure the foundation we provide for it
> is solid.

Agreed.

> BTW, one way to work around the lack of origin id in the WAL record header
> is to just add an origin-id column to the table, indicating the last node
> that updated the row. That would be a kludge, but I thought I'd mention
> it..

err, I hope you mean that to be funny. (It wouldn't actually work either)

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 16:46, Simon Riggs wrote:
> The proposal now includes flag bits that would allow the addition of a
> variable length header, should that ever become necessary. So the
> unused space in the fixed header is not being "used up" as you say. In
> any case, the fixed header still has 4 wasted bytes on 64bit systems
> even after the patch is applied. So this claim of short sightedness is
> just plain wrong.
> ...
> We need to add information to every WAL record that is used as the
> source for generating LCRs. It is also possible to add this to HEAP
> and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
> using the *currently wasted* bytes on a WAL record header does *not*
> bloat the WAL stream.

Or, we could provide a mechanism for resource managers to use those padding
bytes for whatever data they wish to use. Or modify the record format so that
the last 4 bytes of the data in the WAL record are always automatically
stored in those padding bytes, thus making all WAL records 4 bytes shorter.
That would make the WAL even more compact, with only a couple of extra CPU
instructions in the critical path.

My point is that it's wrong to think that it's free to use those bytes, just
because they're currently unused. If we use them for one thing, we can't use
them for other things anymore. If we're so concerned about WAL bloat that we
can't afford to add any more bytes to the WAL record header or heap WAL
records, then it would be equally fruitful to look at ways to use those
padding bytes to save that precious WAL space.

I don't think we're *that* concerned about the WAL bloat, however. So let's
see what is the most sensible place to add whatever extra information we need
in the WAL, from the point of view of maintainability, flexibility,
readability etc. Then we can decide where to put it.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
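For readers without the source at hand, the padding being argued over comes
from alignment of the fixed record header. The layout below is an
approximation of the 9.2-era XLogRecord, reproduced from memory for
illustration only; see src/include/access/xlog.h for the authoritative
definition.

    #include <stdint.h>

    /* Approximation of the fixed WAL record header circa PostgreSQL 9.2. */
    typedef struct ApproxXLogRecord
    {
        uint32_t  xl_crc;        /* CRC of this record              (4 bytes) */
        uint64_t  xl_prev;       /* link to previous record         (8 bytes) */
        uint32_t  xl_xid;        /* transaction id                  (4 bytes) */
        uint32_t  xl_tot_len;    /* total length of entire record   (4 bytes) */
        uint32_t  xl_len;        /* length of rmgr-specific data    (4 bytes) */
        uint8_t   xl_info;       /* flag bits                       (1 byte)  */
        uint8_t   xl_rmid;       /* resource manager id             (1 byte)  */
        /*
         * 26 bytes of fields so far.  The rmgr data that follows must be
         * MAXALIGNed, so the struct is padded to 28 bytes (MAXALIGN 4) or
         * 32 bytes (MAXALIGN 8) -- a few currently unused bytes per record.
         * The origin_id proposal claims 2 of them; the alternatives above
         * would hand them to the rmgrs or fill them with record data.
         */
    } ApproxXLogRecord;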
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 07:50:37 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund wrote:
> > > I realized a problem with that idea this morning: it might work for
> > > reading things, but if anyone attempts to write data you've got big
> > > problems. Maybe we could get away with forbidding that, not sure.
> >
> > Hm, why is writing a problem? You mean io conversion routines writing
> > data? Yes, that will be a problem. I am fine with simply forbidding
> > that, we should be able to catch that and provide a sensible error
> > message, since SSI we have the support for that.
>
> I think we could do something a little more vigorous than that, like
> maybe error out if anyone tries to do anything that would write WAL or
> acquire an XID.

I would go for all of them ;). The read only transaction warnings will
probably result in the best error messages.

> Of course, then the question becomes: then what? We probably need to
> think about what happens after that - we don't want an error replicating
> one row in one table to permanently break replication for the entire
> system.

Interesting problem, yes. My first reaction was just to warn, log and throw
the transaction away... But that's not likely to work very well on the apply
side...

I don't think it's a particularly likely problem though. An I/O function
doing that would probably fail in lots of other situations (HS, read only
transactions, permission problems).

> > > Sorry. I don't think you're planning to do something evil, but before
> > > I thought you said you did NOT want to write the code to extract
> > > changes as text or something similar.
> >
> > Hm. I might have been a bit ambiguous when saying that I do not want to
> > provide everything for that use-case.
> > Once we have a callpoint that has a correct catalog snapshot for exactly
> > the tuple in question, text conversion is damn near trivial. The point
> > where you get passed all that information (action, tuple, table,
> > snapshot) is the one I think the patch should mainly provide.
>
> This is actually a very interesting list. We could rephrase the
> high-level question about the design of this feature as "what is the
> best way to make sure that you have these things available?". Action
> and tuple are trivial to get, and table isn't too hard either. It's
> really the snapshot - and all the downstream information that can only
> be obtained via using that snapshot - that is the hard part.

For others, a sensible entry point into this discussion before switching
subthreads probably is
http://archives.postgresql.org/message-id/201206192023.20589.and...@2ndquadrant.com

The table isn't as easy as you might think, as the wal record only contains
the relfilenode. Unless you want to log more, it is basically solved by
solving the snapshot issue. And yes, agreeing on how to do that is the thing
we need to solve next. Patch 07 (add enough information to wal) and this one
are the first that should get committed imo. And they should be ready for
that obviously.

Just noticed that I failed to properly add Patch 07 to the commitfest. I have
done so now, hope that's ok.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 17:35, Simon Riggs wrote:
> On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > On 20.06.2012 11:17, Simon Riggs wrote:
> > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > So, if the origin id is not sufficient for some conflict resolution
> > > > mechanisms, what extra information do you need for those, and where
> > > > do you put it?
> > >
> > > As explained elsewhere, wal_level = logical (or similar) would be used
> > > to provide any additional logical information required.
> > >
> > > Update and Delete WAL records already need to be different in that
> > > mode, so additional info would be placed there, if there were any.
> > >
> > > In the case of reflexive updates you raised, a typical response in
> > > other DBMS would be to represent the query
> > >
> > >    UPDATE SET counter = counter + 1
> > >
> > > by sending just the "+1" part, not the current value of counter, as
> > > would be the case with the non-reflexive update
> > >
> > >    UPDATE SET counter = 1
> > >
> > > Handling such things in Postgres would require some subtlety, which
> > > would not be resolved in first release but is pretty certain not to
> > > require any changes to the WAL record header as a way of resolving it.
> > > Having already thought about it, I'd estimate that is a very long
> > > discussion and not relevant to the OT, but if you wish to have it
> > > here, I won't stop you.
> >
> > Yeah, I'd like to hear briefly how you would handle that without any
> > further changes to the WAL record header.
>
> I already did:
> > > Update and Delete WAL records already need to be different in that
> > > mode, so additional info would be placed there, if there were any.
>
> The case you mentioned relates to UPDATEs only, so I would suggest
> that we add that information to a new form of update record only.
>
> That has nothing to do with the WAL record header.

Hmm, so you need the origin id in the WAL record header to do filtering.
Except when that's not enough, you add some more fields to heap update and
delete records. Don't you think it would be simpler to only add the extra
fields to heap insert, update and delete records, and leave the WAL record
header alone? Do you ever need extra information on other record types?

The other question is, *what* information do you need to put there? We don't
necessarily need to have a detailed design of that right now, but it seems
quite clear to me that we need more flexibility there than this patch
provides, to support more complicated conflict resolution.

I'm not saying that we need to implement all possible conflict resolution
algorithms right now - on the contrary I think conflict resolution belongs
outside core - but if we're going to change the WAL record format to support
such conflict resolution, we better make sure the foundation we provide for
it is solid.

BTW, one way to work around the lack of origin id in the WAL record header is
to just add an origin-id column to the table, indicating the last node that
updated the row. That would be a kludge, but I thought I'd mention it..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
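To illustrate the "counter = counter + 1" point: shipping the operation
rather than the resulting value makes concurrent increments from different
origins commute, so they can be applied in any order without a conflict. A
toy standalone sketch (the types and names are invented; nothing like this
exists in the patch set):

    #include <stdint.h>

    /* Two hypothetical ways an LCR could describe the same UPDATE. */
    typedef enum ChangeKind
    {
        CHANGE_SET_VALUE,   /* UPDATE ... SET counter = 1          : new value */
        CHANGE_ADD_DELTA    /* UPDATE ... SET counter = counter + 1: ship "+1" */
    } ChangeKind;

    typedef struct CounterChange
    {
        ChangeKind kind;
        int64_t    value;       /* new value, or delta to add */
    } CounterChange;

    /*
     * Applying a delta commutes with other deltas, so two nodes incrementing
     * the same row concurrently converge without conflict resolution.
     * Applying an absolute value does not commute and needs a conflict
     * policy instead.
     */
    static int64_t
    apply_change(int64_t current, const CounterChange *c)
    {
        if (c->kind == CHANGE_ADD_DELTA)
            return current + c->value;
        return c->value;
    }

As both sides note above, none of this needs anything from the WAL record
header; it is extra payload in the update record itself.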
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 01:06, Christopher Browne wrote:
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure. In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.

Unfortunately you do. That's really the core of how this differs from current
Slony.

Every change we make creates WAL records, whether that is a change
originating on the current node or a change originating on an upstream node
that needs to be applied on the current node. The WAL stream is then read and
filtered for changes to pass on to other nodes. So we want to be able to
filter out the applied changes to avoid passing them back to the original
nodes.

Having each record know the origin makes the filtering much simpler, so if
it's possible to do that efficiently then it's the best design. It turns out
to be the best way to do this so far known. There are other designs, however,
as noted. In all cases we need the origin id in the WAL.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi Chris!

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund wrote:
> > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> > > Simon Riggs wrote:
> > > > This is not transaction metadata, it is WAL record metadata
> > > > required for multi-master replication, see later point.
> > > >
> > > > We need to add information to every WAL record that is used as the
> > > > source for generating LCRs.
> > >
> > > If the origin ID of a transaction doesn't count as transaction
> > > metadata (i.e., data about the transaction), what does? It may be a
> > > metadata element about which you have special concerns, but it is
> > > transaction metadata. You don't plan on supporting individual WAL
> > > records within a transaction containing different values for origin
> > > ID, do you? If not, why is it something to store in every WAL
> > > record rather than once per transaction? That's not intended to be
> > > a rhetorical question.
> >
> > It's definitely possible to store it per transaction (see the discussion
> > around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com);
> > it just makes filtering on the originating node a considerably more
> > complex thing. With our proposal you can do it without any complexity
> > involved, on a low level. Storing it per transaction means you can only
> > stream out the data to other nodes *after* fully reassembling the
> > transaction. That's a pity, especially if we go for a design where the
> > decoding happens in a proxy instance.
>
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure. In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.
>
> It seems as though the point in time at which you need to know the
> origin ID is the moment at which you're deciding to read data from the
> WAL files, and knowing which stream you are reading from is an
> assertion that might be satisfied by looking at configuration that
> doesn't need to be in the WAL stream itself. It might be *nice* for
> the WAL stream to be self-identifying, but that doesn't seem to be
> forcibly necessary.
>
> The case where it *would* be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored downstream
> - That INSERT was replicated from node #2, so should be ignored downstream
> - This UPDATE came from the local node, so needs to be passed to
>   downstream users

Exactly, that is the point. And you want to do that in an efficient manner
without too much logic; that's why something simple like the record header is
so appealing.

> > I also have to admit that I am very hesitant to start developing some
> > generic "transaction metadata" framework atm. That seems to be a good
> > way to spend a good part of the time in discussion and disagreeing. Imo
> > that's something for later.
>
> Well, I see there being a use in there being at least 3 sorts of LCR
> records:
> a) Capturing literal SQL that is to be replayed downstream
> b) Capturing tuple updates in a binary form that can be turned readily
>    into heap updates on a replica.
> c) Capturing tuple data in some reasonably portable and readily
>    re-writable form

I think we should provide the utilities to do all of those. a) is a
consequence of being able to do c). That doesn't really have anything to do
with this subthread though? The part you quoted above was my response to the
suggestion to add some generic framework to attach metadata to individual
transactions on the generating side. We quite possibly will end up needing
that, but I personally don't think we should be designing that part atm.

> b) Capturing tuple updates in a binary form that can be turned readily
>    into heap updates on a replica.
>    Unfortunately, this form is likely not to play well when
>    replicating across platforms or Postgres versions, so I suspect that
>    this performance optimization should be implemented as a *last*
>    resort, rather than first. Michael Jackson had some "rules of
>    optimization" that said "don't do it", and, for the expert, "don't do
>    it YET..."

Well, apply is a bottleneck. Besides field experience, I/we have benchmarked
it, and it is rather plausible that it is. And I don't think we can magically
make that faster in pg in general, so my plan is to remove the biggest cost
factor I can see. And yes, it will have restrictions...

Regards,

Andres Freund
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
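Chris's three LCR forms can be thought of as one stream with several payload
representations. A purely illustrative sketch of that distinction (none of
these types exist in the patch set):

    #include <stddef.h>

    /* Hypothetical payload representations for a logical change record. */
    typedef enum LCRPayloadKind
    {
        LCR_PAYLOAD_SQL,        /* (a) literal SQL text to replay downstream    */
        LCR_PAYLOAD_BINARY,     /* (b) binary tuple; same platform/version only */
        LCR_PAYLOAD_PORTABLE    /* (c) typed datums in a portable text form     */
    } LCRPayloadKind;

    typedef struct LCRPayload
    {
        LCRPayloadKind kind;    /* how 'data' is to be interpreted */
        size_t         len;     /* payload length in bytes */
        const char    *data;    /* payload itself */
    } LCRPayload;

The apply-speed argument in the thread is essentially that (b) is the fast
path while (a) and (c) are the portable fallbacks.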
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 01:40, Andres Freund wrote:
> > I think extraction is a very sensible place to start; actually, I
> > think it's the best possible place to start. But this particular
> > thread is about adding origin_ids to WAL, which I think is definitely
> > not the best place to start.
>
> Yep. I think the reason everyone started at it is that the patch was
> actually really simple ;).
> Note that the wal enrichment & decoding patches were before the origin_id
> patch in the patch series ;)

Actually, I thought it would be non-contentious... A format change, so early
in the release cycle. Already openly discussed with no comment. Using wasted
space for efficiency, for information already in use by Slony for similar
reasons. Oh well.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund wrote:
> > I realized a problem with that idea this morning: it might work for
> > reading things, but if anyone attempts to write data you've got big
> > problems. Maybe we could get away with forbidding that, not sure.
>
> Hm, why is writing a problem? You mean io conversion routines writing
> data? Yes, that will be a problem. I am fine with simply forbidding that,
> we should be able to catch that and provide a sensible error message,
> since SSI we have the support for that.

I think we could do something a little more vigorous than that, like maybe
error out if anyone tries to do anything that would write WAL or acquire an
XID. Of course, then the question becomes: then what? We probably need to
think about what happens after that - we don't want an error replicating one
row in one table to permanently break replication for the entire system.

> > Sorry. I don't think you're planning to do something evil, but before
> > I thought you said you did NOT want to write the code to extract
> > changes as text or something similar.
>
> Hm. I might have been a bit ambiguous when saying that I do not want to
> provide everything for that use-case.
> Once we have a callpoint that has a correct catalog snapshot for exactly
> the tuple in question, text conversion is damn near trivial. The point
> where you get passed all that information (action, tuple, table,
> snapshot) is the one I think the patch should mainly provide.

This is actually a very interesting list. We could rephrase the high-level
question about the design of this feature as "what is the best way to make
sure that you have these things available?". Action and tuple are trivial to
get, and table isn't too hard either. It's really the snapshot - and all the
downstream information that can only be obtained via using that snapshot -
that is the hard part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
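A rough idea of the guard Robert describes, using existing backend
facilities. Whether these are the right checks, and where they would be
installed, is exactly what is being discussed; the function itself is
invented for illustration.

    #include "postgres.h"
    #include "access/xact.h"
    #include "access/transam.h"

    /*
     * Sketch: called from a decoding context before (or after) running
     * user-supplied code such as output/type conversion functions.  If the
     * transaction has acquired an XID, something attempted a write, so bail
     * out with a clear error instead of corrupting the decoding state.
     */
    static void
    ensure_decoding_is_read_only(void)
    {
        if (TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot write data while decoding WAL changes")));
    }

The harder question raised above remains what to do afterwards, so that a
single failing row does not permanently wedge replication for the whole
system.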
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 23:56, Robert Haas wrote:
> On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs wrote:
> > > > > But I think getting even single-master logical replication working
> > > > > well in a single release cycle is going to be a job and a half.
> > > >
> > > > OK, so your estimate is 1.5 people to do that. And if we have more
> > > > people, should they sit around doing nothing?
> > >
> > > Oh, give me a break. You're willfully missing my point. And to quote
> > > Fred Brooks, nine women can't make a baby in one month.
> >
> > No, I'm not. The question is not how quickly can N people achieve a
> > single thing, but how long will it take a few skilled people working
> > on carefully selected tasks that have few dependencies between them to
> > achieve something.
>
> The bottleneck is getting the design right, not writing the code.
> Selecting tasks for people to work on without an agreement on the
> design will not advance the process, unless we just accept whatever
> code you choose to write based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on different
aspects, and much work going into prototyping to give some useful information
to debates that would otherwise be bikeshedding.

> > How exactly did you arrive at your conclusion? Why is yours right and
> > mine wrong?
>
> I estimated the amount of work that would be required to do this right and
> compared it to other large projects that have been successfully done in
> the past. I think you are looking at something on the order of magnitude
> of the Windows port, which took about four releases to become stable, or
> the SE-Linux project, which still isn't feature-complete. Even if it's
> only a HS-sized project, that took two releases, as did streaming
> replication. SSI got committed within one release cycle, but there were
> several years of design and advocacy work before any code was written, so
> that, too, was really a multi-year project.

The current BDR project uses concepts that have been in discussion for many
years, something like 6-7 years. I first wrote multi-master for Postgres in
userspace in 2007, and the current designs build upon that. My detailed
design for WAL-based logical replication was produced in 2009. I'm not really
breaking new ground in the design, and we're reusing significant parts of
existing code. You don't know that because you didn't think to ask, which is
a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while other
aspects are considered.

> I'll confine my comments on the second part of the question to the
> observation that it is a bit early to know who is right and who is wrong,
> but the question could just as easily be turned on its head.
>
> > No, I have four people who had initial objections and who have not
> > commented on the fact that the points made are regrettably incorrect.
>
> I think Kevin addressed this point better than I can. Asserting something
> doesn't make it true, and you haven't offered any rational argument
> against the points that have been made, probably because there isn't one.
> We *cannot* justify stealing 100% of the available bit space for a feature
> that many people won't use and may not be enough to address the real
> requirement anyway.

You have some assertions above that aren't correct, so it's good that you
recognise that assertions may not be true, on both sides.

I ask that you listen a little before casting judgements. We won't get
anywhere otherwise.

The current wasted space on 64bit systems is 6 bytes per record. The current
suggestion is to use 2 of those, in a flexible manner that allows future
expansion if it is required, and only if it is required. That flexibility
covers anything else we need. Asserting that "100% of the available space"
is being used is wrong, and to continue to assert that even when told
otherwise multiple times is something else.

> > Since at least 3 of the people making such comments did not attend the
> > full briefing meeting in Ottawa, I am not particularly surprised.
> > However, I do expect people that didn't come to the meeting to
> > recognise that they are likely to be missing information and to listen
> > closely, as I listen to them.
>
> Participation in the community development process is not contingent on
> having flown to Ottawa in May, or on having decided to spend that evening
> at your briefing meeting. Attributing to ignorance what is adequately
> explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they did
not, then they do need to be aware that they missed hours of community
discussion and explanation, built on top of months of careful investigation,
all of which was aimed at helping everybody understand. Jumping straight
into a discussion is what the community is about
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund wrote: > > I would prefer the event trigger way because that seems to be the most > > extensible/reusable. It would allow a fully replicated databases and > > catalog only instances. > > I think we need to design event triggers in a way you cannot simply > > circumvent them. We already have the case that if users try to screw > > around system triggers we give back wrong answers with the planner > > relying on foreign keys btw. > > If the problem is having user trigger after system triggers: Lets make > > that impossible. Forbidding DDL on the other instances once we have that > > isn't that hard. > > So, this is interesting. I think something like this could solve the > problem, but then why not just make it built-in code that runs from > the same place as the event trigger rather than using the trigger > mechanism per se? Presumably the "trigger" code's purpose is going to > be to inject additional data into the WAL stream (am I wrong?) which > is not something you're going to be able to do from PL/pgsql anyway, > so you don't really need a trigger, just a call to some C function - > which not only has the advantage of being not bypassable, but is also > faster. I would be totally fine with that. As long as event triggers provide the infrastructure that shouldn't be a big problem. > > Perhaps all that will get simpler if we can make reading the catalog via > > custom built snapshots work as you proposed otherwhere in this thread. > > That would make checking errors way much easier even if you just want to > > apply to a database with exactly the same schema. Thats the next thing I > > plan to work on. > I realized a problem with that idea this morning: it might work for > reading things, but if anyone attempts to write data you've got big > problems. Maybe we could get away with forbidding that, not sure. Hm, why is writing a problem? You mean io conversion routines writing data? Yes, that will be a problem. I am fine with simply forbidding that, we should be able to catch that and provide a sensible error message, since SSI we have the support for that. > Would be nice to get some input from other hackers on this. Oh, yes! > > I agree that the focus isn't 100% optimal and that there are *loads* of > > issues we haven't event started to look at. But you need a point to > > start and extraction & apply seems to be a good one because you can > > actually test it without the other issues solved which is not really the > > case the other way round. > > Also its possible to plug in the newly built changeset extraction into > > existing solutions to make them more efficient while retaining most of > > their respective framework. > > > > So I disagree that thats the wrong part to start with. > > I think extraction is a very sensible place to start; actually, I > think it's the best possible place to start. But this particular > thread is about adding origin_ids to WAL, which I think is definitely > not the best place to start. Yep. I think the reason everyone started at it is that the patch was actually really simple ;). Note that the wal enrichement & decoding patches were before the origin_id patch in the patchseries ;) > > I definitely do want to provide code that generates a textual > > representation of the changes. As you say, even if its not used for > > anything its needed for debugging. Not sure if it should be sql or maybe > > the new slony representation. 
If thats provided and reusable it should > > make sure that ontop of that other solutions can be built. > Oh, yeah. If we can get that, I will throw a party. Good ;) > > I find your supposition that I/we just want to get MMR without regard for > > anything else a bit offensive. I wrote at least three times in this > > thread that I do think its likely that we will not get more than the > > minimal basis for implementing MMR into 9.3. I wrote multiple times that > > I want to provide the basis for multiple solutions. The prototype - > > while obviously being incomplete - tried hard to be modular. > > You cannot blame us that we want the work we do to be *also* usable for > > what one of our major aims? > > What can I do to convince you/others that I am not planning to do > > something "evil" but that I try to reach as many goals at once as > > possible? > Sorry. I don't think you're planning to do something evil, but before > I thought you said you did NOT want to write the code to extract > changes as text or something similar. Hm. I might have been a bit ambiguous when saying that I do not want to provide everything for that use-case. Once we have a callpoint that has a correct catalog snapshot for exactly the tuple in question text conversion is damn near trivial. The point where you get passed all that information (action, tuple, table, snapshot) is the one I think the patch should mainly provide. > I think that
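A sketch of what that callpoint could look like, purely for illustration; the callback name and exact argument list are invented here and are not taken from the posted patches:

    /* Hypothetical per-change callback an output layer could implement. */
    typedef enum { LCR_INSERT, LCR_UPDATE, LCR_DELETE } LCRAction;

    typedef void (*lcr_change_cb) (LCRAction action,
                                   Relation rel,        /* target table */
                                   HeapTuple oldtuple,  /* NULL for INSERT */
                                   HeapTuple newtuple,  /* NULL for DELETE */
                                   Snapshot snapshot);  /* catalog snapshot
                                                         * valid for the tuple */

With the catalog snapshot supplied alongside the tuple, a trivial consumer can render each change as text (for debugging, an SQL-ish form, or the newer Slony format) without knowing anything about how the change was pulled out of WAL.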
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund wrote: > I would prefer the event trigger way because that seems to be the most > extensible/reusable. It would allow a fully replicated databases and catalog > only instances. > I think we need to design event triggers in a way you cannot simply circumvent > them. We already have the case that if users try to screw around system > triggers we give back wrong answers with the planner relying on foreign keys > btw. > If the problem is having user trigger after system triggers: Lets make that > impossible. Forbidding DDL on the other instances once we have that isn't that > hard. So, this is interesting. I think something like this could solve the problem, but then why not just make it built-in code that runs from the same place as the event trigger rather than using the trigger mechanism per se? Presumably the "trigger" code's purpose is going to be to inject additional data into the WAL stream (am I wrong?) which is not something you're going to be able to do from PL/pgsql anyway, so you don't really need a trigger, just a call to some C function - which not only has the advantage of being not bypassable, but is also faster. > Perhaps all that will get simpler if we can make reading the catalog via > custom built snapshots work as you proposed otherwhere in this thread. That > would make checking errors way much easier even if you just want to apply to > a database with exactly the same schema. Thats the next thing I plan to work > on. I realized a problem with that idea this morning: it might work for reading things, but if anyone attempts to write data you've got big problems. Maybe we could get away with forbidding that, not sure. Would be nice to get some input from other hackers on this. >> They could also modify the catalogs directly, although it's possible we >> don't care quite as much about that case (but on the other hand people >> do sometimes need to do it to solve real problems). > With that you already can crash the database perfectly fine today. I think > trying to care for that is a waste of time. You're probably right. > I agree that the focus isn't 100% optimal and that there are *loads* of issues > we haven't event started to look at. But you need a point to start and > extraction & apply seems to be a good one because you can actually test it > without the other issues solved which is not really the case the other way > round. > Also its possible to plug in the newly built changeset extraction into > existing solutions to make them more efficient while retaining most of their > respective framework. > > So I disagree that thats the wrong part to start with. I think extraction is a very sensible place to start; actually, I think it's the best possible place to start. But this particular thread is about adding origin_ids to WAL, which I think is definitely not the best place to start. > I definitely do want to provide code that generates a textual representation > of the changes. As you say, even if its not used for anything its needed for > debugging. Not sure if it should be sql or maybe the new slony representation. > If thats provided and reusable it should make sure that ontop of that other > solutions can be built. Oh, yeah. If we can get that, I will throw a party. > I find your supposition that I/we just want to get MMR without regard for > anything else a bit offensive. 
I wrote at least three times in this thread > that I do think its likely that we will not get more than the minimal basis > for implementing MMR into 9.3. I wrote multiple times that I want to provide > the basis for multiple solutions. The prototype - while obviously being > incomplete - tried hard to be modular. > You cannot blame us that we want the work we do to be *also* usable for what > one of our major aims? > What can I do to convince you/others that I am not planning to do something > "evil" but that I try to reach as many goals at once as possible? Sorry. I don't think you're planning to do something evil, but before I thought you said you did NOT want to write the code to extract changes as text or something similar. I think that would be a really bad thing to skip for all kinds of reasons. I think we need that as a foundational technology before we do much else. Now, once we have that, if we can safely detect cases where it's OK to bypass decoding to text and skip it in just those cases, I think that's great (although possibly difficult to implement correctly). I basically feel that without decode-to-text, this can't possibly be a basis for multiple solutions; it will be a basis only for itself, and extremely difficult to debug, too. No other replication solution can even theoretically have any use for the raw on-disk tuple, at least not without horrible kludgery. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To ma
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund wrote: > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: >> Simon Riggs wrote: >> > This is not transaction metadata, it is WAL record metadata >> > required for multi-master replication, see later point. > >> > We need to add information to every WAL record that is used as the >> > source for generating LCRs. >> If the origin ID of a transaction doesn't count as transaction >> metadata (i.e., data about the transaction), what does? It may be a >> metadata element about which you have special concerns, but it is >> transaction metadata. You don't plan on supporting individual WAL >> records within a transaction containing different values for origin >> ID, do you? If not, why is it something to store in every WAL >> record rather than once per transaction? That's not intended to be >> a rhetorical question. > Its definitely possible to store it per transaction (see the discussion around > http://archives.postgresql.org/message- > id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via > the originating node a considerably more complex thing. With our proposal you > can do it without any complexity involved, on a low level. Storing it per > transaction means you can only stream out the data to other nodes *after* > fully reassembling the transaction. Thats a pitty, especially if we go for a > design where the decoding happens in a proxy instance. I guess I'm not seeing the purpose to having the origin node id in the WAL stream either. We have it in the Slony sl_log_* stream, however there is a crucial difference, in that sl_log_* is expressly a shared structure. In contrast, WAL isn't directly sharable; you don't mix together multiple WAL streams. It seems as though the point in time at which you need to know the origin ID is the moment at which you're deciding to read data from the WAL files, and knowing which stream you are reading from is an assertion that might be satisfied by looking at configuration that doesn't need to be in the WAL stream itself. It might be *nice* for the WAL stream to be self-identifying, but that doesn't seem to be forcibly necessary. The case where it *would* be needful is if you are in the process of assembling together updates coming in from multiple masters, and need to know: - This INSERT was replicated from node #1, so should be ignored downstream - That INSERT was replicated from node #2, so should be ignored downstream - This UPDATE came from the local node, so needs to be passed to downstream users Or perhaps something else is behind the node id being deeply embedded into the stream that I'm not seeing altogether. > Other metadata will not be needed on such a low level. > > I also have to admit that I am very hesitant to start developing some generic > "transaction metadata" framework atm. That seems to be a good way to spend a > good part of time in discussion and disagreeing. Imo thats something for > later. Well, I see there being a use in there being at least 3 sorts of LCR records: a) Capturing literal SQL that is to replayed downstream. This parallels two use cases existing in existing replication systems: i) In pre-2.2 versions of Slony, statements are replayed literally. So there's a stream of INSERT/UPDATE/DELETE statements. ii) DDL capture and replay. In existing replication systems, DDL isn't captured implicitly, the way Dimitri's Event Triggers are to do, but rather is captured explicitly. 
There should be a function to allow injecting such SQL explicitly; that is sure to be a useful sort of thing to be able to do. b) Capturing tuple updates in a binary form that can be turned readily into heap updates on a replica. Unfortunately, this form is likely not to play well when replicating across platforms or Postgres versions, so I suspect that this performance optimization should be implemented as a *last* resort, rather than first. Michael Jackson had some "rules of optimization" that said "don't do it", and, for the expert, "don't do it YET..." c) Capturing tuple data in some reasonably portable and readily re-writable form. Slony 2.2 changes from "SQL fragments" (of a) i) above) to storing updates as an array of text values indicating: - relation name - attribute names - attribute values, serialized into strings I don't know that this provably represents the *BEST* representation, but it definitely will be portable where b) would not be, and lends itself to being able to reuse query plans, where a) requires extraordinary amounts of parsing work, today. So I'm pretty sure it's better than a) and b) for a sizable set of cases. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
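For anyone unfamiliar with the Slony 2.2 format being referenced, option c) amounts to something like the following; the struct is hypothetical, shown only to make the shape concrete (the real sl_log_* layout differs in detail):

    /* Hypothetical portable change representation, per option c) above. */
    typedef struct PortableChange
    {
        char    action;      /* 'I', 'U' or 'D' */
        char   *relname;     /* e.g. "public.accounts" */
        int     natts;
        char  **attnames;    /* e.g. {"id", "balance"} */
        char  **attvalues;   /* values serialized to text,
                              * e.g. {"42", "100.00"}; NULL for SQL NULL */
    } PortableChange;

Because everything is text, the same record applies across platforms and Postgres versions, and the apply side can build one parameterized plan per (relname, attnames) combination and reuse it, which is where the plan-reuse advantage over option a) comes from.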
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 23:34, Kevin Grittner wrote: > Simon Riggs wrote: >> Kevin Grittner wrote: Heikki Linnakangas wrote: >>> I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. >>> >>> Right. >> >> Wrong, as explained. > > The point is not wrong; you are simply not responding to what is > being said. Heikki said that the origin ID was not required for all MMR configs/scenarios. IMHO that is wrong, with explanation given. By agreeing with him, I assumed you were sharing that assertion, rather than saying something else. > You have not explained why an origin ID is required when there is no > replication, or if there is master/slave logical replication, or ... You're right; I never claimed it was needed. Origin Id is only needed for multi-master replication and that is the only context I've discussed it. >> This is not transaction metadata, it is WAL record metadata >> required for multi-master replication, see later point. >> >> We need to add information to every WAL record that is used as the >> source for generating LCRs. > > If the origin ID of a transaction doesn't count as transaction > metadata (i.e., data about the transaction), what does? It may be a > metadata element about which you have special concerns, but it is > transaction metadata. You don't plan on supporting individual WAL > records within a transaction containing different values for origin > ID, do you? If not, why is it something to store in every WAL > record rather than once per transaction? That's not intended to be > a rhetorical question. I think it's because you're still thinking > of the WAL stream as *the medium* for logical replication data > rather than *the source* of logical replication data. > As long as the WAL stream is the medium, options are very > constrained. You can code a very fast engine to handle a single > type of configuration that way, and perhaps that should be a > supported feature, but it's not a configuration I've needed yet. > (Well, on reflection, if it had been available and easy to use, I > can think of *one* time I *might* have used it for a pair of nodes.) > It seems to me that you are so focused on this one use case that you > are not considering how design choices which facilitate fast > development of that use case paint us into a corner in terms of > expanding to other use cases. >>> I think removing origin ID from this patch and submitting a >>> separate patch for a generalized transaction metadata system is >>> the sensible way to go. >> >> We already have a very flexible WAL system for recording data of >> interest to various resource managers. If you wish to annotate a >> transaction, you can either generate a new kind of WAL record or >> you can enhance a commit record. > > Right. Like many of us are suggesting should be done for origin ID. > >> XLOG_NOOP records can already be generated by your application if >> you wish to inject additional metadata to the WAL stream. So no >> changes are required for you to implement the generalised >> transaction metadata scheme you say you require. > > I'm glad it's that easy. Are there SQL functions to for that yet? Yes, another possible design is to generate a new kind of WAL record for the origin id. Doing it that way will slow down multi-master by a measurable amount, and slightly bloat the WAL stream. The proposed way uses space that is currently wasted and likely to remain so. 
Only 2 bytes of 6 bytes available are proposed for use, with a flag design that allows future extension if required. When MMR is not in use, the WAL records would look completely identical to the way they look now, in size, settings and speed of writing them. Putting the origin id onto each WAL record allows very fast and simple stateless filtering. I suggest using it because those bytes have been sitting there unused for close to 10 years now and no better use springs to mind. The proposed design is the fastest way of implementing MMR, without any loss for non-users. As I noted before, slowing down MMR by a small amount causes geometric losses in performance across the whole cluster. >> Not sure how or why that relates to requirements for multi-master. > > That depends on whether you want to leave the door open to other > logical replication than the one use case on which you are currently > focused. I even consider some of those other cases multi-master, > especially when multiple databases are replicating to a single table > on another server. I'm not clear on your definition -- it seems to > be rather more narrow. Maybe we need to define some terms somewhere > to facilitate discussion. Is there a Wiki page where that would > make sense? The project is called BiDirectional Replication to ensure that people understood this is not just multi-master. But that doesn't mean that multi-master can't have its own specific requirements. Adding origin
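The "fast and simple stateless filtering" being claimed would look roughly like the fragment below, assuming the hypothetical two-byte origin field in the record header and a sender-side list of origins the peer does not want over this connection:

    /*
     * Sketch only: decide per record whether to forward it, with no need
     * to reassemble transactions or keep any per-transaction state.
     * xl_origin_id is the proposed (not committed) header field.
     */
    static bool
    record_wanted_by_peer(XLogRecord *record,
                          const uint16 *suppressed_origins, int n)
    {
        int     i;

        for (i = 0; i < n; i++)
        {
            /* peer already gets this origin over a cheaper path */
            if (record->xl_origin_id == suppressed_origins[i])
                return false;
        }
        return true;
    }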
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund wrote: > > Were not the only ones here that are performing scope creep though... I > > think about all people who have posted in the whole thread except maybe > > Tom and Marko are guilty of doing so. > > > > I still think its rather sensible to focus on exactly duplicated schemas > > in a very first version just because that leaves out some of the > > complexity while paving the road for other nice things. > > Well, I guess what I want to know is: what does focusing on exactly > duplicated schemas mean? If it means we'll disable DDL for tables > when we turn on replication, that's basically the Slony approach: when > you want to make a DDL change, you have to quiesce replication, do it, > and then resume replication. I would possibly be OK with that > approach. If it means that we'll hope that the schemas are duplicated > and start spewing garbage data when they're not, then I'm not > definitely not OK with that approach. If it means using event > triggers to keep the catalogs synchronized, then I don't think I don't > think that's adequately robust. The user could add more event > triggers that run before or after the ones the replication system > adds, and then you are back to garbage decoding (or crashes). I would prefer the event trigger way because that seems to be the most extensible/reusable. It would allow a fully replicated databases and catalog only instances. I think we need to design event triggers in a way you cannot simply circumvent them. We already have the case that if users try to screw around system triggers we give back wrong answers with the planner relying on foreign keys btw. If the problem is having user trigger after system triggers: Lets make that impossible. Forbidding DDL on the other instances once we have that isn't that hard. Perhaps all that will get simpler if we can make reading the catalog via custom built snapshots work as you proposed otherwhere in this thread. That would make checking errors way much easier even if you just want to apply to a database with exactly the same schema. Thats the next thing I plan to work on. > They could also modify the catalogs directly, although it's possible we > don't care quite as much about that case (but on the other hand people > do sometimes need to do it to solve real problems). With that you already can crash the database perfectly fine today. I think trying to care for that is a waste of time. > Although I am > 100% OK with pairing back the initial feature set - indeed, I strongly > recommend it - I think that robustness is not a feature which can be > left out in v1 and added in later. All the robustness has to be > designed in at the start, or we will never have it. I definitely don't intend to cut down on robustness. > On the whole, I think we're spending far too much time talking about > code and far too little time talking about what the overall design > should look like. Agreed. > We are having a discussion about whether or not MMR > should be supported by sticking a 16-bit node ID into every WAL record > without having first decided whether we should support MMR, whether > that requires node IDs, whether they should be integers, whether those > integers should be 16 bits in size, whether they should be present in > WAL, and whether or not the record header is the right place to put > them. There's a right order in which to resolve those questions, and > this isn't it. 
More generally, I think there is a ton of complexity > that we're probably overlooking here in focusing in on specific coding > details. I think the most interesting comment made to date is Steve > Singer's observation that very little of Slony is concerned with > changeset extraction or apply. Now, on the flip side, all of these > patches seem to be concerned with changeset extraction and apply. > That suggests that we're missing some pretty significant pieces > somewhere in this design. I think those pieces are things like error > recovery, fault tolerance, user interface design, and control logic. > Slony has spent years trying to get those things right. Whether or > not they actually have gotten them right is of course an arguable > point, but we're unlikely to do better by ignoring all of those issues > and implementing whatever is most technically expedient. I agree that the focus isn't 100% optimal and that there are *loads* of issues we haven't event started to look at. But you need a point to start and extraction & apply seems to be a good one because you can actually test it without the other issues solved which is not really the case the other way round. Also its possible to plug in the newly built changeset extraction into existing solutions to make them more efficient while retaining most of their respective framework. So I disagree that thats the wrong part to start with. > >> You've got f
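One way to read the "just a call to some C function" idea is a fixed hook invoked from the utility-command path itself, at the same point an event trigger would fire, so that it cannot be reordered around or dropped from SQL. A minimal sketch with entirely invented names:

    /* Hypothetical hook, set by a replication plugin at load time. */
    typedef void (*ddl_capture_hook_type) (Node *parsetree,
                                           const char *command_tag);

    extern ddl_capture_hook_type ddl_capture_hook;

    /*
     * Conceptually, standard_ProcessUtility() would do:
     *
     *     if (ddl_capture_hook)
     *         ddl_capture_hook(parsetree, CreateCommandTag(parsetree));
     *
     * Being C code rather than a PL function, the hook can also write
     * whatever additional WAL it needs, which PL/pgSQL could not.
     */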
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs wrote: But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. >>> >>> OK, so your estimate is 1.5 people to do that. And if we have more >>> people, should they sit around doing nothing? >> >> Oh, give me a break. You're willfully missing my point. And to quote >> Fred Brooks, nine women can't make a baby in one month. > > No, I'm not. The question is not how quickly can N people achieve a > single thing, but how long will it take a few skilled people working > on carefully selected tasks that have few dependencies between them to > achieve something. The bottleneck is getting the design right, not writing the code. Selecting tasks for people to work on without an agreement on the design will not advance the process, unless we just accept whatever code you choose to right based on whatever design you happen to pick. > How exactly did you arrive at your conclusion? Why is yours right and > mine wrong? I estimated the amount of work that would be required to do this right and compared it to other large projects that have been successfully done in the past. I think you are looking at something on the order of magnitude of the Windows port, which took about four releases to become stable, or the SE-Linux project, which still isn't feature-complete. Even if it's only a HS-sized project, that took two releases, as did streaming replication. SSI got committed within one release cycle, but there were several years of design and advocacy work before any code was written, so that, too, was really a multi-year project. I'll confine my comments on the second part of the question to the observation that it is a bit early to know who is right and who is wrong, but the question could just as easily be turned on its head. > No, I have four people who had initial objections and who have not > commented on the fact that the points made are regrettably incorrect. I think Kevin addressed this point better than I can. Asserting something doesn't make it true, and you haven't offered any rational argument against the points that have been made, probably because there isn't one. We *cannot* justify steeling 100% of the available bit space for a feature that many people won't use and may not be enough to address the real requirement anyway. > Since at least 3 of the people making such comments did not attend the > full briefing meeting in Ottawa, I am not particularly surprised. > However, I do expect people that didn't come to the meeting to > recognise that they are likely to be missing information and to listen > closely, as I listen to them. Participation in the community development process is not contingent on having flown to Ottawa in May, or on having decided to spend that evening at your briefing meeting. Attributing to ignorance what is adequately explained by honest disagreement is impolite. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: > Simon Riggs wrote: > > This is not transaction metadata, it is WAL record metadata > > required for multi-master replication, see later point. > > We need to add information to every WAL record that is used as the > > source for generating LCRs. > If the origin ID of a transaction doesn't count as transaction > metadata (i.e., data about the transaction), what does? It may be a > metadata element about which you have special concerns, but it is > transaction metadata. You don't plan on supporting individual WAL > records within a transaction containing different values for origin > ID, do you? If not, why is it something to store in every WAL > record rather than once per transaction? That's not intended to be > a rhetorical question. It's definitely possible to store it per transaction (see the discussion around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com); it just makes the filtering via the originating node a considerably more complex thing. With our proposal you can do it without any complexity involved, on a low level. Storing it per transaction means you can only stream out the data to other nodes *after* fully reassembling the transaction. That's a pity, especially if we go for a design where the decoding happens in a proxy instance. Other metadata will not be needed on such a low level. I also have to admit that I am very hesitant to start developing some generic "transaction metadata" framework atm. That seems to be a good way to spend a good part of time in discussion and disagreeing. Imo that's something for later. > I think it's because you're still thinking > of the WAL stream as *the medium* for logical replication data > rather than *the source* of logical replication data. I don't think that's true. See the above referenced subthread for reasons why I think the origin id is important. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
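To make the trade-off concrete: if the origin were recorded only once per transaction (say, on the commit record), a sender that filters by origin has to buffer every record by xid until the commit arrives. The helpers below are hypothetical, and the sketch exists only to show where the extra state appears compared with the per-record field:

    /* Sketch: filtering when the origin is only known at commit time. */
    static void
    handle_record(XLogRecord *record)
    {
        if (!is_commit_record(record))          /* hypothetical test */
        {
            /* origin unknown yet: stash the record under its xid */
            buffer_append(record->xl_xid, record);
            return;
        }

        /* commit record carries the origin (hypothetical accessor) */
        if (commit_origin_id(record) != suppressed_origin)
            send_buffered_records(record->xl_xid);  /* stream only now */

        discard_buffer(record->xl_xid);
    }

With the origin on every record, none of that buffering is needed and records can be streamed out as soon as they are read.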
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund wrote: > Were not the only ones here that are performing scope creep though... I think > about all people who have posted in the whole thread except maybe Tom and > Marko are guilty of doing so. > > I still think its rather sensible to focus on exactly duplicated schemas in a > very first version just because that leaves out some of the complexity while > paving the road for other nice things. Well, I guess what I want to know is: what does focusing on exactly duplicated schemas mean? If it means we'll disable DDL for tables when we turn on replication, that's basically the Slony approach: when you want to make a DDL change, you have to quiesce replication, do it, and then resume replication. I would possibly be OK with that approach. If it means that we'll hope that the schemas are duplicated and start spewing garbage data when they're not, then I'm not definitely not OK with that approach. If it means using event triggers to keep the catalogs synchronized, then I don't think I don't think that's adequately robust. The user could add more event triggers that run before or after the ones the replication system adds, and then you are back to garbage decoding (or crashes). They could also modify the catalogs directly, although it's possible we don't care quite as much about that case (but on the other hand people do sometimes need to do it to solve real problems). Although I am 100% OK with pairing back the initial feature set - indeed, I strongly recommend it - I think that robustness is not a feature which can be left out in v1 and added in later. All the robustness has to be designed in at the start, or we will never have it. On the whole, I think we're spending far too much time talking about code and far too little time talking about what the overall design should look like. We are having a discussion about whether or not MMR should be supported by sticking a 16-bit node ID into every WAL record without having first decided whether we should support MMR, whether that requires node IDs, whether they should be integers, whether those integers should be 16 bits in size, whether they should be present in WAL, and whether or not the record header is the right place to put them. There's a right order in which to resolve those questions, and this isn't it. More generally, I think there is a ton of complexity that we're probably overlooking here in focusing in on specific coding details. I think the most interesting comment made to date is Steve Singer's observation that very little of Slony is concerned with changeset extraction or apply. Now, on the flip side, all of these patches seem to be concerned with changeset extraction and apply. That suggests that we're missing some pretty significant pieces somewhere in this design. I think those pieces are things like error recovery, fault tolerance, user interface design, and control logic. Slony has spent years trying to get those things right. Whether or not they actually have gotten them right is of course an arguable point, but we're unlikely to do better by ignoring all of those issues and implementing whatever is most technically expedient. >> You've got four people objecting to this patch now, all of whom happen >> to be committers. Whether or not MMR goes into core, who knows, but >> it doesn't seem that this patch is going to fly. > I find that a bit too early to say. Sure it won't fly exactly as proposed, but > hell, who cares? 
What I want to get in is a solution to the specific problem > the patch targets. At least you have, not sure about others, accepted that the > problem needs a solution. > We do not agree yet how that solution looks should like but thats not exactly > surprising as we started discussing the problem only a good day ago. Oh, no argument with any of that. I strongly object to the idea of shoving this patch through as-is, but I don't object to solving the problem in some other, more appropriate way. I think that won't look much like this patch, though; it will be some new patch. > If people agree that your proposed way of just one flag bit is the way to go > we will have to live with that. But thats different from saying the whole > thing is dead. I think you've convinced me that a single flag-bit is not enough, but I don't think you've convinced anyone that it belongs in the record header. >> As I would rather see this project >> succeed, I recommend that you don't do that. Both you and Andres seem >> to believe that MMR is a reasonable first target to shoot at, but I >> don't think anyone else - including the Slony developers who have >> commented on this issue - endorses that position. > I don't think we get full MMR into 9.3. What I am proposing is that we build > in the few pieces that are required to implement MMR *ontop* of whats > hopefully in 9.3. > And I think thats a realistic goal. I can't quite follow that sentence, but my general sense is that, wh
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs wrote: > Kevin Grittner wrote: >>> Heikki Linnakangas wrote: >> >>> I don't like the idea of adding the origin id to the record >>> header. It's only required in some occasions, and on some record >>> types. >> >> Right. > > Wrong, as explained. The point is not wrong; you are simply not responding to what is being said. You have not explained why an origin ID is required when there is no replication, or if there is master/slave logical replication, or there are multiple masters with non-overlapping primary keys replicating to a single table in a consolidated database, or each master replicates to all other masters directly, or any of various other scenarios raised on this thread. You've only explained why it's necessary for certain configurations of multi-master replication where all rows in a table can be updated on any of the masters. I understand that this is the configuration you find most interesting, at least for initial implementation. That does not mean that the other situations don't exist as use cases or should be not be considered in the overall design. I don't think there is anyone here who would not love to see this effort succeed, all the way to multi-master replication in the configuration you are emphasizing. What is happening is that people are expressing concerns about parts of the design which they feel are problematic, and brainstorming about possible alternatives. As I'm sure you know, fixing a design problem at this stage in development is a lot less expensive than letting the problem slide and trying to deal with it later. > It isn't true that this is needed only for some configurations of > multi-master, per discussion. I didn't get that out of the discussion; I saw a lot of cases mentioned as not needing it to which you simply did not respond. > This is not transaction metadata, it is WAL record metadata > required for multi-master replication, see later point. > > We need to add information to every WAL record that is used as the > source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. I think it's because you're still thinking of the WAL stream as *the medium* for logical replication data rather than *the source* of logical replication data. As long as the WAL stream is the medium, options are very constrained. You can code a very fast engine to handle a single type of configuration that way, and perhaps that should be a supported feature, but it's not a configuration I've needed yet. (Well, on reflection, if it had been available and easy to use, I can think of *one* time I *might* have used it for a pair of nodes.) It seems to me that you are so focused on this one use case that you are not considering how design choices which facilitate fast development of that use case paint us into a corner in terms of expanding to other use cases. >> I think removing origin ID from this patch and submitting a >> separate patch for a generalized transaction metadata system is >> the sensible way to go. > > We already have a very flexible WAL system for recording data of > interest to various resource managers. 
If you wish to annotate a > transaction, you can either generate a new kind of WAL record or > you can enhance a commit record. Right. Like many of us are suggesting should be done for origin ID. > XLOG_NOOP records can already be generated by your application if > you wish to inject additional metadata to the WAL stream. So no > changes are required for you to implement the generalised > transaction metadata scheme you say you require. I'm glad it's that easy. Are there SQL functions to for that yet? > Not sure how or why that relates to requirements for multi-master. That depends on whether you want to leave the door open to other logical replication than the one use case on which you are currently focused. I even consider some of those other cases multi-master, especially when multiple databases are replicating to a single table on another server. I'm not clear on your definition -- it seems to be rather more narrow. Maybe we need to define some terms somewhere to facilitate discussion. Is there a Wiki page where that would make sense? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
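For what it is worth, XLOG_NOOP is a real record type, but nothing currently exposes it at SQL level; injecting metadata that way means C code along these lines. This is an untested 9.2-era sketch, the helper name is made up, and the exact headers and XLogInsert() signature vary between releases:

    /* Sketch: attach arbitrary transaction metadata to the WAL stream
     * as an XLOG_NOOP record.  Replay ignores it; a logical decoder
     * could associate it with the surrounding transaction. */
    static XLogRecPtr
    log_transaction_metadata(const char *payload, int len)
    {
        XLogRecData rdata;

        rdata.data = (char *) payload;
        rdata.len = len;
        rdata.buffer = InvalidBuffer;   /* not tied to any data page */
        rdata.buffer_std = false;
        rdata.next = NULL;

        return XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);
    }

Exposing something like this as an SQL-callable function would be a separate, small patch; nothing of the sort exists today.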
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas wrote: > On 20.06.2012 11:17, Simon Riggs wrote: >> >> On 20 June 2012 15:45, Heikki Linnakangas >> wrote: >>> >>> On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas > > > And I'm worried > it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. >>> >>> >>> So, if the origin id is not sufficient for some conflict resolution >>> mechanisms, what extra information do you need for those, and where do >>> you >>> put it? >> >> >> As explained elsewhere, wal_level = logical (or similar) would be used >> to provide any additional logical information required. >> >> Update and Delete WAL records already need to be different in that >> mode, so additional info would be placed there, if there were any. >> >> In the case of reflexive updates you raised, a typical response in >> other DBMS would be to represent the query >> UPDATE SET counter = counter + 1 >> by sending just the "+1" part, not the current value of counter, as >> would be the case with the non-reflexive update >> UPDATE SET counter = 1 >> >> Handling such things in Postgres would require some subtlety, which >> would not be resolved in first release but is pretty certain not to >> require any changes to the WAL record header as a way of resolving it. >> Having already thought about it, I'd estimate that is a very long >> discussion and not relevant to the OT, but if you wish to have it >> here, I won't stop you. > > > Yeah, I'd like to hear briefly how you would handle that without any further > changes to the WAL record header. I already did: >> Update and Delete WAL records already need to be different in that >> mode, so additional info would be placed there, if there were any. The case you mentioned relates to UPDATEs only, so I would suggest that we add that information to a new form of update record only. That has nothing to do with the WAL record header. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
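To make the counter example concrete: under a hypothetical wal_level = logical, the extra information carried by a logical UPDATE record could distinguish "set this value" from "apply this delta", roughly as below. Nothing here reflects actual patch contents; it only illustrates why the record body, not the record header, is where such information belongs:

    /* Hypothetical per-column payload in a logical UPDATE record. */
    typedef enum
    {
        COL_SET_ABSOLUTE,   /* UPDATE ... SET counter = 1           -> "1"  */
        COL_APPLY_DELTA     /* UPDATE ... SET counter = counter + 1 -> "+1" */
    } ColChangeKind;

    typedef struct ColChange
    {
        AttrNumber    attno;
        ColChangeKind kind;
        Datum         value;    /* the absolute value, or the delta */
        bool          isnull;
    } ColChange;

Replaying COL_APPLY_DELTA against the replica's current value, instead of overwriting it, is what lets commutative updates from different origins merge without a conflict.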
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:42, Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: >> On 20 June 2012 21:19, Robert Haas wrote: >>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. >>> >>> If you're saying that single-master logical replication isn't useful, >>> I disagree. Of course, having both single-master and multi-master >>> replication together is even more useful. >> >>> But I think getting even >>> single-master logical replication working well in a single release >>> cycle is going to be a job and a half. >> >> OK, so your estimate is 1.5 people to do that. And if we have more >> people, should they sit around doing nothing? > > Oh, give me a break. You're willfully missing my point. And to quote > Fred Brooks, nine women can't make a baby in one month. No, I'm not. The question is not how quickly can N people achieve a single thing, but how long will it take a few skilled people working on carefully selected tasks that have few dependencies between them to achieve something. We have significantly more preparation, development time and resources than any other project previously performed for PostgreSQL, that I am aware of. Stating that it is impossible to perform a task in a certain period of time without even considering those points is clearly rubbish. I've arrived at my thinking based upon detailed project planning of what was possible in the time. How exactly did you arrive at your conclusion? Why is yours right and mine wrong? >>> Thinking that we're going to >>> get MMR in one release is not realistic. >> >> If you block it, then the above becomes true, whether or not it starts true. > > If I had no rational basis for my objections, that would be true. > You've got four people objecting to this patch now, all of whom happen > to be committers. Whether or not MMR goes into core, who knows, but > it doesn't seem that this patch is going to fly. No, I have four people who had initial objections and who have not commented on the fact that the points made are regrettably incorrect. I don't expect everybody commenting on the design to have perfect knowledge of the whole design, so I expect people to make errors in their comments. I also expect people to take note of what has been said before making further objections or drawing conclusions. Since at least 3 of the people making such comments did not attend the full briefing meeting in Ottawa, I am not particularly surprised. However, I do expect people that didn't come to the meeting to recognise that they are likely to be missing information and to listen closely, as I listen to them. "When the facts change, I change my mind. What do you do, sir?" > My main point in bringing this up is that if you pick a project that > is too large, you will fail. As I would rather see this project > succeed, I recommend that you don't do that. Both you and Andres seem > to believe that MMR is a reasonable first target to shoot at, but I > don't think anyone else - including the Slony developers who have > commented on this issue - endorses that position. At PGCon, you were > talking about getting a new set of features into PG over the next 3-5 > years. Now, it seems like you want to compress that timeline to a > year. I don't think that's going to work. 
You also requested that > people tell you sooner when large patches were in danger of not making > the release. Now I'm doing that, VERY early, and you're apparently > angry about it. If the only satisfactory outcome of this conversation > is that everyone agrees with the design pattern you've already decided > on, then you haven't left yourself very much room to walk away > satisfied. Note that I have already myself given review feedback to Andres and that change has visibly occurred during this thread via public debate. Claiming that I only stick to what has already been decided is patently false, with me at least. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund wrote: > >> If you do that, then, yes, > >> everything that you need to disentangle various network topologies > >> must be present in WAL. But what I'm saying is: don't do it like > >> that. Generate the LCRs just ONCE, at the origin node, and then pass > >> them around the network, applying them at every node. Then, the > >> information that is needed in WAL is confined to one bit: the > >> knowledge of whether or not a particular transaction is local (and > >> thus LCRs should be generated) or non-local (and thus they shouldn't, > >> because the origin already generated them and thus we're just handing > >> them around to apply everywhere). > > > > Sure, you can do it that way, but I don't think its a good idea. If you > > do it my way you *guarantee* that when replaying changes from node B on > > node C you have replayed changes from A at least as far as B has. Thats > > a really nice property for MM. > > You *can* get same with your solution but it starts to get complicated > > rather fast. While my/our proposed solution is trivial to implement. > > That's an interesting point. I agree that's a useful property, and > might be a reason not to just use a single-bit flag, but I still think > you'd be better off handling that requirement via some other method, > like logging the node ID once per transaction or something. That lets > you have as much metadata as you end up needing, which is a lot more > flexible than a 16-bit field, as Kevin, Heikki, and Tom have also > said. If it comes down to that I can definitely live with that. I still think making the filtering trivial so it can be done without any logic on a low level is a very desirable property but if not, so be it. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: > > On 20 June 2012 21:19, Robert Haas wrote: > >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > >>> The idea that logical rep is some kind of useful end goal in itself is > >>> slightly misleading. If the thought is to block multi-master > >>> completely on that basis, that would be a shame. Logical rep is the > >>> mechanism for implementing multi-master. > >> > >> If you're saying that single-master logical replication isn't useful, > >> I disagree. Of course, having both single-master and multi-master > >> replication together is even more useful. > >> > >> But I think getting even > >> single-master logical replication working well in a single release > >> cycle is going to be a job and a half. > >> Thinking that we're going to > >> get MMR in one release is not realistic. > > If you block it, then the above becomes true, whether or not it starts > > true. > My main point in bringing this up is that if you pick a project that > is too large, you will fail. Were not the only ones here that are performing scope creep though... I think about all people who have posted in the whole thread except maybe Tom and Marko are guilty of doing so. I still think its rather sensible to focus on exactly duplicated schemas in a very first version just because that leaves out some of the complexity while paving the road for other nice things. > You've got four people objecting to this patch now, all of whom happen > to be committers. Whether or not MMR goes into core, who knows, but > it doesn't seem that this patch is going to fly. I find that a bit too early to say. Sure it won't fly exactly as proposed, but hell, who cares? What I want to get in is a solution to the specific problem the patch targets. At least you have, not sure about others, accepted that the problem needs a solution. We do not agree yet how that solution looks should like but thats not exactly surprising as we started discussing the problem only a good day ago. If people agree that your proposed way of just one flag bit is the way to go we will have to live with that. But thats different from saying the whole thing is dead. > As I would rather see this project > succeed, I recommend that you don't do that. Both you and Andres seem > to believe that MMR is a reasonable first target to shoot at, but I > don't think anyone else - including the Slony developers who have > commented on this issue - endorses that position. I don't think we get full MMR into 9.3. What I am proposing is that we build in the few pieces that are required to implement MMR *ontop* of whats hopefully in 9.3. And I think thats a realistic goal. > At PGCon, you were > talking about getting a new set of features into PG over the next 3-5 > years. Now, it seems like you want to compress that timeline to a > year. Well, I obviously would like to be everything be done in a release, but I also would like to go hiking for a year, have a restored sailing boat and some more. That doesn't make it reality... To make it absolutely clear: I definitely don't think its realistic to have everything in 9.3. And I don't think Simon does so either. What I want is to have the basic building blocks in 9.3. The difference probably just is whats determined as a building block. 
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund wrote: >> If you do that, then, yes, >> everything that you need to disentangle various network topologies >> must be present in WAL. But what I'm saying is: don't do it like >> that. Generate the LCRs just ONCE, at the origin node, and then pass >> them around the network, applying them at every node. Then, the >> information that is needed in WAL is confined to one bit: the >> knowledge of whether or not a particular transaction is local (and >> thus LCRs should be generated) or non-local (and thus they shouldn't, >> because the origin already generated them and thus we're just handing >> them around to apply everywhere). > Sure, you can do it that way, but I don't think its a good idea. If you do it > my way you *guarantee* that when replaying changes from node B on node C you > have replayed changes from A at least as far as B has. Thats a really nice > property for MM. > You *can* get same with your solution but it starts to get complicated rather > fast. While my/our proposed solution is trivial to implement. That's an interesting point. I agree that's a useful property, and might be a reason not to just use a single-bit flag, but I still think you'd be better off handling that requirement via some other method, like logging the node ID once per transaction or something. That lets you have as much metadata as you end up needing, which is a lot more flexible than a 16-bit field, as Kevin, Heikki, and Tom have also said. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 06/19/2012 01:47 AM, Christopher Browne wrote: > That numbering scheme gets pretty anti-intuitive fairly quickly, > from whence we took the approach of having a couple digits > indicating data centre followed by a digit indicating which node in > that data centre. > > If that all sounds incoherent, well, the more nodes you have > around, the more difficult it becomes to make sure you *do* have a > coherent picture of your cluster. > > I recall the Slony-II project having a notion of attaching a > permanent UUID-based node ID to each node. As long as there is > somewhere decent to find a symbolically significant node "name," I > like the idea of the ID *not* being in a tiny range, and being > UUID/OID-like... Just as a sidenote, MySQL's new global transaction ids use a UUID for the serverid bit of it. [1] -- Hannu Valtonen [1] http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
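The MySQL scheme referenced above pairs a permanent per-server UUID with a monotonically increasing sequence number. A comparable shape for identifying a change globally, independent of any small integer node id, might look like this (purely illustrative, not a proposal from the thread):

    /* Hypothetical globally unique change identifier, GTID-like. */
    typedef struct GlobalChangeId
    {
        unsigned char node_uuid[16];  /* permanent identity of the origin node */
        XLogRecPtr    origin_lsn;     /* position in that node's own WAL */
    } GlobalChangeId;

The obvious cost is size: 24 bytes per change is far too large for a record header, which is one reason the thread keeps returning to a small local id plus a separate mapping to whatever permanent identifier a cluster uses.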
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 20:37, Kevin Grittner wrote: >> Heikki Linnakangas wrote: > >> I don't like the idea of adding the origin id to the record header. >> It's only required in some occasions, and on some record types. > > Right. Wrong, as explained. >> And I'm worried it might not even be enough in more complicated >> scenarios. >> >> Perhaps we need a more generic WAL record annotation system, where >> a plugin can tack arbitrary information to WAL records. The extra >> information could be stored in the WAL record after the rmgr >> payload, similar to how backup blocks are stored. WAL replay could >> just ignore the annotations, but a replication system could use it >> to store the origin id or whatever extra information it needs. > > Not only would that handle absolute versus relative updates and > origin id, but application frameworks could take advantage of such a > system for passing transaction metadata. I've held back on one > concern so far that I'll bring up now because this suggestion would > address it nicely. > > Our current trigger-driven logical replication includes a summary > which includes transaction run time, commit time, the transaction > type identifier, the source code line from which that transaction was > invoked, the user ID with which the user connected to the application > (which isn't the same as the database login), etc. Being able to > "decorate" a database transaction with arbitrary (from the DBMS POV) > metadata would be very valuable. In fact, our shop can't maintain > the current level of capabilities without *some* way to associate > such information with a transaction. > I think that using up the only unused space in the fixed header to > capture one piece of the transaction metadata needed for logical > replication, and that only in some configurations, is short-sighted. > If we solve the general problem of transaction metadata, this one > specific case will fall out of that. The proposal now includes flag bits that would allow the addition of a variable length header, should that ever become necessary. So the unused space in the fixed header is not being "used up" as you say. In any case, the fixed header still has 4 wasted bytes on 64bit systems even after the patch is applied. So this claim of short sightedness is just plain wrong. It isn't true that this is needed only for some configurations of multi-master, per discussion. This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. It is also possible to add this to HEAP and HEAP2 records, but doing that *will* bloat the WAL stream, whereas using the *currently wasted* bytes on a WAL record header does *not* bloat the WAL stream. > I think removing origin ID from this patch and submitting a separate > patch for a generalized transaction metadata system is the sensible > way to go. We already have a very flexible WAL system for recording data of interest to various resource managers. If you wish to annotate a transaction, you can either generate a new kind of WAL record or you can enhance a commit record. There are already unused flag bits on commit records for just such a purpose. XLOG_NOOP records can already be generated by your application if you wish to inject additional metadata to the WAL stream. So no changes are required for you to implement the generalised transaction metadata scheme you say you require. 
Not sure how or why that relates to requirements for multi-master. Please note that I've suggested review changes to Andres' work myself. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund wrote: > > One bit is fine if you have only very simple replication topologies. Once > > you think about globally distributed databases its a bit different. You > > describe some of that below, but just to reiterate: > > Imagine having 6 nodes, 3 on one of two continents (ABC in north america, > > DEF in europe). You may only want to have full intercontinental > > interconnect between two of those (say A and D). If you only have one > > bit to represent the origin thats not going to work because you won't be > > able discern the changes from BC on A from the changes from those > > originating on DEF. > > I don't see the problem. A certainly knows via which link the LCRs > arrived. > So: change happens on A. A sends the change to B, C, and D. B and C > apply the change. One bit is enough to keep them from regenerating > new LCRs that get sent back to A. So they're fine. D also receives > the changes (from A) and applies them, but it also does not need to > regenerate LCRs. Instead, it can take the LCRs that it has already > got (from A) and send those to E and F. > Or: change happens on B. B sends the changes to A. Since A knows the > network topology, it sends the changes to C and D. D sends them to E > and F. Nobody except B needs to *generate* LCRs. All any other node > needs to do is suppress *redundant* LCR generation. > > > Another topology which is interesting is circular replications (i.e. > > changes get shipped A->B, B->C, C->A) which is a sensible topology if > > you only have a low change rate and a relatively high number of nodes > > because you don't need the full combinatorial amount of connections. > > I think this one is OK too. You just generate LCRs on the origin node > and then pass them around the ring at every step. When the next hop > would be the origin node then you're done. > > I think you may be imagining that A generates LCRs and sends them to > B. B applies them, and then from the WAL just generated, it produces > new LCRs which then get sent to C. Yes, thats what I am proposing. > If you do that, then, yes, > everything that you need to disentangle various network topologies > must be present in WAL. But what I'm saying is: don't do it like > that. Generate the LCRs just ONCE, at the origin node, and then pass > them around the network, applying them at every node. Then, the > information that is needed in WAL is confined to one bit: the > knowledge of whether or not a particular transaction is local (and > thus LCRs should be generated) or non-local (and thus they shouldn't, > because the origin already generated them and thus we're just handing > them around to apply everywhere). Sure, you can do it that way, but I don't think its a good idea. If you do it my way you *guarantee* that when replaying changes from node B on node C you have replayed changes from A at least as far as B has. Thats a really nice property for MM. You *can* get same with your solution but it starts to get complicated rather fast. While my/our proposed solution is trivial to implement. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: > On 20 June 2012 21:19, Robert Haas wrote: >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: >>> The idea that logical rep is some kind of useful end goal in itself is >>> slightly misleading. If the thought is to block multi-master >>> completely on that basis, that would be a shame. Logical rep is the >>> mechanism for implementing multi-master. >> >> If you're saying that single-master logical replication isn't useful, >> I disagree. Of course, having both single-master and multi-master >> replication together is even more useful. > >> But I think getting even >> single-master logical replication working well in a single release >> cycle is going to be a job and a half. > > OK, so your estimate is 1.5 people to do that. And if we have more > people, should they sit around doing nothing? Oh, give me a break. You're willfully missing my point. And to quote Fred Brooks, nine women can't make a baby in one month. >> Thinking that we're going to >> get MMR in one release is not realistic. > > If you block it, then the above becomes true, whether or not it starts true. If I had no rational basis for my objections, that would be true. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. My main point in bringing this up is that if you pick a project that is too large, you will fail. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. At PGCon, you were talking about getting a new set of features into PG over the next 3-5 years. Now, it seems like you want to compress that timeline to a year. I don't think that's going to work. You also requested that people tell you sooner when large patches were in danger of not making the release. Now I'm doing that, VERY early, and you're apparently angry about it. If the only satisfactory outcome of this conversation is that everyone agrees with the design pattern you've already decided on, then you haven't left yourself very much room to walk away satisfied. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:19, Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: >> The idea that logical rep is some kind of useful end goal in itself is >> slightly misleading. If the thought is to block multi-master >> completely on that basis, that would be a shame. Logical rep is the >> mechanism for implementing multi-master. > > If you're saying that single-master logical replication isn't useful, > I disagree. Of course, having both single-master and multi-master > replication together is even more useful. > But I think getting even > single-master logical replication working well in a single release > cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? > Thinking that we're going to > get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. You may not want MMR, but others do. I see no reason to prevent people from having it, which is what you suggest. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > > The idea that logical rep is some kind of useful end goal in itself is > > slightly misleading. If the thought is to block multi-master > > completely on that basis, that would be a shame. Logical rep is the > > mechanism for implementing multi-master. > > If you're saying that single-master logical replication isn't useful, > I disagree. Of course, having both single-master and multi-master > replication together is even more useful. But I think getting even > single-master logical replication working well in a single release > cycle is going to be a job and a half. Thinking that we're going to > get MMR in one release is not realistic. The only way to make it > realistic is to put MMR ahead of every other goal that people have for > logical replication, including robustness and stability. It's > entirely premature to be designing features for MMR when we don't even > have the design for SMR nailed down yet. And that's even assuming we > EVER want MMR in core, which has not even really been argued, let > alone agreed. I agree it has not been agreed upon, but I would certainly consider submitting a prototype that implements it to be an argument for doing it ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > The idea that logical rep is some kind of useful end goal in itself is > slightly misleading. If the thought is to block multi-master > completely on that basis, that would be a shame. Logical rep is the > mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. Thinking that we're going to get MMR in one release is not realistic. The only way to make it realistic is to put MMR ahead of every other goal that people have for logical replication, including robustness and stability. It's entirely premature to be designing features for MMR when we don't even have the design for SMR nailed down yet. And that's even assuming we EVER want MMR in core, which has not even really been argued, let alone agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund wrote: > As I said before, I definitely agree that we want to have a separate transport > format once we have decoding nailed down. We still need to ship wal around if > the decoding happens in a different instance, but *after* that it can be > shipped in something more convenient/appropriate. Right, OK, we agree on this then. > One bit is fine if you have only very simple replication topologies. Once you > think about globally distributed databases its a bit different. You describe > some of that below, but just to reiterate: > Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF > in europe). You may only want to have full intercontinental interconnect > between two of those (say A and D). If you only have one bit to represent the > origin thats not going to work because you won't be able discern the changes > from BC on A from the changes from those originating on DEF. I don't see the problem. A certainly knows via which link the LCRs arrived. So: change happens on A. A sends the change to B, C, and D. B and C apply the change. One bit is enough to keep them from regenerating new LCRs that get sent back to A. So they're fine. D also receives the changes (from A) and applies them, but it also does not need to regenerate LCRs. Instead, it can take the LCRs that it has already got (from A) and send those to E and F. Or: change happens on B. B sends the changes to A. Since A knows the network topology, it sends the changes to C and D. D sends them to E and F. Nobody except B needs to *generate* LCRs. All any other node needs to do is suppress *redundant* LCR generation. > Another topology which is interesting is circular replications (i.e. changes > get shipped A->B, B->C, C->A) which is a sensible topology if you only have a > low change rate and a relatively high number of nodes because you don't need > the full combinatorial amount of connections. I think this one is OK too. You just generate LCRs on the origin node and then pass them around the ring at every step. When the next hop would be the origin node then you're done. I think you may be imagining that A generates LCRs and sends them to B. B applies them, and then from the WAL just generated, it produces new LCRs which then get sent to C. If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
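A rough sketch of the one-bit scheme described above, with made-up types; this is not the patch, just an illustration of the intended control flow. Applying remote LCRs flags the transaction as replication work, the LCR generator skips flagged transactions, and the node forwards the LCRs it already received instead of regenerating them from its own WAL.

/* Sketch of the one-bit scheme, with invented types: a node regenerates
 * LCRs only for transactions not flagged as replication apply, and
 * forwards already-received LCRs onward unchanged. */
#include <stdbool.h>
#include <stdio.h>

typedef struct Lcr { char origin; const char *change; } Lcr;

/* Apply an LCR received from a peer; the apply txn is flagged in WAL. */
static void apply_remote(Lcr lcr, bool *wal_is_replication_txn)
{
    *wal_is_replication_txn = true;       /* the single bit in WAL */
    printf("apply change \"%s\" (origin %c)\n", lcr.change, lcr.origin);
}

/* Decide whether local WAL should be turned into new LCRs. */
static bool should_generate_lcrs(bool wal_is_replication_txn)
{
    return !wal_is_replication_txn;       /* only local work produces LCRs */
}

int main(void)
{
    bool replication_txn = false;
    Lcr from_a = { 'A', "UPDATE t SET x = 1" };

    apply_remote(from_a, &replication_txn);
    if (!should_generate_lcrs(replication_txn))
        printf("do not regenerate; forward the LCR from A as-is to the next node\n");
    return 0;
}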
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
> Heikki Linnakangas wrote: > I don't like the idea of adding the origin id to the record header. > It's only required in some occasions, and on some record types. Right. > And I'm worried it might not even be enough in more complicated > scenarios. > > Perhaps we need a more generic WAL record annotation system, where > a plugin can tack arbitrary information to WAL records. The extra > information could be stored in the WAL record after the rmgr > payload, similar to how backup blocks are stored. WAL replay could > just ignore the annotations, but a replication system could use it > to store the origin id or whatever extra information it needs. Not only would that handle absolute versus relative updates and origin id, but application frameworks could take advantage of such a system for passing transaction metadata. I've held back on one concern so far that I'll bring up now because this suggestion would address it nicely. Our current trigger-driven logical replication includes a summary which includes transaction run time, commit time, the transaction type identifier, the source code line from which that transaction was invoked, the user ID with which the user connected to the application (which isn't the same as the database login), etc. Being able to "decorate" a database transaction with arbitrary (from the DBMS POV) metadata would be very valuable. In fact, our shop can't maintain the current level of capabilities without *some* way to associate such information with a transaction. I think that using up the only unused space in the fixed header to capture one piece of the transaction metadata needed for logical replication, and that only in some configurations, is short-sighted. If we solve the general problem of transaction metadata, this one specific case will fall out of that. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
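To make the annotation idea quoted above a bit more tangible, here is a purely hypothetical sketch in C. None of this matches the real XLogRecord definition; it only illustrates the notion of optional tagged data after the rmgr payload, which plain replay can skip and which a replication system or application framework could interpret.

/* Thought experiment: the fixed header stays unchanged, and optional
 * annotations (origin id, application metadata, ...) follow the payload,
 * each with a small tag/length prefix so replay can skip them.
 * Invented layout, not PostgreSQL code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct AnnotationHeader
{
    uint8_t  tag;      /* e.g. 1 = origin node id, 2 = application metadata */
    uint16_t length;   /* length of the annotation body in bytes */
} AnnotationHeader;

/* Conceptual layout of one record:
 *   [fixed record header][rmgr payload][annotation][annotation]...
 * Plain WAL replay reads header + payload and skips the rest; a logical
 * replication plugin walks the annotations it understands. */

static void skip_annotations(const uint8_t *p, const uint8_t *end)
{
    while (p + sizeof(AnnotationHeader) <= end)
    {
        AnnotationHeader ah;

        memcpy(&ah, p, sizeof(ah));
        printf("skipping annotation tag %u (%u bytes)\n",
               (unsigned) ah.tag, (unsigned) ah.length);
        p += sizeof(AnnotationHeader) + ah.length;
    }
}

int main(void)
{
    /* one fake annotation: tag 1 (origin id) with a two-byte body */
    AnnotationHeader ah = { 1, 2 };
    uint8_t buf[sizeof(AnnotationHeader) + 2] = { 0 };

    memcpy(buf, &ah, sizeof(ah));
    skip_annotations(buf, buf + sizeof(buf));
    return 0;
}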
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:44, Heikki Linnakangas wrote: > On 20.06.2012 11:34, Simon Riggs wrote: >> >> On 20 June 2012 16:23, Heikki Linnakangas >> wrote: >> >>> It's only needed for multi-master replication, where the same table can >>> be >>> updated from multiple nodes. Just leave that out for now. There's plenty >>> of >>> functionality and issues left even without that. >> >> >> Huh? Multi-master replication is what is being built here and many >> people want that. > > > Sure, but presumably you're going to implement master-slave first, and build > multi-master on top of that. What I'm saying is that we can leave out the > origin-id for now, since we don't have agreement on it, and revisit it after > master-slave replication is working. I am comfortable with the idea of deferring applying the patch, but I don't see any need to defer agreeing the patch is OK, so it can be applied easily later. It does beg the question of when exactly we would defer it to though. When would that be? If you have a reason for disagreement, please raise it now, having seen explanations/comments on various concerns. Of course, people have made initial objections, which is fine but its not reasonable to assume that such complaints continue to exist after. Perhaps there are other thoughts? The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. Deferring this could easily end up with a huge patch in last CF, and then it will be rejected/deferred. Patch submission here is following the requested process - as early as possible, production ready, small meaningful patches that build towards a final goal. This is especially true for format changes, which is why this patch is here now. Doing it differently just makes patch wrangling and review more difficult, which reduces overall quality and slows down development. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote: > On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote: > > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: > >>> Btw, what do you mean with "conflating" the stream? I don't really see > >>> that being proposed. > >> > >> It seems to me that you are intent on using the WAL stream as the > >> logical change stream. I think that's a bad design. Instead, you > >> should extract changes from WAL and then ship them around in a format > >> that is specific to logical replication. > > > > Yeah, that seems worth elaborating on. > > > > What has been said several times is that it's pretty necessary to > > capture the logical changes into WAL. That seems pretty needful, in > > order that the replication data gets fsync()ed avidly, and so that we > > don't add in the race condition of needing to fsync() something *else* > > almost exactly as avidly as is the case for WAL today.. > > Check. > > > But it's undesirable to pull *all* the bulk of contents of WAL around > > if it's only part of the data that is going to get applied. On a > > "physical streaming" replica, any logical data that gets captured will > > be useless. And on a "logical replica," they "physical" bits of WAL > > will be useless. > > > > What I *want* you to mean is that there would be: > > a) WAL readers that pull the "physical bits", and > > b) WAL readers that just pull "logical bits." > > > > I expect it would be fine to have a tool that pulls LCRs out of WAL to > > prepare that to be sent to remote locations. Is that what you have in > > mind? > Yes. I think it should be possible to generate LCRs from WAL, but I > think that the on-the-wire format for LCRs should be different from > the WAL format. Trying to use the same format for both things seems > like an unpleasant straightjacket. This discussion illustrates why: > we're talking about consuming scarce bit-space in WAL records for a > feature that only a tiny minority of users will use, and it's still > not really enough bit space. That stinks. If LCR transmission is a > separate protocol, this problem can be engineered away at a higher > level. As I said before, I definitely agree that we want to have a separate transport format once we have decoding nailed down. We still need to ship wal around if the decoding happens in a different instance, but *after* that it can be shipped in something more convenient/appropriate. > Suppose we have three servers, A, B, and C, that are doing > multi-master replication in a loop. A sends LCRs to B, B sends them > to C, and C sends them back to A. Obviously, we need to make sure > that each server applies each set of changes just once, but it > suffices to have enough information in WAL to distinguish between > replication transactions and non-replication transactions - that is, > one bit. So suppose a change is made on server A. A generates LCRs > from WAL, and tags each LCR with node_id = A. It then sends those > LCRs to B. B applies them, flagging the apply transaction in WAL as a > replication transaction, AND ALSO sends the LCRs to C. The LCR > generator on B sees the WAL from apply, but because it's flagged as a > replication transaction, it does not generate LCRs. So C receives > LCRs from B just once, without any need for the node_id to to be known > in WAL. C can now also apply those LCRs (again flagging the apply > transaction as replication) and it can also skip sending them to A, > because it seems that they originated at A. 
One bit is fine if you have only very simple replication topologies. Once you think about globally distributed databases it's a bit different. You describe some of that below, but just to reiterate: Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF in Europe). You may only want to have full intercontinental interconnect between two of those (say A and D). If you only have one bit to represent the origin that's not going to work because you won't be able to discern the changes from BC on A from the changes originating on DEF. Another topology which is interesting is circular replication (i.e. changes get shipped A->B, B->C, C->A), which is a sensible topology if you only have a low change rate and a relatively high number of nodes because you don't need the full combinatorial amount of connections. The origin_ids still only need to be meaningful in the local context, though. As described before, in the communication between the different nodes you can simply replace the 16-bit node id with some fancy UUID or such, and do the reverse when replaying LCRs. > Now suppose we have a more complex topology. Suppose we have a > cluster of four servers A .. D which, for improved tolerance against > network outages, are all connected pairwise. Normally all the links > are up, so each server sends all the LCRs it generates directly to all > other servers. But how do we prevent cycles?
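To make the last point concrete, here is a small, self-contained C sketch (not part of the patch; all names are invented) of how a node could keep its compact 16-bit origin_id purely local, translating it to a globally unique identifier such as a UUID when LCRs leave the node and back again when LCRs arrive:

/* Per-node mapping between the local 16-bit origin id and a global
 * identifier used on the wire.  Toy code, no overflow or persistence
 * handling. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_LOCAL_ORIGINS 16

typedef struct OriginMap
{
    uint16_t local_id;
    char     global_uuid[37];
} OriginMap;

static OriginMap origin_map[MAX_LOCAL_ORIGINS];
static int       n_origins;

static const char *to_global(uint16_t local_id)
{
    for (int i = 0; i < n_origins; i++)
        if (origin_map[i].local_id == local_id)
            return origin_map[i].global_uuid;
    return NULL;
}

static uint16_t to_local(const char *uuid)
{
    for (int i = 0; i < n_origins; i++)
        if (strcmp(origin_map[i].global_uuid, uuid) == 0)
            return origin_map[i].local_id;

    /* first time we hear from this node: assign the next free local id */
    OriginMap *m = &origin_map[n_origins];
    m->local_id = (uint16_t) (n_origins + 1);
    strncpy(m->global_uuid, uuid, 36);
    m->global_uuid[36] = '\0';
    n_origins++;
    return m->local_id;
}

int main(void)
{
    uint16_t d = to_local("6f1b7a3c-0000-4000-8000-00000000000d");  /* node D */

    printf("node D is local origin id %u\n", (unsigned) d);
    printf("shipping its changes out again as %s\n", to_global(d));
    return 0;
}

In a real system the mapping table would of course have to be persistent; the point is only that the wire format and the WAL format need not use the same identifier width.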
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:34, Simon Riggs wrote: On 20 June 2012 16:23, Heikki Linnakangas wrote: It's only needed for multi-master replication, where the same table can be updated from multiple nodes. Just leave that out for now. There's plenty of functionality and issues left even without that. Huh? Multi-master replication is what is being built here and many people want that. Sure, but presumably you're going to implement master-slave first, and build multi-master on top of that. What I'm saying is that we can leave out the origin-id for now, since we don't have agreement on it, and revisit it after master-slave replication is working. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas wrote: > It's only needed for multi-master replication, where the same table can be > updated from multiple nodes. Just leave that out for now. There's plenty of > functionality and issues left even without that. Huh? Multi-master replication is what is being built here and many people want that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:17, Simon Riggs wrote: On 20 June 2012 15:45, Heikki Linnakangas wrote: On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas And I'm worried it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it? As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. In the case of reflexive updates you raised, a typical response in other DBMS would be to represent the query UPDATE SET counter = counter + 1 by sending just the "+1" part, not the current value of counter, as would be the case with the non-reflexive update UPDATE SET counter = 1 Handling such things in Postgres would require some subtlety, which would not be resolved in first release but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you. Yeah, I'd like to hear briefly how you would handle that without any further changes to the WAL record header. Additional information required by logical information will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. If that's all we can discuss here is, and all other options are off the table, then I'll have to just outright object to this patch. Let's implement what we can without the origin id, and revisit this later. As explained, we can do nothing without the origin id. It is not optional or avoidable in the way you've described. It's only needed for multi-master replication, where the same table can be updated from multiple nodes. Just leave that out for now. There's plenty of functionality and issues left even without that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 15:45, Heikki Linnakangas wrote: > On 20.06.2012 10:32, Simon Riggs wrote: >> >> On 20 June 2012 14:40, Heikki Linnakangas >>> >>> And I'm worried >>> >>> it might not even be enough in more complicated scenarios. >> >> >> It is not the only required conflict mechanism, and has never been >> claimed to be so. It is simply one piece of information needed, at >> various times. > > > So, if the origin id is not sufficient for some conflict resolution > mechanisms, what extra information do you need for those, and where do you > put it? As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. In the case of reflexive updates you raised, a typical response in other DBMS would be to represent the query UPDATE SET counter = counter + 1 by sending just the "+1" part, not the current value of counter, as would be the case with the non-reflexive update UPDATE SET counter = 1 Handling such things in Postgres would require some subtlety, which would not be resolved in first release but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you. >>> Perhaps we need a more generic WAL record annotation system, where a >>> plugin >>> can tack arbitrary information to WAL records. The extra information >>> could >>> be stored in the WAL record after the rmgr payload, similar to how backup >>> blocks are stored. WAL replay could just ignore the annotations, but a >>> replication system could use it to store the origin id or whatever extra >>> information it needs. >> >> >> Additional information required by logical information will be handled >> by a new wal_level. >> >> The discussion here is about adding origin_node_id *only*, which needs >> to be added on each WAL record. > > > If that's all we can discuss here is, and all other options are off the > table, then I'll have to just outright object to this patch. Let's implement > what we can without the origin id, and revisit this later. As explained, we can do nothing without the origin id. It is not optional or avoidable in the way you've described. We have the choice to add the required information as a static or as a variable length addition to the WAL record header. Since there is no additional requirement for expansion of the header at this point and no measurable harm in doing so, I suggest we avoid the complexity and lack of robustness that a variable length header would cause. Of course, we can go there later if needed, but there is no current need. If a variable length header had been suggested, it is certain somebody would say that was overcooked and just do it as static. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
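For what it's worth, the "+1 versus =1" distinction can be modelled very simply; the following C fragment is only meant to illustrate why a delta-style change accumulates under conflict while an absolute assignment should be newest-update-wins. The types are invented and nothing here is a proposed WAL or LCR format.

/* Delta changes accumulate on conflict; absolute changes overwrite. */
#include <stdio.h>

typedef enum { CHANGE_DELTA, CHANGE_ABSOLUTE } ChangeKind;

typedef struct CounterChange
{
    ChangeKind kind;
    long       value;        /* the "+1" or the "10" */
} CounterChange;

static long apply_change(long current, CounterChange c)
{
    return (c.kind == CHANGE_DELTA) ? current + c.value : c.value;
}

int main(void)
{
    long counter = 5;

    /* Two nodes concurrently run "UPDATE ... SET counter = counter + 1":
     * shipping deltas lets both increments survive. */
    counter = apply_change(counter, (CounterChange){ CHANGE_DELTA, +1 });
    counter = apply_change(counter, (CounterChange){ CHANGE_DELTA, +1 });
    printf("after two accumulated increments: %ld\n", counter);   /* 7 */

    /* Taking inventory ("SET counter = 10") must not accumulate. */
    counter = apply_change(counter, (CounterChange){ CHANGE_ABSOLUTE, 10 });
    printf("after absolute reset: %ld\n", counter);                /* 10 */
    return 0;
}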
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas And I'm worried it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it? Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. Additional information required by logical information will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. If that's all we can discuss here is, and all other options are off the table, then I'll have to just outright object to this patch. Let's implement what we can without the origin id, and revisit this later. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 14:40, Heikki Linnakangas wrote: > The reason we need an origin id in this scenario is that otherwise this will > happen: > > 1. A row is updated on node A > 2. Node B receives the WAL record from A, and updates the corresponding row > in B. This generates a new WAL record. > 3. Node A receives the WAL record from B, and updates the rows again. This > again generates a new WAL record, which is replicated to A, and you loop > indefinitely. > > If each WAL record carries an origin id, node A can use it to refrain from > applying the WAL record it receives from B, which breaks the loop. > > However, note that in this simple scenario, if the logical log replay / > conflict resolution is smart enough to recognize that the row has already > been updated, because the old and the new rows are identical, the loop is > broken at step 3 even without the origin id. That works for the > newest-update-wins and similar strategies. So the origin id is not > absolutely necessary in this case. Including the origin id in the WAL allows us to filter out WAL records when we generate LCRs, so we can completely avoid step 3, including all of the CPU, disk and network overhead that implies. Simply put, we know the change came from A, so no need to send the change to A again. If we do allow step 3 to exist, we still need to send the origin id. This is because there is no apriori way to know the origin id is not required, such as the case where we have concurrent updates which is effectively a race condition between actions on separate nodes. Not sending the origin id because it is not required in some cases is equivalent to saying we can skip locking because a race condition does not happen in all cases. Making a case that the race condition is rare is still not a case for skipping locking. Same here: we need the information to avoid making errors in the general case. > Another interesting scenario is that you maintain a global counter, like in > an inventory system, and conflicts are resolved by accumulating the updates. > For example, if you do "UPDATE SET counter = counter + 1" simultaneously on > two nodes, the result is that the counter is incremented by two. The > avoid-update-if-already-identical optimization doesn't help in this case, > the origin id is necessary. > > Now, let's take the inventory system example further. There are actually two > ways to update a counter. One is when an item is checked in or out of the > warehouse, ie. "UPDATE counter = counter + 1". Those updates should > accumulate. But another operation resets the counter to a specific value, > "UPDATE counter = 10", like when taking an inventory. That should not > accumulate with other changes, but should be newest-update-wins. The origin > id is not enough for that, because by looking at the WAL record and the > origin id, you don't know which type of an update it was. Yes, of course. Conflict handling in the general case requires much additional work. This thread is one minor change of many related changes. The patches are being submitted in smaller chunks to ease review, so sometimes there are cross links between things. > So, I don't like the idea of adding the origin id to the record header. It's > only required in some occasions, and on some record types. That conclusion doesn't follow from your stated arguments. > And I'm worried > it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. 
It is simply one piece of information needed, at various times. > Perhaps we need a more generic WAL record annotation system, where a plugin > can tack arbitrary information to WAL records. The extra information could > be stored in the WAL record after the rmgr payload, similar to how backup > blocks are stored. WAL replay could just ignore the annotations, but a > replication system could use it to store the origin id or whatever extra > information it needs. Additional information required by logical replication will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. One question I raised in my review was whether this extra information should be added by a variable length header, so I already asked this very question. So far there is no evidence that the additional code complexity would be warranted. If it became so in the future, it can be modified again. At this stage there is no need, so the proposal is to add the field to every WAL record without regard to the setting of wal_level because there is no measurable downside to doing so. The downsides of additional complexity are clear and real however, so I wish to avoid them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
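The filtering described above - do not ship a change back to the node it originally came from - is what the origin id in the record is meant to enable. A toy C sketch, with invented names, of what that filter could look like on the sending side when the LCR stream for a particular peer is generated:

/* Build the outgoing stream for one peer, skipping records whose origin
 * already is that peer.  Illustrative only; field and function names are
 * invented. */
#include <stdint.h>
#include <stdio.h>

typedef struct WalRecord
{
    uint16_t    origin_id;    /* node the change originally came from */
    const char *payload;
} WalRecord;

static void stream_to_peer(const WalRecord *wal, int n, uint16_t peer_id)
{
    for (int i = 0; i < n; i++)
    {
        if (wal[i].origin_id == peer_id)
            continue;                  /* peer already has this change */
        printf("send to node %u: %s (origin %u)\n",
               (unsigned) peer_id, wal[i].payload, (unsigned) wal[i].origin_id);
    }
}

int main(void)
{
    enum { NODE_A = 1, NODE_B = 2 };
    WalRecord wal_on_b[] = {
        { NODE_A, "UPDATE applied from A" },   /* filtered out below */
        { NODE_B, "local INSERT on B"     },
    };

    stream_to_peer(wal_on_b, 2, NODE_A);
    return 0;
}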
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 08:35, Robert Haas wrote: >> I expect it would be fine to have a tool that pulls LCRs out of WAL to >> prepare that to be sent to remote locations. Is that what you have in >> mind? > > Yes. I think it should be possible to generate LCRs from WAL, but I > think that the on-the-wire format for LCRs should be different from > the WAL format. Trying to use the same format for both things seems > like an unpleasant straightjacket. You're confusing things here. Yes, we can use a different on-the-wire format. No problem. As I've already said, the information needs to be in WAL first before we can put it into LCRs, so the "don't use same format" argument is not relevant as to why the info must be on the WAL record in the first place. And as said elsewhere, doing that does not cause problems in any of these areas: wal bloat, performance, long term restrictions on numbers of nodeids, as has so far been claimed on this thread. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 01:27, Kevin Grittner wrote: Andres Freund wrote: Yes, that's definitely a valid use-case. But that doesn't preclude the other - also not uncommon - use-case where you want to have different masters which all contain up-to-date data. I agree. I was just saying that while one requires an origin_id, the other doesn't. And those not doing MM replication definitely don't need it. I think it would be helpful to list a few concrete examples of this. The stereotypical multi-master scenario is that you have a single table that's replicated to two servers, and you can insert/update/delete on either server. Conflict resolution strategies vary. The reason we need an origin id in this scenario is that otherwise this will happen: 1. A row is updated on node A 2. Node B receives the WAL record from A, and updates the corresponding row in B. This generates a new WAL record. 3. Node A receives the WAL record from B, and updates the rows again. This again generates a new WAL record, which is replicated to B, and you loop indefinitely. If each WAL record carries an origin id, node A can use it to refrain from applying the WAL record it receives from B, which breaks the loop. However, note that in this simple scenario, if the logical log replay / conflict resolution is smart enough to recognize that the row has already been updated, because the old and the new rows are identical, the loop is broken at step 3 even without the origin id. That works for the newest-update-wins and similar strategies. So the origin id is not absolutely necessary in this case. Another interesting scenario is that you maintain a global counter, like in an inventory system, and conflicts are resolved by accumulating the updates. For example, if you do "UPDATE SET counter = counter + 1" simultaneously on two nodes, the result is that the counter is incremented by two. The avoid-update-if-already-identical optimization doesn't help in this case, so the origin id is necessary. Now, let's take the inventory system example further. There are actually two ways to update a counter. One is when an item is checked in or out of the warehouse, i.e. "UPDATE counter = counter + 1". Those updates should accumulate. But another operation resets the counter to a specific value, "UPDATE counter = 10", like when taking an inventory. That should not accumulate with other changes, but should be newest-update-wins. The origin id is not enough for that, because by looking at the WAL record and the origin id, you don't know which type of update it was. So, I don't like the idea of adding the origin id to the record header. It's only required on some occasions, and on some record types. And I'm worried it might not even be enough in more complicated scenarios. Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
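A compact C model of steps 1-3 above, showing how an origin id breaks the loop on apply. This is a toy with invented names, not the actual replay code.

/* Each node applies an incoming change only if it did not originate
 * locally; a change that comes back to its origin is dropped. */
#include <stdint.h>
#include <stdio.h>

typedef struct Change
{
    uint16_t origin_id;
    int      new_value;
} Change;

/* Returns 1 if the change was applied and should be forwarded onward. */
static int maybe_apply(uint16_t local_node, int *row, Change c)
{
    if (c.origin_id == local_node)
        return 0;              /* came back to its origin: stop the loop */
    *row = c.new_value;
    return 1;
}

int main(void)
{
    enum { NODE_A = 1, NODE_B = 2 };
    int row_on_a = 0, row_on_b = 0;

    /* Step 1: the row is updated on A. */
    Change c = { NODE_A, 42 };
    row_on_a = c.new_value;

    /* Step 2: B receives the change from A and applies it. */
    if (maybe_apply(NODE_B, &row_on_b, c))
        printf("B applied the change, row = %d\n", row_on_b);

    /* Step 3: A receives the same change back from B and skips it. */
    if (!maybe_apply(NODE_A, &row_on_a, c))
        printf("A skipped its own change, no infinite loop\n");
    return 0;
}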
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 11:26, Tom Lane wrote: > "Kevin Grittner" writes: >> Simon Riggs wrote: >>> The proposal is to use WAL to generate the logical change stream. >>> That has been shown in testing to be around x4 faster than having >>> a separate change stream, which must also be WAL logged (as Jan >>> noted). > >> Sure, that's why I want it. > > I think this argument is basically circular. The reason it's 4x faster > is that the WAL stream doesn't actually contain all the information > needed to generate LCRs (thus all the angst about maintaining catalogs > in sync, what to do about unfriendly datatypes, etc). By the time the > dust has settled and you have a workable system, you will have bloated > WAL and given back a large chunk of that multiple, thereby invalidating > the design premise. Or at least that's my prediction. The tests were conducted with the additional field added, so your prediction is not verified. The additional fields do not bloat WAL records - they take up exactly the same space as before. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
"Kevin Grittner" writes: > Simon Riggs wrote: >> The proposal is to use WAL to generate the logical change stream. >> That has been shown in testing to be around x4 faster than having >> a separate change stream, which must also be WAL logged (as Jan >> noted). > Sure, that's why I want it. I think this argument is basically circular. The reason it's 4x faster is that the WAL stream doesn't actually contain all the information needed to generate LCRs (thus all the angst about maintaining catalogs in sync, what to do about unfriendly datatypes, etc). By the time the dust has settled and you have a workable system, you will have bloated WAL and given back a large chunk of that multiple, thereby invalidating the design premise. Or at least that's my prediction. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund wrote: > I definitely agree that low-level apply is possible as a module. Sure change > extraction needs core support but I was talking about what you need to > implement it reusing the "plain" logical support... > > What I do not understand is how you want to prevent loops in a simple manner > without in core support: > > A generates a HEAP_INSERT record. Gets decoded into the lcr stream as a INSERT > action. > B reads the lcr stream from A and applies the changes. A new HEAP_INSERT > record. Gets decoded into the lcr stream as a INSERT action. > A reads the lcr stream from B and ??? > > At this point you need to prevent a loop. If you have the information where a > change originally happened (xl_origin_id = A in this case) you can have the > simple filter on A which ignores change records if lcr_origin_id == > local_replication_origin_id). See my email to Chris Browne, which I think covers this. It needs a bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but not a whole node ID. >> You need a backend-local hash table inside the wal reader process, and >> that hash table needs to map XIDs to node IDs. And you occasionally >> need to prune it, so that it doesn't eat too much memory. None of >> that sounds very hard. > Its not very hard. Its just more complex than what I propose(d). True, but not a whole lot more complex, and a moderate amount of complexity to save bit-space is a good trade. Especially when Tom has come down against eating up the bit space. And I agree with him. If we've only got 16 bits of padding to work with, we should surely be judicious in burning them when it can be avoided at the expense of a few hundred lines of code. >> > Btw, what do you mean with "conflating" the stream? I don't really see >> > that being proposed. >> It seems to me that you are intent on using the WAL stream as the >> logical change stream. I think that's a bad design. Instead, you >> should extract changes from WAL and then ship them around in a format >> that is specific to logical replication. > No, I don't want that. I think we will need some different format once we have > agreed how changeset extraction works. I think you are saying that you agree with me that the formats should be different, but that the LCR format is undecided as yet. If that is in fact what you are saying, great. We'll need to decide that, of course, but I think there is a lot of cool stuff that can be done that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
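For illustration, the backend-local map Robert describes could be as simple as the following C sketch. The names are invented; a real implementation inside the WAL reader would presumably use the existing hash table infrastructure and prune entries on commit/abort or once per checkpoint cycle.

/* Remember, per XID, which node a replayed transaction came from, and
 * forget the entry once the transaction is known to be complete.  The
 * single-slot-per-bucket scheme here is a toy, not a real hash table. */
#include <stdint.h>
#include <stdio.h>

#define MAP_SLOTS 256

typedef uint32_t TransactionId;

typedef struct XidOriginEntry
{
    TransactionId xid;           /* 0 means the slot is free */
    uint16_t      origin_node;
} XidOriginEntry;

static XidOriginEntry xid_origin_map[MAP_SLOTS];

static void remember_origin(TransactionId xid, uint16_t node)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    e->xid = xid;
    e->origin_node = node;
}

static int lookup_origin(TransactionId xid, uint16_t *node)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    if (e->xid != xid)
        return 0;
    *node = e->origin_node;
    return 1;
}

/* Prune when the transaction finishes, to bound memory use. */
static void forget_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    if (e->xid == xid)
        e->xid = 0;
}

int main(void)
{
    uint16_t node;

    remember_origin(724, 42);              /* seen in a per-txn origin record */
    if (lookup_origin(724, &node))         /* later records of xid 724 */
        printf("xid 724 replays node %u, do not regenerate LCRs\n", (unsigned) node);
    forget_origin(724);                    /* transaction finished */
    return 0;
}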
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote: > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >>> Btw, what do you mean with "conflating" the stream? I don't really see that >>> being proposed. >> >> It seems to me that you are intent on using the WAL stream as the >> logical change stream. I think that's a bad design. Instead, you >> should extract changes from WAL and then ship them around in a format >> that is specific to logical replication. > > Yeah, that seems worth elaborating on. > > What has been said several times is that it's pretty necessary to > capture the logical changes into WAL. That seems pretty needful, in > order that the replication data gets fsync()ed avidly, and so that we > don't add in the race condition of needing to fsync() something *else* > almost exactly as avidly as is the case for WAL today.. Check. > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," they "physical" bits of WAL > will be useless. > > What I *want* you to mean is that there would be: > a) WAL readers that pull the "physical bits", and > b) WAL readers that just pull "logical bits." > > I expect it would be fine to have a tool that pulls LCRs out of WAL to > prepare that to be sent to remote locations. Is that what you have in > mind? Yes. I think it should be possible to generate LCRs from WAL, but I think that the on-the-wire format for LCRs should be different from the WAL format. Trying to use the same format for both things seems like an unpleasant straightjacket. This discussion illustrates why: we're talking about consuming scarce bit-space in WAL records for a feature that only a tiny minority of users will use, and it's still not really enough bit space. That stinks. If LCR transmission is a separate protocol, this problem can be engineered away at a higher level. Suppose we have three servers, A, B, and C, that are doing multi-master replication in a loop. A sends LCRs to B, B sends them to C, and C sends them back to A. Obviously, we need to make sure that each server applies each set of changes just once, but it suffices to have enough information in WAL to distinguish between replication transactions and non-replication transactions - that is, one bit. So suppose a change is made on server A. A generates LCRs from WAL, and tags each LCR with node_id = A. It then sends those LCRs to B. B applies them, flagging the apply transaction in WAL as a replication transaction, AND ALSO sends the LCRs to C. The LCR generator on B sees the WAL from apply, but because it's flagged as a replication transaction, it does not generate LCRs. So C receives LCRs from B just once, without any need for the node_id to to be known in WAL. C can now also apply those LCRs (again flagging the apply transaction as replication) and it can also skip sending them to A, because it seems that they originated at A. Now suppose we have a more complex topology. Suppose we have a cluster of four servers A .. D which, for improved tolerance against network outages, are all connected pairwise. Normally all the links are up, so each server sends all the LCRs it generates directly to all other servers. But how do we prevent cycles? A generates a change and sends it to B, C, and D. B then sees that the change came from A so it sends it to C and D. 
C, receiving that change, sees that it came from A via B, so it sends it to D again, whereupon D, which got it from C and knows that the origin is A, sends it to B, who will then send it right back over to D. Obviously, we've got an infinite loop here, so this topology will not work. However, there are several obvious ways to patch it by changing the LCR protocol. Most obviously, and somewhat stupidly, you could add a TTL. A bit smarter, you could have each LCR carry a LIST of node_ids that it had already visited, refusing to send it to any node it had already been to, instead of a single node_id. Smarter still, you could send handshaking messages around the cluster so that each node can build up a spanning tree and prefix each LCR it sends with the list of additional nodes to which the recipient must deliver it. So, normally, A would send a message to each of B, C, and D destined only for that node; but if the A-C link went down, A would choose either B or D and send each LCR to that node destined for that node *and C*; then, that node would forward the message on to C. Or perhaps you think this is too complex and not worth supporting anyway, and that might be true, but the point is that if you insist that all of the identifying information must be carried in WAL, you've pretty much ruled it out, because we are not going to put TTL fields, or lists of node IDs, or lists of destinations, in WAL. But there is no reason they can't be attached to LCRs.
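Of the protocol-level options listed above, the visited-node list is the easiest to picture. A small C sketch follows, purely illustrative and not a proposed wire format: each LCR accumulates the nodes it has passed through and is never forwarded to one of them again. A receiver would still need to discard duplicate deliveries (for example by origin node plus origin LSN), but the cycle is broken.

/* Each LCR carries the list of nodes it has already visited; a node never
 * forwards it to a node on that list.  Toy code with invented types. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

typedef struct Lcr
{
    char origin;
    char visited[MAX_NODES];
    int  n_visited;
    const char *change;
} Lcr;

static bool has_visited(const Lcr *lcr, char node)
{
    for (int i = 0; i < lcr->n_visited; i++)
        if (lcr->visited[i] == node)
            return true;
    return false;
}

static void forward(Lcr *lcr, char from, const char *peers)
{
    if (!has_visited(lcr, from))
        lcr->visited[lcr->n_visited++] = from;
    for (const char *p = peers; *p; p++)
    {
        if (has_visited(lcr, *p))
            continue;                      /* would create a cycle: skip */
        printf("%c forwards \"%s\" to %c\n", from, lcr->change, *p);
    }
}

int main(void)
{
    Lcr lcr = { 'A', { 'A' }, 1, "UPDATE t SET x = 1" };

    forward(&lcr, 'A', "BCD");   /* A sends to B, C and D              */
    forward(&lcr, 'B', "ACD");   /* B never sends back to A            */
    forward(&lcr, 'C', "ABD");   /* C only forwards to D, and so on    */
    return 0;
}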
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Yes, thats definitely a valid use-case. But that doesn't preclude > the other - also not uncommon - use-case where you want to have > different master which all contain up2date data. I agree. I was just saying that while one requires an origin_id, the other doesn't. And those not doing MM replication definitely don't need it. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote: > Simon Riggs wrote: > > If we use WAL in this way, multi-master implies that the data will > > *always* be in a loop. So in any configuration we must be able to > > tell difference between changes made by one node and another. > > Only if you assume that multi-master means identical databases all > replicating the same data to all the others. If I have 72 master > replicating non-conflicting data to one consolidated database, I > consider that to be multi-master, too. > ... > Of course, none of these databases have the same OID for any given > object, and there are numerous different schemas among the > replicating databases, so I need to get to table and column names > before the data is of any use to me. Yes, thats definitely a valid use-case. But that doesn't preclude the other - also not uncommon - use-case where you want to have different master which all contain up2date data. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs wrote: > The proposal is to use WAL to generate the logical change stream. > That has been shown in testing to be around x4 faster than having > a separate change stream, which must also be WAL logged (as Jan > noted). Sure, that's why I want it. > If we use WAL in this way, multi-master implies that the data will > *always* be in a loop. So in any configuration we must be able to > tell difference between changes made by one node and another. Only if you assume that multi-master means identical databases all replicating the same data to all the others. If I have 72 master replicating non-conflicting data to one consolidated database, I consider that to be multi-master, too. Especially if I have other types of databases replicating disjoint data to the same consolidated database, and the 72 sources have several databases replicating disjoint sets of data to them. We have about 1000 replications paths, none of which create a loop which can send data back to the originator or cause conflicts. Of course, none of these databases have the same OID for any given object, and there are numerous different schemas among the replicating databases, so I need to get to table and column names before the data is of any use to me. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote:
> > More seriously: Even if we don't put MM in core I think putting the basis
> > for it in core so that somebody can build such a solution reusing the
> > existing infrastructure is a sensible idea. Imo the only thing that
> > requires explicit support which is hard to add outside of core is
> > prevention of loops (aka this patch). Everything else should be doable
> > reusing the hopefully modular pieces.
> I don't think prevention of loops is hard to do outside of core
> either, unless you insist on tying "logical" replication so closely to
> WAL that anyone doing MMR is necessarily getting the change stream
> from WAL. In fact, I'd go so far as to say that the ONLY part of this
> that's hard to do outside of core is change extraction. Even
> low-level apply can be implemented as a loadable module.
I definitely agree that low-level apply is possible as a module. Sure, change extraction needs core support, but I was talking about what you need to implement it reusing the "plain" logical support...

What I do not understand is how you want to prevent loops in a simple manner without in-core support:

A: generates a HEAP_INSERT record; it gets decoded into the LCR stream as an INSERT action.
B: reads the LCR stream from A and applies the changes; the apply generates a new HEAP_INSERT record, which gets decoded into B's LCR stream as an INSERT action.
A: reads the LCR stream from B and ???

At this point you need to prevent a loop. If you have the information where a change originally happened (xl_origin_id = A in this case) you can have a simple filter on A which ignores change records if (lcr_origin_id == local_replication_origin_id).

> >> Right. If we decide we need this, and if we did decide to conflate
> >> the WAL stream, both of which I disagree with as noted above, then we
> >> still don't need it on every record. It would probably be sufficient
> >> for local transactions to do nothing at all (and we can implicitly
> >> assume that they have master node ID = local node ID) and transactions
> >> which are replaying remote changes to emit one record per XID per
> >> checkpoint cycle containing the remote node ID.
> >
> > Youve gone from a pretty trivial 150 line patch without any runtime/space
> > overhead to something *considerably* more complex in that case though.
> >
> > You suddently need to have relatively complex logic to remember which
> > information you got for a certain xid (and forget that information
> > afterwards) and whether you already logged that xid and you need to have
> > to decide about logging that information at multiple places.
> You need a backend-local hash table inside the wal reader process, and
> that hash table needs to map XIDs to node IDs. And you occasionally
> need to prune it, so that it doesn't eat too much memory. None of
> that sounds very hard.
It's not very hard. It's just more complex than what I propose(d).

> > Btw, what do you mean with "conflating" the stream? I don't really see
> > that being proposed.
> It seems to me that you are intent on using the WAL stream as the
> logical change stream. I think that's a bad design. Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.
No, I don't want that. I think we will need some different format once we have agreed how changeset extraction works.
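To spell out how little the origin-based filter above involves, a minimal sketch (illustrative names only, not the actual patch code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: when node A applies the LCR stream coming back
 * from B, it simply skips every change whose origin is A itself.
 */
static bool
should_apply_lcr(uint16_t lcr_origin_id,
                 uint16_t local_replication_origin_id)
{
    /* ignore changes that originally happened on this node */
    return lcr_origin_id != local_replication_origin_id;
}

With the origin travelling in the record itself, that comparison is essentially the whole loop-prevention logic on the apply side.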
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 00:11, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. The evidence from prototypes shown at PgCon was that using the WAL in this way was very efficient and that this level of efficiency is necessary to make it work in a practical manner. Postgres would not be the first database to follow this broad design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Agreed. In this case, the origin node id information needs to be available on the WAL record so we can generate the logical changes (LCRs). > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. Restricting the number of node ids is being done so that there is no impact on anybody not using this feature. In later implementations, it should be possible to support greater numbers of node ids by having a variable length header. But that is an unnecessary complication for a first release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:59, Christopher Browne wrote: > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," they "physical" bits of WAL > will be useless. Completely agree. What we're discussing here is the basic information content of a WAL record to enable the construction of suitable LCRs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:31, Kevin Grittner wrote: > I've done a lot of MM replication, > and so far have not had to use a topology which allowed loops. The proposal is to use WAL to generate the logical change stream. That has been shown in testing to be around x4 faster than having a separate change stream, which must also be WAL logged (as Jan noted). If we use WAL in this way, multi-master implies that the data will *always* be in a loop. So in any configuration we must be able to tell difference between changes made by one node and another. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > > On 13 June 2012 19:28, Andres Freund wrote: > >> This adds a new configuration parameter multimaster_node_id which > >> determines the id used for wal originating in one cluster. > > > > Looks good and it seems this aspect at least is commitable in this CF. > > > > Design decisions I think we need to review are > > > > * Naming of field. I think origin is the right term, borrowing from > > Slony. > > I have not read too deeply here, so maybe I am missing > some important detail here, but idea that users need > to coordinate a integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? That should be possible. I don't want to go there till we have the base stuff in but it shouldn't be too hard. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
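Purely to illustrate the kind of mapping being suggested (the names and the config format are invented; nothing like this exists in the patch): the UI could accept symbolic node names and translate them once into the small integer that is stored and transported.

#include <stdint.h>
#include <string.h>

/* Hypothetical: filled in from a config list such as "paris=1,london=2" */
typedef struct NodeNameMapping
{
    const char *name;
    uint16_t    node_id;
} NodeNameMapping;

static const NodeNameMapping node_map[] = {
    {"paris", 1},
    {"london", 2},
    {"tokyo", 3},
};

/* Resolve a symbolic node name to the internal 16-bit id; 0 = unknown. */
static uint16_t
resolve_node_name(const char *name)
{
    size_t      i;

    for (i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++)
    {
        if (strcmp(node_map[i].name, name) == 0)
            return node_map[i].node_id;
    }
    return 0;
}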
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:46, Robert Haas wrote: > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. The proposal is to read the WAL in order to generate LCRs. The information needs to be in the original data in order to allow it to be passed onwards. In a multi-master config there will be WAL records with many origin node ids at any time, so this information must be held at WAL record level not page level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >> Btw, what do you mean with "conflating" the stream? I don't really see that >> being proposed. > > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. Yeah, that seems worth elaborating on. What has been said several times is that it's pretty necessary to capture the logical changes into WAL. That seems pretty needful, in order that the replication data gets fsync()ed avidly, and so that we don't add in the race condition of needing to fsync() something *else* almost exactly as avidly as is the case for WAL today.. But it's undesirable to pull *all* the bulk of contents of WAL around if it's only part of the data that is going to get applied. On a "physical streaming" replica, any logical data that gets captured will be useless. And on a "logical replica," they "physical" bits of WAL will be useless. What I *want* you to mean is that there would be: a) WAL readers that pull the "physical bits", and b) WAL readers that just pull "logical bits." I expect it would be fine to have a tool that pulls LCRs out of WAL to prepare that to be sent to remote locations. Is that what you have in mind? Or are you feeling that the "logical bits" shouldn't get captured in WAL altogether, so we need to fsync() them into a different stream of files? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 19 June 2012 14:03, Tom Lane wrote: > "Every WAL record"? Why in heck would you attach it to every record? > Surely putting it in WAL page headers would be sufficient. We could > easily afford to burn a page switch (if not a whole segment switch) > when changing masters. This does appear to be a reasonable idea at first glance, since it seems that each node has just a single node id, but that is not the case. As we pass changes around we maintain the same origin id for a change, so there is a mix of origin node ids at the WAL record level, not the page level. The concept of originating node id is essentially same as that used in Slony. > I'm against the idea of eating any spare space we have in WAL record > headers for this purpose, anyway; there are likely to be more pressing > needs in future. Not sure what those pressing needs are, but I can't see any. What we are doing here is fairly important, just not as important as crash recovery. But then that has worked pretty much unchanged for some time now. I raised the possibility of having variable length headers, but there is no requirement for that yet. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote: > More seriously: Even if we don't put MM in core I think putting the basis for > it in core so that somebody can build such a solution reusing the existing > infrastructure is a sensible idea. Imo the only thing that requires explicit > support which is hard to add outside of core is prevention of loops (aka this > patch). Everything else should be doable reusing the hopefully modular pieces. I don't think prevention of loops is hard to do outside of core either, unless you insist on tying "logical" replication so closely to WAL that anyone doing MMR is necessarily getting the change stream from WAL. In fact, I'd go so far as to say that the ONLY part of this that's hard to do outside of core is change extraction. Even low-level apply can be implemented as a loadable module. >> Right. If we decide we need this, and if we did decide to conflate >> the WAL stream, both of which I disagree with as noted above, then we >> still don't need it on every record. It would probably be sufficient >> for local transactions to do nothing at all (and we can implicitly >> assume that they have master node ID = local node ID) and transactions >> which are replaying remote changes to emit one record per XID per >> checkpoint cycle containing the remote node ID. > Youve gone from a pretty trivial 150 line patch without any runtime/space > overhead to something *considerably* more complex in that case though. > > You suddently need to have relatively complex logic to remember which > information you got for a certain xid (and forget that information afterwards) > and whether you already logged that xid and you need to have to decide about > logging that information at multiple places. You need a backend-local hash table inside the wal reader process, and that hash table needs to map XIDs to node IDs. And you occasionally need to prune it, so that it doesn't eat too much memory. None of that sounds very hard. > Btw, what do you mean with "conflating" the stream? I don't really see that > being proposed. It seems to me that you are intent on using the WAL stream as the logical change stream. I think that's a bad design. Instead, you should extract changes from WAL and then ship them around in a format that is specific to logical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
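For illustration, the XID-to-origin bookkeeping described above could look roughly like this (plain C sketch with invented names; a real version inside the WAL reader would presumably use the backend's existing hash table machinery instead):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* stand-in for the real typedef */

typedef struct XidOriginEntry
{
    TransactionId xid;              /* xid of the apply transaction */
    uint16_t      origin_node;      /* node the applied changes came from */
    bool          valid;
} XidOriginEntry;

#define XID_ORIGIN_TABLE_SIZE 1024  /* illustrative fixed size */

static XidOriginEntry xid_origin_table[XID_ORIGIN_TABLE_SIZE];

/* Called when the reader sees the per-xid "remote node" record. */
static void
remember_xid_origin(TransactionId xid, uint16_t origin_node)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    e->xid = xid;
    e->origin_node = origin_node;
    e->valid = true;
}

/* Origin for a decoded record's xid; 0 means "local transaction". */
static uint16_t
lookup_xid_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    return (e->valid && e->xid == xid) ? e->origin_node : 0;
}

/* Called at commit/abort or a checkpoint boundary so entries get pruned. */
static void
forget_xid_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    if (e->xid == xid)
        e->valid = false;
}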
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:58, Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: >> On 13 June 2012 19:28, Andres Freund wrote: >>> This adds a new configuration parameter multimaster_node_id which determines >>> the id used for wal originating in one cluster. >> >> Looks good and it seems this aspect at least is commitable in this CF. >> >> Design decisions I think we need to review are >> >> * Naming of field. I think origin is the right term, borrowing from Slony. > > I have not read too deeply here, so maybe I am missing > some important detail here, but idea that users need > to coordinate a integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? Yes, that can be done. This is simply how the numbers are used internally, similar to oids. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > On 13 June 2012 19:28, Andres Freund wrote: >> This adds a new configuration parameter multimaster_node_id which determines >> the id used for wal originating in one cluster. > > Looks good and it seems this aspect at least is commitable in this CF. > > Design decisions I think we need to review are > > * Naming of field. I think origin is the right term, borrowing from Slony. I have not read too deeply here, so maybe I am missing some important detail here, but idea that users need to coordinate a integer config parameter globally does not sound too attractive to me. Why not limit integers to local storage only and map them to string idents on config, UI and transport? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Robert Haas wrote: >> Tom Lane wrote: >>> However, if we're dead set on doing it that way, let us put >>> information that is only relevant to logical replication records >>> into only the logical replication records. >> Right. If we decide we need this, and if we did decide to >> conflate the WAL stream, both of which I disagree with as noted >> above, then we still don't need it on every record. It would >> probably be sufficient for local transactions to do nothing at >> all (and we can implicitly assume that they have master node ID = >> local node ID) and transactions which are replaying remote >> changes to emit one record per XID per checkpoint cycle >> containing the remote node ID. > Youve gone from a pretty trivial 150 line patch without any > runtime/space overhead to something *considerably* more complex in > that case though. I think it might be worth it. I've done a lot of MM replication, and so far have not had to use a topology which allowed loops. Not only would you be reserving space in the WAL stream which was not useful for those not using MM replication, you would be reserving it when even many MM configurations would not need it. Now you could argue that the 16 bits you want to use are already there and are not yet used for anything; but there are two counter-arguments to that: you lose the opportunity to use them for something else, and you might want more than 16 bits. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > > Andres Freund writes: > >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > ... (If you are thinking > of something sufficiently high-level that merging could possibly work, > then it's not WAL, and we shouldn't be trying to make the WAL > representation cater for it.) > >> > >> Do you really see this as such a big problem? > > > > It looks suspiciously like "I have a hammer, therefore every problem > > must be a nail". I don't like the design concept of cramming logical > > replication records into WAL in the first place. > > Me, neither. I think it's necessary to try to find a way of > generating logical replication records from WAL. But once generated, > I think those records should form their own stream, independent of > WAL. If you take the contrary position that they should be included > in WAL, then when you filter the WAL stream down to just the records > of interest to logical replication, you end up with a WAL stream with > holes in it, which is one of the things that Andres listed as an > unresolved design problem in his original email. Yes. > Moreover, this isn't necessary at all for single-master replication, > or even multi-source replication where each table has a single master. > It's only necessary for full multi-master replication, which we have > no consensus to include in core, and even if we did have a consensus > to include it in core, it certainly shouldn't be the first feature we > design. Well, you can't blame a patch/prototype trying to implement what it claims to implement ;) More seriously: Even if we don't put MM in core I think putting the basis for it in core so that somebody can build such a solution reusing the existing infrastructure is a sensible idea. Imo the only thing that requires explicit support which is hard to add outside of core is prevention of loops (aka this patch). Everything else should be doable reusing the hopefully modular pieces. > > However, if we're dead set on doing it that way, let us put information > > that is only relevant to logical replication records into only the > > logical replication records. > Right. If we decide we need this, and if we did decide to conflate > the WAL stream, both of which I disagree with as noted above, then we > still don't need it on every record. It would probably be sufficient > for local transactions to do nothing at all (and we can implicitly > assume that they have master node ID = local node ID) and transactions > which are replaying remote changes to emit one record per XID per > checkpoint cycle containing the remote node ID. Youve gone from a pretty trivial 150 line patch without any runtime/space overhead to something *considerably* more complex in that case though. You suddently need to have relatively complex logic to remember which information you got for a certain xid (and forget that information afterwards) and whether you already logged that xid and you need to have to decide about logging that information at multiple places. Btw, what do you mean with "conflating" the stream? I don't really see that being proposed. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. Me, neither. I think it's necessary to try to find a way of generating logical replication records from WAL. But once generated, I think those records should form their own stream, independent of WAL. If you take the contrary position that they should be included in WAL, then when you filter the WAL stream down to just the records of interest to logical replication, you end up with a WAL stream with holes in it, which is one of the things that Andres listed as an unresolved design problem in his original email. Moreover, this isn't necessary at all for single-master replication, or even multi-source replication where each table has a single master. It's only necessary for full multi-master replication, which we have no consensus to include in core, and even if we did have a consensus to include it in core, it certainly shouldn't be the first feature we design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Right. If we decide we need this, and if we did decide to conflate the WAL stream, both of which I disagree with as noted above, then we still don't need it on every record. It would probably be sufficient for local transactions to do nothing at all (and we can implicitly assume that they have master node ID = local node ID) and transactions which are replaying remote changes to emit one record per XID per checkpoint cycle containing the remote node ID. > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. I completely agree. I think that, as Dan said upthread, having a 64 or 128 bit ID so that it can be generated automatically rather than configured by an administrator who must be careful not to duplicate node IDs in any pair of systems that could ever end up talking to each other would be a vast usability improvement. Perhaps systems A, B, and C are replicating to each other today, as are systems D and E. But now suppose that someone decides they want to replicate one table between A and D. Suddenly the node IDs have to be distinct where they didn't before, and now there's potentially a problem to hassle with that wouldn't have been an issue if the node IDs had been wide enough to begin with. It is not unusual for people to decide after-the-fact to begin replicating between machines where this wasn't originally anticipated and which may even be even be under different administrative control. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
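As a rough illustration of why a wide identifier space removes the coordination problem: with 64 bits a node can simply pick its own id at setup time with negligible collision risk. This is a sketch only; how such an id would actually be generated and stored is not part of the patch under discussion, and it assumes a platform with /dev/urandom.

#include <stdint.h>
#include <stdio.h>

/* Pick a node identifier at random instead of having an administrator
 * coordinate small integers across clusters; returns 0 on failure. */
static uint64_t
generate_node_id(void)
{
    uint64_t    id = 0;
    FILE       *f = fopen("/dev/urandom", "rb");

    if (f != NULL)
    {
        if (fread(&id, sizeof(id), 1, f) != 1)
            id = 0;
        fclose(f);
    }
    return id;
}

The cost, as noted elsewhere in the thread, is that carrying a wide id in every WAL record would enlarge the records, which is why the discussion keeps coming back to the 16 spare bits.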
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote: > Andres Freund writes: > > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > >>> ... (If you are thinking > >>> of something sufficiently high-level that merging could possibly work, > >>> then it's not WAL, and we shouldn't be trying to make the WAL > >>> representation cater for it.) > > > > Do you really see this as such a big problem? > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. There are - so far - no specific "logical replication records". Its a relatively minor amount of additional data under wal_level=logical for existing records. HEAP_UPDATE gets the old primary key on updates changing the pkey and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE| DELETE,HEAP2_MULTI_INSERT put their information in another XLogRecData block than the page to handle full page writes. Thats it. I can definitely understand hesitation about that, but I simply see no realistic way to solve the issues of existing replication solutions otherwise. Do you have a better idea to solve those than the above? Without significant complications of the backend code and without loads of additional writes going on? I *really* would like to hear them if you do. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. I found, and still do, the idea of having the origin_id in there rather elegant. If people prefer adding the same block to all of the above xlog records: I can live with that and will then do so. It makes some things more complicated, but its not too bad. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: >>> ... (If you are thinking >>> of something sufficiently high-level that merging could possibly work, >>> then it's not WAL, and we shouldn't be trying to make the WAL >>> representation cater for it.) > Do you really see this as such a big problem? It looks suspiciously like "I have a hammer, therefore every problem must be a nail". I don't like the design concept of cramming logical replication records into WAL in the first place. However, if we're dead set on doing it that way, let us put information that is only relevant to logical replication records into only the logical replication records. Saving a couple bytes in each such record is penny-wise and pound-foolish, I'm afraid; especially when you're nailing down hard, unexpansible limits at the very beginning of the development process in order to save those bytes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> Andres Freund writes:
> > On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
> >> ... (If you are thinking
> >> of something sufficiently high-level that merging could possibly work,
> >> then it's not WAL, and we shouldn't be trying to make the WAL
> >> representation cater for it.)
> >
> > The idea is that if youre replaying changes on node A originating from
> > node B you set the origin to *B* in the wal records that are generated
> > during that. So when B, in a bidirectional setup, replays the changes
> > that A has made it can simply ignore all changes which originated on
> > itself.
> This is most certainly not possible at the level of WAL.
Huh? This isn't used during normal crash-recovery replay. The information is used when decoding the wal into logical changes and applying those. It's just a common piece of information that's needed for a large number of records. Alternatively it could be added to all the records that need it, but that would smear the necessary logic - which is currently trivial - over more of the backend. And it would increase the actual size of wal, which this one did not.

> As I said above, we shouldn't be trying to shoehorn high level logical-
> replication commands into WAL streams. No good can come of confusing those
> concepts.
It's not doing anything high-level in there? All that patch does is embed one single piece of information in previously unused space.

I can follow the argument that you do not want *any* logical information in the wal. But as I said in the patchset-introducing email: I don't really see an alternative. Otherwise we would just duplicate all the locking/scalability issues of xlog as well as the amount of writes. Besides logging some more information when wal_level = logical in some particular records (HEAP_UPDATE|DELETE), and ensuring fpw's don't remove the record data in HEAP_(INSERT|UPDATE|DELETE) (patch 07/16), this is the only change that I really foresee being needed for doing the logical stuff.

Do you really see this as such a big problem?

Andres
--
Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote: >> ... (If you are thinking >> of something sufficiently high-level that merging could possibly work, >> then it's not WAL, and we shouldn't be trying to make the WAL >> representation cater for it.) > The idea is that if youre replaying changes on node A originating from node B > you set the origin to *B* in the wal records that are generated during that. > So when B, in a bidirectional setup, replays the changes that A has made it > can simply ignore all changes which originated on itself. This is most certainly not possible at the level of WAL. As I said above, we shouldn't be trying to shoehorn high level logical-replication commands into WAL streams. No good can come of confusing those concepts. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote: > Andres Freund writes: > > On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote: > >> "Every WAL record"? Why in heck would you attach it to every record? > >> Surely putting it in WAL page headers would be sufficient. > > > > The idea is that you can have cascading, circular and whatever > > replication topologies if you include the "logical origin" of a wal > > causing action into it. > > That is, if you have nodes A(1) and B(2) and a insert happens on A the > > wal records generated by that will get an xl_origin_id = 1 and when it > > will be decoded and replayed on B it will *also* get the id 1. Only when > > a change originally is generated on Bit will get xl_origin_id = 2. > > None of this explains to me why each individual WAL record would need to > carry a separate copy of the indicator. We don't have multiple masters > putting records into the same WAL file, nor does it seem to me that it > could possibly be workable to merge WAL streams. (If you are thinking > of something sufficiently high-level that merging could possibly work, > then it's not WAL, and we shouldn't be trying to make the WAL > representation cater for it.) The idea is that if youre replaying changes on node A originating from node B you set the origin to *B* in the wal records that are generated during that. So when B, in a bidirectional setup, replays the changes that A has made it can simply ignore all changes which originated on itself. That works rather simple & performant if you have a conflict avoidance scheme. For many scenarios you need to be able to separate locally generated changes and changes that have been replayed from another node. Without support of something like this this is really hard to achieve. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote: >> "Every WAL record"? Why in heck would you attach it to every record? >> Surely putting it in WAL page headers would be sufficient. > The idea is that you can have cascading, circular and whatever replication > topologies if you include the "logical origin" of a wal causing action into > it. > That is, if you have nodes A(1) and B(2) and a insert happens on A the wal > records generated by that will get an xl_origin_id = 1 and when it will be > decoded and replayed on B it will *also* get the id 1. Only when a change > originally is generated on Bit will get xl_origin_id = 2. None of this explains to me why each individual WAL record would need to carry a separate copy of the indicator. We don't have multiple masters putting records into the same WAL file, nor does it seem to me that it could possibly be workable to merge WAL streams. (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote: > On 12-06-18 07:30 AM, Andres Freund wrote: > > Hrmpf #666. I will go through through the series commit-by-commit again > > to make sure everything compiles again. Reordinging this late definitely > > wasn't a good idea... > > > > I pushed a rebased version with all those fixups (and removal of the > > zeroRecPtr patch). > > Where did you push that rebased version to? I don't see an attachment, > or an updated patch in the commitfest app and your repo at > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summa > ry hasn't been updated in 5 days. To the 2ndquadrant internal repo. Which strangely doesn't help you. *Headdesk*. Pushed to the correct repo and manually verified. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi,

On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
> Andres Freund writes:
> > On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
> >> What's the cost of going a lot higher? Because if one makes enough
> >> numerical space available, one can assign node identities without a
> >> coordinator, a massive decrease in complexity.
> >
> > It would increase the size of every wal record. We just have 16bit left
> > there by chance...
> "Every WAL record"? Why in heck would you attach it to every record?
> Surely putting it in WAL page headers would be sufficient. We could
> easily afford to burn a page switch (if not a whole segment switch)
> when changing masters.
The idea is that you can have cascading, circular and whatever replication topologies if you include the "logical origin" of a wal-causing action in it.

That is, if you have nodes A(1) and B(2) and an insert happens on A, the wal records generated by that will get an xl_origin_id = 1, and when it is decoded and replayed on B it will *also* get the id 1. Only when a change originally is generated on B will it get xl_origin_id = 2. That way you can easily have circular or hierarchical replication topologies including diamonds.

> I'm against the idea of eating any spare space we have in WAL record
> headers for this purpose, anyway; there are likely to be more pressing
> needs in future.
Every other solution to allowing this seems to be far more complicated than this, that's why I arrived at the conclusion that it's a good idea.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
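Schematically, the "16bit left there by chance" are alignment padding in the WAL record header, so the id can be carried without growing any record. The sketch below is a simplified illustration, not the real xlog.h declaration; field names and order are only indicative.

#include <stdint.h>

/*
 * Simplified sketch of a WAL record header. The last two bytes before the
 * aligned payload are currently unused padding; the patch's idea is to use
 * them as xl_origin_id, so the record size does not change at all.
 */
typedef struct SketchXLogRecord
{
    uint32_t    xl_crc;         /* CRC of the record */
    uint64_t    xl_prev;        /* pointer to the previous record */
    uint32_t    xl_xid;         /* transaction id */
    uint32_t    xl_tot_len;     /* total record length */
    uint32_t    xl_len;         /* length of rmgr data */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager */
    uint16_t    xl_origin_id;   /* node the change originated on;
                                 * previously unused padding */
} SketchXLogRecord;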
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 12:47:54 AM Christopher Browne wrote: > On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund wrote: > > Hi Simon, > > > > On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: > >> On 13 June 2012 19:28, Andres Freund wrote: > >> > This adds a new configuration parameter multimaster_node_id which > >> > determines the id used for wal originating in one cluster. > >> > >> Looks good and it seems this aspect at least is commitable in this CF. > > > > I think we need to agree on the parameter name. It currently is > > 'multimaster_node_id'. In the discussion with Steve we got to > > "replication_node_id". I don't particularly like either. > > > > Other suggestions? > I wonder if it should be origin_node_id? That is the term Slony uses. I think in the slony context its clear that thats related to replication. Less so in a general postgres guc. So maybe replication_origin_node_id? > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is > >> quite a lot. Do we need that many? I think we may have need for a few > >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > >> 8 bits. Even if we don't need them in this release, I'd like to have > >> them. If they remain unused after a few releases, we may choose to > >> redeploy some of them as additional nodeids in future. I don't foresee > >> complaints that 256 master nodes is too few anytime soon, so we can > >> defer that decision. > > > > I wished we had some flag bits available before as well. I find 256 nodes > > a pretty low value to start with though, 4096 sounds better though, so I > > would be happy with 4 flag bits. I think for cascading setups and such > > you want to add node ids for every node, not only masters... > Even though the number of nodes that can reasonably participate in > replication is likely to be not too terribly large, it might be good > to allow larger values, in case someone is keen on encoding something > descriptive in the node number. Well, having a large number space makes the wal records bigger which is not something I can see us getting through with. People have gone through considerable length to avoid that. Perhaps we can have a mapping system catalog at some point that includes additional infromation to each node id like a name and where its at wrt replication... > I recall the Slony-II project having a notion of attaching a permanent > UUID-based node ID to each node. As long as there is somewhere decent > to find a symbolically significant node "name," I like the idea of the > ID *not* being in a tiny range, and being UUID/OID-like... I think adding 14 bytes (16bytes of an ooid - 2 bytes available) is out of question... > > Any opinions from others on this? > > > >> * Do we want origin_id as a parameter or as a setting in pgcontrol? > >> IIRC we go to a lot of trouble elsewhere to avoid problems with > >> changing on/off parameter values. I think we need some discussion to > >> validate where that should live. > > > > Hm. I don't really forsee any need to have it in pg_control. What do you > > want to protect against with that? > > It would need to be changeable anyway, because otherwise it would need to > > become a parameter for initdb which would suck for anybody migrating to > > use replication at some point. > > > > Do you want to protect against problems in replication setups after > > changing the value? > In Slony, changing the node ID is Not Something That Is Done. 
The ID > is captured in *way* too many places to be able to have any hope of > updating it in a coordinated way. I should be surprised if it wasn't > similarly troublesome here. If you update you will need to reset consuming nodes, yes. Imo thats still something else than disallowing changing the parameter entirely. Requiring initdb for that seems like it would make experimentation too hard. We need to allow at least changing the setting from no node id to an initial one. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote: >> What's the cost of going a lot higher? Because if one makes enough >> numerical space available, one can assign node identities without a >> coordinator, a massive decrease in complexity. > It would increase the size of every wal record. We just have 16bit left there > by chance... "Every WAL record"? Why in heck would you attach it to every record? Surely putting it in WAL page headers would be sufficient. We could easily afford to burn a page switch (if not a whole segment switch) when changing masters. I'm against the idea of eating any spare space we have in WAL record headers for this purpose, anyway; there are likely to be more pressing needs in future. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 11:50 AM, Andres Freund wrote: Hi Simon, I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to "replication_node_id". I don't particularly like either. Other suggestions? Other things that come to mind (for naming this parameter in the postgresql.conf) node_id origin_node_id local_node_id I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? 256 sounds a bit low to me as well. Sometimes the use case of a retail chain comes up where people want each store to have a postgresql instance and replicate back to a central office. I can think of many chains with more than 256 stores. Thanks, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 07:30 AM, Andres Freund wrote: Hrmpf #666. I will go through through the series commit-by-commit again to make sure everything compiles again. Reordinging this late definitely wasn't a good idea... I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch). Where did you push that rebased version to? I don't see an attachment, or an updated patch in the commitfest app and your repo at http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary hasn't been updated in 5 days. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund wrote: > Hi Simon, > > On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: >> On 13 June 2012 19:28, Andres Freund wrote: >> > This adds a new configuration parameter multimaster_node_id which >> > determines the id used for wal originating in one cluster. >> >> Looks good and it seems this aspect at least is commitable in this CF. > I think we need to agree on the parameter name. It currently is > 'multimaster_node_id'. In the discussion with Steve we got to > "replication_node_id". I don't particularly like either. > > Other suggestions? I wonder if it should be origin_node_id? That is the term Slony uses. >> Design decisions I think we need to review are >> >> * Naming of field. I think origin is the right term, borrowing from Slony. > I think it fits as well. > >> * Can we add the origin_id dynamically to each WAL record? Probably no >> need, but lets consider why and document that. > Not sure what you mean? Its already set in XLogInsert to > current_replication_origin_id which defaults to the value of the guc? > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is >> quite a lot. Do we need that many? I think we may have need for a few >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe >> 8 bits. Even if we don't need them in this release, I'd like to have >> them. If they remain unused after a few releases, we may choose to >> redeploy some of them as additional nodeids in future. I don't foresee >> complaints that 256 master nodes is too few anytime soon, so we can >> defer that decision. > I wished we had some flag bits available before as well. I find 256 nodes a > pretty low value to start with though, 4096 sounds better though, so I would > be happy with 4 flag bits. I think for cascading setups and such you want to > add node ids for every node, not only masters... Even though the number of nodes that can reasonably participate in replication is likely to be not too terribly large, it might be good to allow larger values, in case someone is keen on encoding something descriptive in the node number. If you restrict the number to a tiny range, then you'll be left wanting some other mapping. At one point, I did some work trying to get a notion of named nodes implemented in Slony; gave up on it, as the coordination process was wildly too bug-prone. In our environment, at Afilias, we have used quasi-symbolic node numbers that encoded something somewhat meaningful about the environment. That seems better to me than the risky "kludge" of saying: - The first node I created is node #1 - The second one is node #2. - The third and fourth are #3 and #4 - I dropped node #2 due to a problem, and thus the "new node 2" is #5. That numbering scheme gets pretty anti-intuitive fairly quickly, from whence we took the approach of having a couple digits indicating data centre followed by a digit indicating which node in that data centre. If that all sounds incoherent, well, the more nodes you have around, the more difficult it becomes to make sure you *do* have a coherent picture of your cluster. I recall the Slony-II project having a notion of attaching a permanent UUID-based node ID to each node. As long as there is somewhere decent to find a symbolically significant node "name," I like the idea of the ID *not* being in a tiny range, and being UUID/OID-like... > Any opinions from others on this? > >> * Do we want origin_id as a parameter or as a setting in pgcontrol? 
>> IIRC we go to a lot of trouble elsewhere to avoid problems with >> changing on/off parameter values. I think we need some discussion to >> validate where that should live. > Hm. I don't really forsee any need to have it in pg_control. What do you want > to protect against with that? > It would need to be changeable anyway, because otherwise it would need to > become a parameter for initdb which would suck for anybody migrating to use > replication at some point. > > Do you want to protect against problems in replication setups after changing > the value? In Slony, changing the node ID is Not Something That Is Done. The ID is captured in *way* too many places to be able to have any hope of updating it in a coordinated way. I should be surprised if it wasn't similarly troublesome here. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote: > On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund wrote: > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is > >> quite a lot. Do we need that many? I think we may have need for a few > >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > >> 8 bits. Even if we don't need them in this release, I'd like to have > >> them. If they remain unused after a few releases, we may choose to > >> redeploy some of them as additional nodeids in future. I don't foresee > >> complaints that 256 master nodes is too few anytime soon, so we can > >> defer that decision. > > > > I wished we had some flag bits available before as well. I find 256 nodes > > a pretty low value to start with though, 4096 sounds better though, so I > > would be happy with 4 flag bits. I think for cascading setups and such > > you want to add node ids for every node, not only masters... > > > > Any opinions from others on this? > > What's the cost of going a lot higher? Because if one makes enough > numerical space available, one can assign node identities without a > coordinator, a massive decrease in complexity. It would increase the size of every wal record. We just have 16bit left there by chance... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund wrote: >> * Size of field. 16 bits is enough for 32,000 master nodes, which is >> quite a lot. Do we need that many? I think we may have need for a few >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe >> 8 bits. Even if we don't need them in this release, I'd like to have >> them. If they remain unused after a few releases, we may choose to >> redeploy some of them as additional nodeids in future. I don't foresee >> complaints that 256 master nodes is too few anytime soon, so we can >> defer that decision. > I wished we had some flag bits available before as well. I find 256 nodes a > pretty low value to start with though, 4096 sounds better though, so I would > be happy with 4 flag bits. I think for cascading setups and such you want to > add node ids for every node, not only masters... > > Any opinions from others on this? What's the cost of going a lot higher? Because if one makes enough numerical space available, one can assign node identities without a coordinator, a massive decrease in complexity. -- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi Simon, On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: > On 13 June 2012 19:28, Andres Freund wrote: > > This adds a new configuration parameter multimaster_node_id which > > determines the id used for wal originating in one cluster. > > Looks good and it seems this aspect at least is committable in this CF. I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to "replication_node_id". I don't particularly like either. Other suggestions? > Design decisions I think we need to review are > > * Naming of field. I think origin is the right term, borrowing from Slony. I think it fits as well. > * Can we add the origin_id dynamically to each WAL record? Probably no > need, but let's consider why and document that. Not sure what you mean? It's already set in XLogInsert to current_replication_origin_id, which defaults to the value of the GUC? > * Size of field. 16 bits is enough for 32,000 master nodes, which is > quite a lot. Do we need that many? I think we may have need for a few > flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > 8 bits. Even if we don't need them in this release, I'd like to have > them. If they remain unused after a few releases, we may choose to > redeploy some of them as additional nodeids in future. I don't foresee > complaints that 256 master nodes is too few anytime soon, so we can > defer that decision. I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? > * Do we want origin_id as a parameter or as a setting in pgcontrol? > IIRC we go to a lot of trouble elsewhere to avoid problems with > changing on/off parameter values. I think we need some discussion to > validate where that should live. Hm. I don't really foresee any need to have it in pg_control. What do you want to protect against with that? It would need to be changeable anyway, because otherwise it would need to become a parameter for initdb which would suck for anybody migrating to use replication at some point. Do you want to protect against problems in replication setups after changing the value? > * Is there any overhead from CRC of WAL record because of this? I'd > guess not, but just want to double check thinking. I cannot imagine that there is. The actual size of the record didn't change because of alignment padding (both on 32 and 64 bit systems). > Presumably there is no issue wrt Heikki's WAL changes? I assume not, > but ask since I know you're reviewing that also. It might clash minimally because of offset changes or such, but otherwise there shouldn't be much. Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
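The "alignment padding" point can be illustrated with a toy layout: the pre-patch record header ends its fixed-width fields on a 2-byte boundary, so the compiler inserts two padding bytes before the next 4-byte-aligned field, and a uint16 origin id can simply occupy that hole. The structs below only approximate the real XLogRecord of that era and are not copied from the patch.

#include <stdint.h>

typedef struct RecordHdrWithoutOrigin
{
    uint32_t xl_tot_len;    /* total length of entire record */
    uint32_t xl_xid;        /* xact id */
    uint32_t xl_len;        /* length of rmgr data */
    uint8_t  xl_info;       /* flag bits */
    uint8_t  xl_rmid;       /* resource manager id */
    /* two bytes of implicit padding end up here */
    uint32_t xl_prev_hi;    /* stand-ins for the two halves of XLogRecPtr */
    uint32_t xl_prev_lo;
    uint32_t xl_crc;        /* CRC of the record */
} RecordHdrWithoutOrigin;

typedef struct RecordHdrWithOrigin
{
    uint32_t xl_tot_len;
    uint32_t xl_xid;
    uint32_t xl_len;
    uint8_t  xl_info;
    uint8_t  xl_rmid;
    uint16_t xl_origin_id;  /* reuses the former padding bytes */
    uint32_t xl_prev_hi;
    uint32_t xl_prev_lo;
    uint32_t xl_crc;
} RecordHdrWithOrigin;

/* both layouts come out at 28 bytes on common 32- and 64-bit ABIs */
_Static_assert(sizeof(RecordHdrWithOrigin) == sizeof(RecordHdrWithoutOrigin),
               "origin id fits into the existing alignment padding");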
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 13 June 2012 19:28, Andres Freund wrote: > This adds a new configuration parameter multimaster_node_id which determines > the id used for wal originating in one cluster. Looks good and it seems this aspect at least is committable in this CF. Design decisions I think we need to review are * Naming of field. I think origin is the right term, borrowing from Slony. * Can we add the origin_id dynamically to each WAL record? Probably no need, but let's consider why and document that. * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional nodeids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. * Do we want origin_id as a parameter or as a setting in pgcontrol? IIRC we go to a lot of trouble elsewhere to avoid problems with changing on/off parameter values. I think we need some discussion to validate where that should live. * Is there any overhead from CRC of WAL record because of this? I'd guess not, but just want to double check thinking. Presumably there is no issue wrt Heikki's WAL changes? I assume not, but ask since I know you're reviewing that also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
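On the "add the origin_id dynamically" question, the mechanism described in the thread is that the id is stamped into every record at insert time from a process-global variable, which apply code temporarily overrides while replaying changes that originated elsewhere. A hedged sketch of that flow follows; the identifiers current_replication_origin_id, RepNodeId, InvalidMultimasterNodeId and xl_origin_id come from the thread, while the surrounding helper code is invented for illustration.

#include <stdint.h>

typedef uint16_t RepNodeId;

#define InvalidMultimasterNodeId 0

/* normally the value of the node-id GUC; apply code overrides it while
 * replaying a remote transaction so the resulting WAL carries the
 * remote origin rather than the local one */
RepNodeId current_replication_origin_id = InvalidMultimasterNodeId;

typedef struct XLogRecordStub
{
    /* ... other header fields elided ... */
    RepNodeId xl_origin_id;
} XLogRecordStub;

/* in the real patch this assignment lives inside XLogInsert */
static void
stamp_origin(XLogRecordStub *rec)
{
    rec->xl_origin_id = current_replication_origin_id;
}

/* apply-side usage: wrap replay of a remote transaction */
static void
apply_remote_transaction(RepNodeId remote_origin, void (*replay)(void))
{
    RepNodeId saved = current_replication_origin_id;

    current_replication_origin_id = remote_origin;
    replay();                   /* WAL emitted here is tagged with remote_origin */
    current_replication_origin_id = saved;
}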
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 02:43:26 AM Steve Singer wrote: > On 12-06-13 01:27 PM, Andres Freund wrote: > > The previous mail contained a patch with a mismerge caused by reording > > commits. Corrected version attached. > > > > Thanks to Steve Singer for noticing this quickly. > > Attached is a more complete review of this patch. > > I agree that we will need to identify the node a change originated at. > We will not only want this for multi-master support but it might also be > very helpful once we introduce things like cascaded replicas. Using a 16 > bit integer for this purpose makes sense to me. Good. > This patch (with the previous numbered patches already applied), still > doesn't compile. > > gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith > -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute > -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include > -D_GNU_SOURCE -c -o xact.o xact.c > xact.c: In function 'xact_redo_commit': > xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn' > make[4]: *** [xact.o] Error 1 > > Your complete patch set did compile. origin_lsn gets added as part of > your 12th patch. Managing so many related patches is going to be a > pain, but it beats one big patch. I don't think this patch actually > requires the origin_lsn change. Hrmpf #666. I will go through the series commit-by-commit again to make sure everything compiles again. Reordering this late definitely wasn't a good idea... I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch). > > Code Review > - > src/backend/utils/misc/guc.c > @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] = > }, > > { > +{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER, > +gettext_noop("node id for multimaster."), > +NULL > +}, > + &guc_replication_origin_id, > +InvalidMultimasterNodeId, InvalidMultimasterNodeId, > MaxMultimasterNodeId, > +NULL, assign_replication_node_id, NULL > +}, > > I'd rather see us refer to this as the 'node id for logical replication' > over the multimaster node id. I think that terminology will be less > controversial. You're right. 'replication_node_id' or such should be ok? > BootStrapXLOG in xlog.c > creates an XLogRecord structure; should it set xl_origin_id to > InvalidMultimasterNodeId? > WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a > well-defined value. I think InvalidMultimasterNodeId should be safe > even for a no-op record from a node that actually has a node_id set > on real records. Good catches. > backend/replication/logical/logical.c: > XLogRecPtr current_replication_origin_lsn = {0, 0}; > > This variable isn't used/referenced by this patch; it probably belongs as > part of the later patch. Yea, just as the usage of origin_lsn in the above compile failure. -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
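Putting the rename and Steve's wording suggestion together, the ConfigureNamesInt[] entry quoted above would end up looking roughly like this. This is a sketch of a possible revision, not a committed change; the limits and the assign hook are carried over unchanged from the posted patch.

    {
        {"replication_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
            gettext_noop("Sets the node id for logical replication."),
            NULL
        },
        &guc_replication_origin_id,
        InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
        NULL, assign_replication_node_id, NULL
    },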
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-13 01:27 PM, Andres Freund wrote: The previous mail contained a patch with a mismerge caused by reording commits. Corrected version attached. Thanks to Steve Singer for noticing this quickly. Attached is a more complete review of this patch. I agree that we will need to identify the node a change originated at. We will not only want this for multi-master support but it might also be very helpful once we introduce things like cascaded replicas. Using a 16 bit integer for this purpose makes sense to me. This patch (with the previous numbered patches already applied), still doesn't compile. gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include -D_GNU_SOURCE -c -o xact.o xact.c xact.c: In function 'xact_redo_commit': xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn' make[4]: *** [xact.o] Error 1 Your complete patch set did compile. origin_lsn gets added as part of your 12th patch. Managing so many related patches is going to be a pain, but it beats one big patch. I don't think this patch actually requires the origin_lsn change. Code Review - src/backend/utils/misc/guc.c @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] = }, { +{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER, +gettext_noop("node id for multimaster."), +NULL +}, + &guc_replication_origin_id, +InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId, +NULL, assign_replication_node_id, NULL +}, I'd rather see us refer to this as the 'node id for logical replication' over the multimaster node id. I think that terminology will be less controversial. Multi-master means different things to different people and it is still unclear what forms of multi-master we will have in-core. For example, most people don't consider Slony to be multi-master replication. If a future version of Slony were to feed off logical replication (instead of triggers) then I think it would need this node id to determine which node a particular change has come from. The description inside the gettext call should probably be "Sets the node id for ." to be consistent with the description of the rest of the GUCs. BootStrapXLOG in xlog.c creates an XLogRecord structure; should it set xl_origin_id to InvalidMultimasterNodeId? WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a well-defined value. I think InvalidMultimasterNodeId should be safe even for a no-op record from a node that actually has a node_id set on real records. backend/replication/logical/logical.c: XLogRecPtr current_replication_origin_lsn = {0, 0}; This variable isn't used/referenced by this patch; it probably belongs as part of the later patch. Steve Andres
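To make the BootStrapXLOG/WriteEmptyXLOG point concrete: both places hand-build a record rather than going through XLogInsert, so the new field would have to be initialized explicitly. A rough sketch of the kind of change being asked for, with the surrounding initialization abbreviated and not copied from actual patch code:

    record = (XLogRecord *) ((char *) page + SizeOfXLogLongPHD);
    /* ... existing initialization of xl_prev, xl_xid, xl_tot_len, xl_len,
     * xl_info and xl_rmid stays as it is today ... */
    record->xl_origin_id = InvalidMultimasterNodeId;   /* new: don't leave it unset */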
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
The previous mail contained a patch with a mismerge caused by reording commits. Corrected version attached. Thanks to Steve Singer for noticing this quickly. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From efb28b1f931a30738faac83703810b598a82a6ee Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 9 Jun 2012 14:49:34 +0200 Subject: [PATCH] Introduce the concept that wal has a 'origin' node One solution to avoid loops when doing wal based logical replication in topologies which are more complex than one unidirectional transport is introducing the concept of a 'origin_id' into the wal stream. Luckily there is some padding in the XLogRecord struct that allows us to add that field without further bloating the struct. This solution was chosen because it allows for just about any topology and is inobstrusive. This adds a new configuration parameter multimaster_node_id which determines the id used for wal originating in one cluster. When applying changes from wal from another cluster code can set the variable current_replication_origin_id. This is a global variable because passing it through everything which can generate wal would be far to intrusive. --- src/backend/access/transam/xact.c | 54 + src/backend/access/transam/xlog.c |3 +- src/backend/access/transam/xlogreader.c |2 + src/backend/replication/logical/Makefile |2 +- src/backend/replication/logical/logical.c | 19 + src/backend/utils/misc/guc.c | 19 + src/backend/utils/misc/postgresql.conf.sample |3 ++ src/include/access/xlog.h |4 +- src/include/access/xlogdefs.h |2 + src/include/replication/logical.h | 22 ++ 10 files changed, 100 insertions(+), 30 deletions(-) create mode 100644 src/backend/replication/logical/logical.c create mode 100644 src/include/replication/logical.h diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 3cc2bfa..79ec19a 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -36,8 +36,9 @@ #include "libpq/be-fsstubs.h" #include "miscadmin.h" #include "pgstat.h" -#include "replication/walsender.h" +#include "replication/logical.h" #include "replication/syncrep.h" +#include "replication/walsender.h" #include "storage/lmgr.h" #include "storage/predicate.h" #include "storage/procarray.h" @@ -4545,12 +4546,13 @@ xactGetCommittedChildren(TransactionId **ptr) * actions for which the order of execution is critical. 
*/ static void -xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn, - TransactionId *sub_xids, int nsubxacts, - SharedInvalidationMessage *inval_msgs, int nmsgs, - RelFileNode *xnodes, int nrels, - Oid dbId, Oid tsId, - uint32 xinfo) +xact_redo_commit_internal(TransactionId xid, RepNodeId originating_node, + XLogRecPtr lsn, XLogRecPtr origin_lsn, + TransactionId *sub_xids, int nsubxacts, + SharedInvalidationMessage *inval_msgs, int nmsgs, + RelFileNode *xnodes, int nrels, + Oid dbId, Oid tsId, + uint32 xinfo) { TransactionId max_xid; int i; @@ -4659,8 +4661,8 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn, * Utility function to call xact_redo_commit_internal after breaking down xlrec */ static void -xact_redo_commit(xl_xact_commit *xlrec, - TransactionId xid, XLogRecPtr lsn) +xact_redo_commit(xl_xact_commit *xlrec, RepNodeId originating_node, + TransactionId xid, XLogRecPtr lsn) { TransactionId *subxacts; SharedInvalidationMessage *inval_msgs; @@ -4670,27 +4672,26 @@ xact_redo_commit(xl_xact_commit *xlrec, /* invalidation messages array follows subxids */ inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]); - xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts, - inval_msgs, xlrec->nmsgs, - xlrec->xnodes, xlrec->nrels, - xlrec->dbId, - xlrec->tsId, - xlrec->xinfo); + xact_redo_commit_internal(xid, originating_node, lsn, xlrec->origin_lsn, + subxacts, xlrec->nsubxacts, inval_msgs, + xlrec->nmsgs, xlrec->xnodes, xlrec->nrels, + xlrec->dbId, xlrec->tsId, xlrec->xinfo); } /* * Utility function to call xact_redo_commit_internal for compact form of message. */ static void -xact_redo_commit_compact(xl_xact_commit_compact *xlrec, - TransactionId xid, XLogRecPtr lsn) +xact_redo_commit_compact(xl_xact_commit_compact *xlrec, RepNodeId originating_node, + TransactionId xid, XLogRecPtr lsn) { - xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxact