Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Friday, June 29, 2012 02:43:49 PM Boszormenyi Zoltan wrote:
> On 2012-06-19 09:24, Andres Freund wrote:
> > On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> > > On 12-06-18 07:30 AM, Andres Freund wrote:
> > > > Hrmpf #666. I will go through the series commit-by-commit again
> > > > to make sure everything compiles again. Reordering this late
> > > > definitely wasn't a good idea...
> > > >
> > > > I pushed a rebased version with all those fixups (and removal of the
> > > > zeroRecPtr patch).
> > >
> > > Where did you push that rebased version to? I don't see an attachment,
> > > or an updated patch in the commitfest app, and your repo at
> > > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> > > hasn't been updated in 5 days.
> >
> > To the 2ndquadrant internal repo. Which strangely doesn't help you.
> > *Headdesk*. Pushed to the correct repo and manually verified.
>
> Which repository is the correct one?
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
> was refreshed 11 days ago. The patch taken from there fails with a reject
> in src/include/access/xlog.h.

That's the right repository, but Heikki's recent changes
(dfda6ebaec6763090fb78b458a979b558c50b39b and several following) changed quite
a bit around on master, and I am currently rebasing and addressing review
comments.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 2012-06-19 09:24, Andres Freund wrote:
> On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote:
> > On 12-06-18 07:30 AM, Andres Freund wrote:
> > > Hrmpf #666. I will go through the series commit-by-commit again to make
> > > sure everything compiles again. Reordering this late definitely wasn't
> > > a good idea...
> > >
> > > I pushed a rebased version with all those fixups (and removal of the
> > > zeroRecPtr patch).
> >
> > Where did you push that rebased version to? I don't see an attachment,
> > or an updated patch in the commitfest app, and your repo at
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
> > hasn't been updated in 5 days.
>
> To the 2ndquadrant internal repo. Which strangely doesn't help you.
> *Headdesk*. Pushed to the correct repo and manually verified.

Which repository is the correct one?
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git
was refreshed 11 days ago. The patch taken from there fails with a reject
in src/include/access/xlog.h.

> Andres

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 10:12:46 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund wrote:
> > On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> > > On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > > > OK, so in this case, I still don't see how the "origin_id" is even
> > > > > enough.
> > > > >
> > > > > C applies the change originally from A (routed through B, because
> > > > > it's faster). But when it gets the change directly from A, how
> > > > > does it know to *not* apply it again?
> > > >
> > > > The lsn of the change.
> > >
> > > So why isn't the LSN good enough for when C propagates the change back
> > > to A?
> >
> > Because every node has individual progress in the wal, so the lsn
> > doesn't mean anything unless you know which node it originally is from.
> >
> > > Why does A need more information than C?
> >
> > Didn't I explain that two mails up?
>
> Probably, but that didn't mean I understood it... I'm trying to keep up
> here ;-)

Heh. Yes. This developed into a huge mess already ;)

> So the origin_id isn't strictly for the origin node to know to filter an
> LCR it has applied already; it is also there to correlate the LSNs,
> because the LSNs of the re-generated LCRs are meant to contain the
> originating node's LSN, and every node applying LCRs needs to be able to
> know where it is in every other node's LSN progress.

There are multiple use-cases for it; this is one of them.

> I had assumed any LCRs generated on a node would be relative to the
> LSN sequencing of that node.

We have the original lsn in the commit record. That is needed to support
crash recovery, because you need to know where to restart applying changes
from.

> > Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> > north america.
> > #1 wants changes from #3 and #4 when talking to #3 but not those from #2
> > and itself (because those would be cheaper to get locally).
>
> Right, but if the link between #1 and #2 ever "slows down", changes
> from #3 and #4 may very well already have #2's changes, and even
> require them.
>
> #1 has to apply them, or is it going to stop applying LCRs from #3
> when it sees LCRs from #3 coming in that originate on #2 and have
> LSNs greater than what it has so far received from #2?

We will see ;). There are several solutions possible, possibly even depending
on the use-case. This patch just wants to provide the very basic information
required to implement such solutions...

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
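To make the "individual progress in the wal" point concrete, here is a minimal
standalone C sketch of how an apply process might track, per origin node, the
highest origin LSN it has already applied, and skip a change that arrives
again over a different route. The type and function names are invented for
illustration and are not taken from the patch series.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;   /* illustrative 16-bit origin identifier */
    typedef uint64_t RepLsn;        /* illustrative LSN as a plain 64-bit value */

    #define MAX_ORIGINS 65536

    /* Highest origin LSN already applied, indexed by origin id. */
    static RepLsn applied_up_to[MAX_ORIGINS];

    /*
     * A change is identified by the node it originated on plus its LSN on
     * that node, so the same change arriving via two different peers is
     * applied exactly once.
     */
    static bool
    change_needs_apply(RepOriginId origin, RepLsn origin_lsn)
    {
        return origin_lsn > applied_up_to[origin];
    }

    /* Remember progress after a change (or commit) has been applied. */
    static void
    advance_origin_progress(RepOriginId origin, RepLsn origin_lsn)
    {
        if (origin_lsn > applied_up_to[origin])
            applied_up_to[origin] = origin_lsn;
    }

This is also why the LSN alone is not enough: the same numeric LSN means
different things on different nodes, so the pair (origin id, origin LSN) is
the minimal key for deduplication.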
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:49 PM, Andres Freund wrote:
> On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> > On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > > OK, so in this case, I still don't see how the "origin_id" is even
> > > > enough.
> > > >
> > > > C applies the change originally from A (routed through B, because
> > > > it's faster). But when it gets the change directly from A, how does
> > > > it know to *not* apply it again?
> > >
> > > The lsn of the change.
> >
> > So why isn't the LSN good enough for when C propagates the change back
> > to A?
>
> Because every node has individual progress in the wal, so the lsn doesn't
> mean anything unless you know which node it originally is from.
>
> > Why does A need more information than C?
>
> Didn't I explain that two mails up?

Probably, but that didn't mean I understood it... I'm trying to keep up
here ;-)

So the origin_id isn't strictly for the origin node to know to filter an LCR
it has applied already; it is also there to correlate the LSNs, because the
LSNs of the re-generated LCRs are meant to contain the originating node's
LSN, and every node applying LCRs needs to be able to know where it is in
every other node's LSN progress.

I had assumed any LCRs generated on a node would be relative to the LSN
sequencing of that node.

> Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in
> north america.
> #1 wants changes from #3 and #4 when talking to #3 but not those from #2
> and itself (because those would be cheaper to get locally).

Right, but if the link between #1 and #2 ever "slows down", changes from #3
and #4 may very well already have #2's changes, and even require them.

#1 has to apply them, or is it going to stop applying LCRs from #3 when it
sees LCRs from #3 coming in that originate on #2 and have LSNs greater than
what it has so far received from #2?

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:41:03 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > > OK, so in this case, I still don't see how the "origin_id" is even
> > > enough.
> > >
> > > C applies the change originally from A (routed through B, because it's
> > > faster). But when it gets the change directly from A, how does it
> > > know to *not* apply it again?
> >
> > The lsn of the change.
>
> So why isn't the LSN good enough for when C propagates the change back to
> A?

Because every node has individual progress in the wal, so the lsn doesn't
mean anything unless you know which node it originally is from.

> Why does A need more information than C?

Didn't I explain that two mails up? Perhaps Chris' phrasing explains the
basic idea better:

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> The case where it would be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored downstream
> - That INSERT was replicated from node #2, so should be ignored downstream
> - This UPDATE came from the local node, so needs to be passed to
>   downstream users

Now imagine a scenario where #1 and #2 are in Europe and #3 and #4 in north
america.
#1 wants changes from #3 and #4 when talking to #3 but not those from #2 and
itself (because those would be cheaper to get locally).

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
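The filtering Chris describes amounts to a small decision function per
connection. A hypothetical sketch (the node numbering follows the #1..#4
example above; nothing here is actual patch code): when node #1 talks to
node #3 it asks for everything except changes that originated on #1 or #2,
because those are cheaper to get locally.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;

    #define LOCAL_NODE_ID 1        /* illustrative id for node #1 */

    /*
     * Per-connection filter: when streaming to a peer, forward a change only
     * if its origin is in the set the peer asked for.  'wanted' is a bitmap
     * of origin ids the peer wants over this connection.
     */
    static bool
    forward_to_peer(const uint8_t *wanted, RepOriginId origin)
    {
        return (wanted[origin / 8] >> (origin % 8)) & 1;
    }

    /*
     * Apply-side loop prevention: never re-apply (or send back) a change
     * whose origin is the local node itself.
     */
    static bool
    accept_from_peer(RepOriginId origin)
    {
        return origin != LOCAL_NODE_ID;
    }

With the origin id available on every record, both checks are a single
comparison against a small integer, which is the efficiency argument made
throughout this thread.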
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:27 PM, Andres Freund wrote:
> > OK, so in this case, I still don't see how the "origin_id" is even
> > enough.
> >
> > C applies the change originally from A (routed through B, because it's
> > faster). But when it gets the change directly from A, how does it
> > know to *not* apply it again?
>
> The lsn of the change.

So why isn't the LSN good enough for when C propagates the change back to A?

Why does A need more information than C?

a.

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:23:34 PM Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
>
> I reserve the right to object to that, too :-). Others raised the
> concern that a 16-bit integer is not a very intuitive identifier. Also,
> as discussed, for more complex scenarios just the originid is not
> sufficient. ISTM that we need more flexibility.

I don't think the "16bit integer is unintuitive" argument is that
interesting. As pointed out by multiple people in the thread, the origin_id
can be local and mapped to something more complex in the communication
between the different nodes and in the configuration. Before applying changes
from another node you look up their "complex id" and map it to the locally
assigned 16bit origin_id, which then gets written into the wal stream. When
decoding the wal stream into the LCR stream it is mapped the other way.

We might need more information than that at a later point, but that probably
won't be needed during low-level filtering of wal before reassembling it into
transactions...

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
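A rough illustration of the mapping described above, with invented names:
externally a node might be identified by something verbose (a UUID or a
connection string) kept in configuration, while each node assigns its own
small 16-bit id, and only that compact id is written to WAL. This is a
standalone sketch, not code from the patch.

    #include <stdint.h>
    #include <string.h>

    typedef uint16_t RepOriginId;

    #define InvalidRepOriginId  0
    #define MAX_MAPPED_ORIGINS  1024

    /* Hypothetical local mapping table: external name -> local 16-bit id. */
    static char origin_names[MAX_MAPPED_ORIGINS][64];
    static int  n_origins = 0;

    /* Look up (or assign) the local 16-bit id for an external identifier. */
    static RepOriginId
    origin_id_for(const char *external_name)
    {
        int i;

        for (i = 0; i < n_origins; i++)
        {
            if (strcmp(origin_names[i], external_name) == 0)
                return (RepOriginId) (i + 1);   /* 0 is reserved as invalid */
        }
        if (n_origins >= MAX_MAPPED_ORIGINS)
            return InvalidRepOriginId;          /* caller handles exhaustion */
        strncpy(origin_names[n_origins], external_name,
                sizeof(origin_names[0]) - 1);
        return (RepOriginId) (++n_origins);
    }

Decoding does the reverse lookup, so the inter-node protocol and the
configuration can use whatever richer identifier they like while the WAL
itself only carries two bytes.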
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 09:24:29 PM Aidan Van Dyk wrote:
> On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund wrote:
> > To recap why we think origin_id is a sensible design choice:
> >
> > There are many sensible replication topologies where it does make sense
> > that you want to receive changes (on node C) from one node (say B) that
> > originated from some other node (say A).
> > Reasons include:
> > * the order of applying changes should be as similar as possible on all
> >   nodes. That means when applying a change on C that originated on B,
> >   and if changes replicated faster from A->B than from A->C, you want to
> >   be at least as far with the replication from A as B was. Otherwise the
> >   conflict ratio will increase. If you can recreate the stream from the
> >   wal of every node and still detect where an individual change
> >   originated, that's easy.
>
> OK, so in this case, I still don't see how the "origin_id" is even enough.
>
> C applies the change originally from A (routed through B, because it's
> faster). But when it gets the change directly from A, how does it
> know to *not* apply it again?

The lsn of the change.

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 03:23, Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
>
> I reserve the right to object to that, too :-).

OK. But that would be only for MMR, using special record types.

> Others raised the concern that a 16-bit integer is not a very intuitive
> identifier.

Of course.

> Also, as discussed, for more complex scenarios just the originid is not
> sufficient. ISTM that we need more flexibility.

Of course.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 03:13, Heikki Linnakangas wrote:
> On 20.06.2012 21:51, Simon Riggs wrote:
> > On 21 June 2012 02:32, Heikki Linnakangas wrote:
> > > I'm not saying that we need to implement all possible conflict
> > > resolution algorithms right now - on the contrary I think conflict
> > > resolution belongs outside core
> >
> > It's a pretty standard requirement to have user defined conflict
> > handling, if that's what you mean.
> >
> > I'm OK with conflict handling being outside of core as a module, if
> > that's what people think. We just need a commit hook. That is more
> > likely to allow a workable solution with 9.3 as well, ISTM. It's
> > likely to be contentious...
>
> Hmm, what do you need to happen at commit time?
>
> We already have RegisterXactCallback(), if that's enough...

Let me refer back to my notes and discuss this later; I'm just going from
vague memory here, which isn't helpful.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
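For reference, the hook Heikki mentions already exists in the backend. Below
is a minimal sketch of how a loadable module could use it; the registration
call and the XactEvent values are the real API from access/xact.h, while the
callback body and its name are invented placeholders for whatever a conflict
handler would actually do.

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/xact.h"

    PG_MODULE_MAGIC;

    void _PG_init(void);

    /* Invoked by the transaction machinery at end of transaction. */
    static void
    conflict_hook_xact_callback(XactEvent event, void *arg)
    {
        switch (event)
        {
            case XACT_EVENT_COMMIT:
                /* e.g. finalize conflict-resolution state for this xact */
                elog(DEBUG1, "conflict hook: transaction committed");
                break;
            case XACT_EVENT_ABORT:
                /* discard any per-transaction conflict state */
                break;
            default:
                break;
        }
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(conflict_hook_xact_callback, NULL);
    }

Whether a callback at this point is early enough (for example, to veto or
rewrite a commit) is exactly the open question in the exchange above.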
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 3:15 PM, Andres Freund wrote:
> To recap why we think origin_id is a sensible design choice:
>
> There are many sensible replication topologies where it does make sense
> that you want to receive changes (on node C) from one node (say B) that
> originated from some other node (say A).
> Reasons include:
> * the order of applying changes should be as similar as possible on all
>   nodes. That means when applying a change on C that originated on B, and
>   if changes replicated faster from A->B than from A->C, you want to be at
>   least as far with the replication from A as B was. Otherwise the conflict
>   ratio will increase. If you can recreate the stream from the wal of every
>   node and still detect where an individual change originated, that's easy.

OK, so in this case, I still don't see how the "origin_id" is even enough.

C applies the change originally from A (routed through B, because it's
faster). But when it gets the change directly from A, how does it know to
*not* apply it again?

> * the interconnects between some nodes may be more expensive than others
> * an interconnect between two nodes may fail while others don't
>
> Because of that we think it is sensible to be able to generate the full
> LCR stream, with all changes, local and remote ones, on each individual
> node. If you can then filter on individual origin_ids you can build
> complex replication topologies without much additional complexity.
>
> > I'm not saying that we need to implement all possible conflict
> > resolution algorithms right now - on the contrary I think conflict
> > resolution belongs outside core - but if we're going to change the WAL
> > record format to support such conflict resolution, we better make sure
> > the foundation we provide for it is solid.
>
> I think this already provides a lot. At some point we probably want to
> have support for looking up on which node a certain local xid originated
> and when it was originally executed. While querying that efficiently
> requires additional support, we already have all the information for that.
>
> There are some more complexities with consistently determining conflicts
> on changes that happened in a very small time window on different nodes,
> but that's something for another day.
>
> > BTW, one way to work around the lack of origin id in the WAL record
> > header is to just add an origin-id column to the table, indicating the
> > last node that updated the row. That would be a kludge, but I thought
> > I'd mention it..
>
> Yuck. The aim is to improve on what's done today ;)

--
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 22:11, Simon Riggs wrote:
> On 21 June 2012 02:56, Simon Riggs wrote:
> > I think allowing rmgrs to redefine the wasted bytes in the header is
> > the best idea.
>
> Hmm, I think the best idea is to save 2 bytes off the WAL header for all
> records, so there are no wasted bytes on 64bit or 32bit. That way the
> potential for use goes away and there's benefit for all, plus no argument
> about how to use those bytes in rarer cases. I'll work on that.

I don't think that's actually necessary; the WAL bloat isn't *that* bad that
we need to start shaving bytes from there. I was just trying to make a point.

> And then we just put the originid on each heap record for MMR, in some
> manner, discussed later.

I reserve the right to object to that, too :-). Others raised the concern
that a 16-bit integer is not a very intuitive identifier. Also, as discussed,
for more complex scenarios just the originid is not sufficient. ISTM that we
need more flexibility.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi,

On Wednesday, June 20, 2012 08:32:53 PM Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > > On 20.06.2012 11:17, Simon Riggs wrote:
> > > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > > So, if the origin id is not sufficient for some conflict resolution
> > > > > mechanisms, what extra information do you need for those, and where
> > > > > do you put it?
> > > >
> > > > As explained elsewhere, wal_level = logical (or similar) would be
> > > > used to provide any additional logical information required.
> > > >
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> > > >
> > > > In the case of reflexive updates you raised, a typical response in
> > > > other DBMS would be to represent the query
> > > >
> > > >    UPDATE SET counter = counter + 1
> > > >
> > > > by sending just the "+1" part, not the current value of counter, as
> > > > would be the case with the non-reflexive update
> > > >
> > > >    UPDATE SET counter = 1
> > > >
> > > > Handling such things in Postgres would require some subtlety, which
> > > > would not be resolved in first release but is pretty certain not to
> > > > require any changes to the WAL record header as a way of resolving
> > > > it. Having already thought about it, I'd estimate that is a very long
> > > > discussion and not relevant to the OT, but if you wish to have it
> > > > here, I won't stop you.
> > >
> > > Yeah, I'd like to hear briefly how you would handle that without any
> > > further changes to the WAL record header.
> >
> > I already did:
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> >
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> >
> > That has nothing to do with the WAL record header.
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update
> and delete records.

Imo the whole "+1" stuff doesn't have anything to do with the origin_id
proposal and should be ignored for quite a while. We might go to something
like it sometime in the future, but it's nothing we are working on (as far as
I know ;)).

wal_level=logical (in patch 07) currently only changes the following things
about the wal stream:

For HEAP_(INSERT|(HOT_)?UPDATE|DELETE):
* prevent full page writes from removing the row data (could be optimized at
  some point to just store the tuple slot)

For HEAP_DELETE:
* add the primary key of the changed row

HEAP_MULTI_INSERT obviously needs to get the same treatment in the future.

The only real addition that I foresee in the near future is logging the old
primary key when the primary key changes in HEAP_UPDATE. Kevin wants an
option for full pre-images of rows in HEAP_(UPDATE|DELETE).

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header
> alone? Do you ever need extra information on other record types?

It's needed in some more locations: HEAP_HOT_UPDATE, HEAP2_MULTI_INSERT,
HEAP_NEWPAGE, HEAP_XACT_(ASSIGN, COMMIT, COMMIT_PREPARED, COMMIT_COMPACT,
ABORT, ABORT_PREPARED), and probably some I don't remember right now.

Sure, we can add it to all of those, but then you need individual knowledge
about *all* of them, because the location where the origin is stored will be
different for each one.

To recap why we think origin_id is a sensible design choice:

There are many sensible replication topologies where it does make sense that
you want to receive changes (on node C) from one node (say B) that originated
from some other node (say A).

Reasons include:
* the order of applying changes should be as similar as possible on all
  nodes. That means when applying a change on C that originated on B, and if
  changes replicated faster from A->B than from A->C, you want to be at least
  as far with the replication from A as B was. Otherwise the conflict ratio
  will increase. If you can recreate the stream from the wal of every node
  and still detect where an individual change originated, that's easy.
* the interconnects between some nodes may be more expensive than others
* an interconnect between two nodes may fail while others don't

Because of that we think it is sensible to be able to generate the full LCR
stream, with all changes, local and remote ones, on each individual node. If
you can then filter on individual origin_ids you can build complex
replication topologies without much additional complexity.

> I'm not saying that we need to implement all possible conflict
> resolution algorithms right now - on the contrary I think conflict
> resolution belongs outside core - but if we're going to change the WAL
> record format to support such conflict resolution, we better make sure
> the foundation we provide for it is solid.

I think this already provides a lot. At some point we probably want to have
support for looking up on which node a certain local xid originated and when
it was originally executed. While querying that efficiently requires
additional support, we already have all the information for that.

There are some more complexities with consistently determining conflicts on
changes that happened in a very small time window on different nodes, but
that's something for another day.

> BTW, one way to work around the lack of origin id in the WAL record
> header is to just add an origin-id column to the table, indicating the
> last node that updated the row. That would be a kludge, but I thought
> I'd mention it..

Yuck. The aim is to improve on what's done today ;)

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
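The low-level filtering argument is easiest to see in code. Below is a
simplified, hypothetical record header (not the actual PostgreSQL XLogRecord;
the struct and field names are invented) with the origin in a fixed place, so
a filter can run with no rmgr-specific knowledge, contrasted with the
per-record-type alternative that needs one accessor per record type.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t RepOriginId;

    /* Hypothetical, simplified record header -- not the real XLogRecord. */
    typedef struct SimpleRecordHeader
    {
        uint32_t     rec_len;      /* total length of the record */
        uint8_t      rmid;         /* resource manager id */
        uint8_t      info;         /* rmgr-specific flag bits */
        RepOriginId  origin_id;    /* node the change originated on */
    } SimpleRecordHeader;

    /*
     * With the origin in the header, filtering is rmgr-agnostic: one
     * comparison, before the record is decoded or the transaction
     * reassembled.
     */
    static bool
    keep_record(const SimpleRecordHeader *rec, RepOriginId unwanted_origin)
    {
        return rec->origin_id != unwanted_origin;
    }

    /*
     * The alternative discussed above: the origin lives somewhere inside the
     * rmgr-specific payload, so each record type (heap insert, update,
     * delete, multi-insert, the commit variants, ...) needs its own code to
     * answer the same question.
     */
    static RepOriginId
    origin_from_payload(uint8_t rmid, const char *payload)
    {
        /* one case per record type would go here */
        (void) rmid;
        (void) payload;
        return 0;   /* placeholder: origin not recoverable generically */
    }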
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 21:51, Simon Riggs wrote:
> On 21 June 2012 02:32, Heikki Linnakangas wrote:
> > I'm not saying that we need to implement all possible conflict
> > resolution algorithms right now - on the contrary I think conflict
> > resolution belongs outside core
>
> It's a pretty standard requirement to have user defined conflict
> handling, if that's what you mean.
>
> I'm OK with conflict handling being outside of core as a module, if
> that's what people think. We just need a commit hook. That is more
> likely to allow a workable solution with 9.3 as well, ISTM. It's
> likely to be contentious...

Hmm, what do you need to happen at commit time?

We already have RegisterXactCallback(), if that's enough...

> > BTW, one way to work around the lack of origin id in the WAL record
> > header is to just add an origin-id column to the table, indicating the
> > last node that updated the row. That would be a kludge, but I thought
> > I'd mention it..
>
> err, I hope you mean that to be funny. (It wouldn't actually work either)

No, I wasn't serious that we should implement it that way. But now you made
me curious; why would it not work? If there's an origin-id column in a table,
it's included in every heap insert/delete/update WAL record. Just set it to
the current node's id on a local modification, and to the origin's id when
replaying changes from another node, and you have the exact same information
as you would with the extra field in the WAL record header.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:56, Simon Riggs wrote:
> I think allowing rmgrs to redefine the wasted bytes in the header is
> the best idea.

Hmm, I think the best idea is to save 2 bytes off the WAL header for all
records, so there are no wasted bytes on 64bit or 32bit. That way the
potential for use goes away and there's benefit for all, plus no argument
about how to use those bytes in rarer cases. I'll work on that.

And then we just put the originid on each heap record for MMR, in some
manner, discussed later.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:45, Heikki Linnakangas wrote:
> On 20.06.2012 16:46, Simon Riggs wrote:
> > The proposal now includes flag bits that would allow the addition of a
> > variable length header, should that ever become necessary. So the
> > unused space in the fixed header is not being "used up" as you say. In
> > any case, the fixed header still has 4 wasted bytes on 64bit systems
> > even after the patch is applied. So this claim of short sightedness is
> > just plain wrong.
> > ...
> > We need to add information to every WAL record that is used as the
> > source for generating LCRs. It is also possible to add this to HEAP
> > and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
> > using the *currently wasted* bytes on a WAL record header does *not*
> > bloat the WAL stream.

Wonderful ideas; these look good.

> Or, we could provide a mechanism for resource managers to use those
> padding bytes for whatever data they wish to use.

Sounds better to me.

> Or modify the record format so that the last 4 bytes of the data in the
> WAL record are always automatically stored in those padding bytes, thus
> making all WAL records 4 bytes shorter. That would make the WAL even more
> compact, with only a couple of extra CPU instructions in the critical
> path.

Sounds cool, but a little weird, even for me.

> My point is that it's wrong to think that it's free to use those bytes,
> just because they're currently unused. If we use them for one thing, we
> can't use them for other things anymore. If we're so concerned about WAL
> bloat that we can't afford to add any more bytes to the WAL record header
> or heap WAL records, then it would be equally fruitful to look at ways to
> use those padding bytes to save that precious WAL space.

Agreed. Thanks for sharing those ideas. Exactly why I like the list
(really...)

> I don't think we're *that* concerned about the WAL bloat, however. So
> let's see what is the most sensible place to add whatever extra
> information we need in the WAL, from the point of view of
> maintainability, flexibility, readability etc. Then we can decide where
> to put it.

Removing FPWs is still the most important aspect there.

I think allowing rmgrs to redefine the wasted bytes in the header is the
best idea.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 02:32, Heikki Linnakangas wrote:
> On 20.06.2012 17:35, Simon Riggs wrote:
> > On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > > On 20.06.2012 11:17, Simon Riggs wrote:
> > > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > > So, if the origin id is not sufficient for some conflict resolution
> > > > > mechanisms, what extra information do you need for those, and where
> > > > > do you put it?
> > > >
> > > > As explained elsewhere, wal_level = logical (or similar) would be
> > > > used to provide any additional logical information required.
> > > >
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> > > >
> > > > In the case of reflexive updates you raised, a typical response in
> > > > other DBMS would be to represent the query
> > > >
> > > >    UPDATE SET counter = counter + 1
> > > >
> > > > by sending just the "+1" part, not the current value of counter, as
> > > > would be the case with the non-reflexive update
> > > >
> > > >    UPDATE SET counter = 1
> > > >
> > > > Handling such things in Postgres would require some subtlety, which
> > > > would not be resolved in first release but is pretty certain not to
> > > > require any changes to the WAL record header as a way of resolving
> > > > it. Having already thought about it, I'd estimate that is a very long
> > > > discussion and not relevant to the OT, but if you wish to have it
> > > > here, I won't stop you.
> > >
> > > Yeah, I'd like to hear briefly how you would handle that without any
> > > further changes to the WAL record header.
> >
> > I already did:
> > > > Update and Delete WAL records already need to be different in that
> > > > mode, so additional info would be placed there, if there were any.
> >
> > The case you mentioned relates to UPDATEs only, so I would suggest
> > that we add that information to a new form of update record only.
> >
> > That has nothing to do with the WAL record header.
>
> Hmm, so you need the origin id in the WAL record header to do filtering.
> Except when that's not enough, you add some more fields to heap update and
> delete records.

Yes.

> Don't you think it would be simpler to only add the extra fields to heap
> insert, update and delete records, and leave the WAL record header alone?
> Do you ever need extra information on other record types?

No extra info on other record types, in general at least. Doing it that way
is the most logical place, just not the most efficient.

> The other question is, *what* information do you need to put there? We
> don't necessarily need to have a detailed design of that right now, but it
> seems quite clear to me that we need more flexibility there than this
> patch provides, to support more complicated conflict resolution.

Another patch already covers that for the common non-reflexive case. More on
conflict resolution soon, I guess. I was going to begin working on that in
July, starting with the design discussed on list, of course. This patch has
thrown up stuff I thought was compartmentalised, but I was wrong.

> I'm not saying that we need to implement all possible conflict resolution
> algorithms right now - on the contrary I think conflict resolution belongs
> outside core

It's a pretty standard requirement to have user defined conflict handling,
if that's what you mean.

I'm OK with conflict handling being outside of core as a module, if that's
what people think. We just need a commit hook. That is more likely to allow
a workable solution with 9.3 as well, ISTM. It's likely to be contentious...

> - but if we're going to change the WAL record format to support such
> conflict resolution, we better make sure the foundation we provide for it
> is solid.

Agreed.

> BTW, one way to work around the lack of origin id in the WAL record header
> is to just add an origin-id column to the table, indicating the last node
> that updated the row. That would be a kludge, but I thought I'd mention
> it..

err, I hope you mean that to be funny. (It wouldn't actually work either)

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 16:46, Simon Riggs wrote:
> The proposal now includes flag bits that would allow the addition of a
> variable length header, should that ever become necessary. So the
> unused space in the fixed header is not being "used up" as you say. In
> any case, the fixed header still has 4 wasted bytes on 64bit systems
> even after the patch is applied. So this claim of short sightedness is
> just plain wrong.
> ...
> We need to add information to every WAL record that is used as the
> source for generating LCRs. It is also possible to add this to HEAP
> and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
> using the *currently wasted* bytes on a WAL record header does *not*
> bloat the WAL stream.

Or, we could provide a mechanism for resource managers to use those padding
bytes for whatever data they wish to use. Or modify the record format so that
the last 4 bytes of the data in the WAL record are always automatically
stored in those padding bytes, thus making all WAL records 4 bytes shorter.
That would make the WAL even more compact, with only a couple of extra CPU
instructions in the critical path.

My point is that it's wrong to think that it's free to use those bytes, just
because they're currently unused. If we use them for one thing, we can't use
them for other things anymore. If we're so concerned about WAL bloat that we
can't afford to add any more bytes to the WAL record header or heap WAL
records, then it would be equally fruitful to look at ways to use those
padding bytes to save that precious WAL space.

I don't think we're *that* concerned about the WAL bloat, however. So let's
see what is the most sensible place to add whatever extra information we need
in the WAL, from the point of view of maintainability, flexibility,
readability etc. Then we can decide where to put it.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
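For readers without the source at hand, the padding being argued over comes
from alignment of the fixed record header. The layout below is an
approximation of the 9.2-era XLogRecord, reproduced from memory for
illustration only; see src/include/access/xlog.h for the authoritative
definition.

    #include <stdint.h>

    /* Approximation of the fixed WAL record header circa PostgreSQL 9.2. */
    typedef struct ApproxXLogRecord
    {
        uint32_t  xl_crc;        /* CRC of this record              (4 bytes) */
        uint64_t  xl_prev;       /* link to previous record         (8 bytes) */
        uint32_t  xl_xid;        /* transaction id                  (4 bytes) */
        uint32_t  xl_tot_len;    /* total length of entire record   (4 bytes) */
        uint32_t  xl_len;        /* length of rmgr-specific data    (4 bytes) */
        uint8_t   xl_info;       /* flag bits                       (1 byte)  */
        uint8_t   xl_rmid;       /* resource manager id             (1 byte)  */
        /*
         * 26 bytes of fields so far.  The rmgr data that follows must be
         * MAXALIGNed, so the struct is padded to 28 bytes (MAXALIGN 4) or
         * 32 bytes (MAXALIGN 8) -- a few currently unused bytes per record.
         * The origin_id proposal claims 2 of them; the alternatives above
         * would hand them to the rmgrs or fill them with record data.
         */
    } ApproxXLogRecord;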
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 07:50:37 PM Robert Haas wrote:
> On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund wrote:
> > > I realized a problem with that idea this morning: it might work for
> > > reading things, but if anyone attempts to write data you've got big
> > > problems. Maybe we could get away with forbidding that, not sure.
> >
> > Hm, why is writing a problem? You mean io conversion routines writing
> > data? Yes, that will be a problem. I am fine with simply forbidding
> > that, we should be able to catch that and provide a sensible error
> > message, since SSI we have the support for that.
>
> I think we could do something a little more vigorous than that, like
> maybe error out if anyone tries to do anything that would write WAL or
> acquire an XID.

I would go for all of them ;). The read only transaction warnings will
probably result in the best error messages.

> Of course, then the question becomes: then what? We probably need to
> think about what happens after that - we don't want an error replicating
> one row in one table to permanently break replication for the entire
> system.

Interesting problem, yes. My first reaction was just to warn, log and throw
the transaction away... But that's not likely to work very well on the apply
side...

I don't think it's a particularly likely problem though. An I/O function
doing that would probably fail in lots of other situations (HS, read only
transactions, permission problems).

> > > Sorry. I don't think you're planning to do something evil, but before
> > > I thought you said you did NOT want to write the code to extract
> > > changes as text or something similar.
> >
> > Hm. I might have been a bit ambiguous when saying that I do not want to
> > provide everything for that use-case.
> > Once we have a callpoint that has a correct catalog snapshot for exactly
> > the tuple in question, text conversion is damn near trivial. The point
> > where you get passed all that information (action, tuple, table,
> > snapshot) is the one I think the patch should mainly provide.
>
> This is actually a very interesting list. We could rephrase the
> high-level question about the design of this feature as "what is the
> best way to make sure that you have these things available?". Action
> and tuple are trivial to get, and table isn't too hard either. It's
> really the snapshot - and all the downstream information that can only
> be obtained via using that snapshot - that is the hard part.

For others, a sensible entry point into this discussion before switching
subthreads probably is
http://archives.postgresql.org/message-id/201206192023.20589.and...@2ndquadrant.com

The table isn't as easy as you might think, as the wal record only contains
the relfilenode. Unless you want to log more, it is basically solved by
solving the snapshot issue. And yes, agreeing on how to do that is the thing
we need to solve next. Patch 07 (add enough information to wal) and this one
are the first that should get committed imo. And they should be ready for
that obviously.

Just noticed that I failed to properly add Patch 07 to the commitfest. I have
done so now, hope that's ok.

Greetings,

Andres
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 17:35, Simon Riggs wrote:
> On 20 June 2012 16:23, Heikki Linnakangas wrote:
> > On 20.06.2012 11:17, Simon Riggs wrote:
> > > On 20 June 2012 15:45, Heikki Linnakangas wrote:
> > > > So, if the origin id is not sufficient for some conflict resolution
> > > > mechanisms, what extra information do you need for those, and where
> > > > do you put it?
> > >
> > > As explained elsewhere, wal_level = logical (or similar) would be used
> > > to provide any additional logical information required.
> > >
> > > Update and Delete WAL records already need to be different in that
> > > mode, so additional info would be placed there, if there were any.
> > >
> > > In the case of reflexive updates you raised, a typical response in
> > > other DBMS would be to represent the query
> > >
> > >    UPDATE SET counter = counter + 1
> > >
> > > by sending just the "+1" part, not the current value of counter, as
> > > would be the case with the non-reflexive update
> > >
> > >    UPDATE SET counter = 1
> > >
> > > Handling such things in Postgres would require some subtlety, which
> > > would not be resolved in first release but is pretty certain not to
> > > require any changes to the WAL record header as a way of resolving it.
> > > Having already thought about it, I'd estimate that is a very long
> > > discussion and not relevant to the OT, but if you wish to have it
> > > here, I won't stop you.
> >
> > Yeah, I'd like to hear briefly how you would handle that without any
> > further changes to the WAL record header.
>
> I already did:
> > > Update and Delete WAL records already need to be different in that
> > > mode, so additional info would be placed there, if there were any.
>
> The case you mentioned relates to UPDATEs only, so I would suggest
> that we add that information to a new form of update record only.
>
> That has nothing to do with the WAL record header.

Hmm, so you need the origin id in the WAL record header to do filtering.
Except when that's not enough, you add some more fields to heap update and
delete records. Don't you think it would be simpler to only add the extra
fields to heap insert, update and delete records, and leave the WAL record
header alone? Do you ever need extra information on other record types?

The other question is, *what* information do you need to put there? We don't
necessarily need to have a detailed design of that right now, but it seems
quite clear to me that we need more flexibility there than this patch
provides, to support more complicated conflict resolution.

I'm not saying that we need to implement all possible conflict resolution
algorithms right now - on the contrary I think conflict resolution belongs
outside core - but if we're going to change the WAL record format to support
such conflict resolution, we better make sure the foundation we provide for
it is solid.

BTW, one way to work around the lack of origin id in the WAL record header is
to just add an origin-id column to the table, indicating the last node that
updated the row. That would be a kludge, but I thought I'd mention it..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
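To illustrate the "counter = counter + 1" point: shipping the operation
rather than the resulting value makes concurrent increments from different
origins commute, so they can be applied in any order without a conflict. A
toy standalone sketch (the types and names are invented; nothing like this
exists in the patch set):

    #include <stdint.h>

    /* Two hypothetical ways an LCR could describe the same UPDATE. */
    typedef enum ChangeKind
    {
        CHANGE_SET_VALUE,   /* UPDATE ... SET counter = 1          : new value */
        CHANGE_ADD_DELTA    /* UPDATE ... SET counter = counter + 1: ship "+1" */
    } ChangeKind;

    typedef struct CounterChange
    {
        ChangeKind kind;
        int64_t    value;       /* new value, or delta to add */
    } CounterChange;

    /*
     * Applying a delta commutes with other deltas, so two nodes incrementing
     * the same row concurrently converge without conflict resolution.
     * Applying an absolute value does not commute and needs a conflict
     * policy instead.
     */
    static int64_t
    apply_change(int64_t current, const CounterChange *c)
    {
        if (c->kind == CHANGE_ADD_DELTA)
            return current + c->value;
        return c->value;
    }

As both sides note above, none of this needs anything from the WAL record
header; it is extra payload in the update record itself.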
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 01:06, Christopher Browne wrote:
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure. In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.

Unfortunately you do. That's really the core of how this differs from current
Slony.

Every change we make creates WAL records, whether that is a change
originating on the current node or a change originating on an upstream node
that needs to be applied on the current node. The WAL stream is then read and
filtered for changes to pass on to other nodes. So we want to be able to
filter out the applied changes to avoid passing them back to the original
nodes.

Having each record know the origin makes the filtering much simpler, so if
it's possible to do that efficiently then it's the best design. It turns out
to be the best way to do this so far known. There are other designs, however,
as noted. In all cases we need the origin id in the WAL.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi Chris!

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund wrote:
> > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> > > Simon Riggs wrote:
> > > > This is not transaction metadata, it is WAL record metadata
> > > > required for multi-master replication, see later point.
> > > >
> > > > We need to add information to every WAL record that is used as the
> > > > source for generating LCRs.
> > >
> > > If the origin ID of a transaction doesn't count as transaction
> > > metadata (i.e., data about the transaction), what does? It may be a
> > > metadata element about which you have special concerns, but it is
> > > transaction metadata. You don't plan on supporting individual WAL
> > > records within a transaction containing different values for origin
> > > ID, do you? If not, why is it something to store in every WAL
> > > record rather than once per transaction? That's not intended to be
> > > a rhetorical question.
> >
> > It's definitely possible to store it per transaction (see the discussion
> > around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com);
> > it just makes filtering on the originating node a considerably more
> > complex thing. With our proposal you can do it without any complexity
> > involved, on a low level. Storing it per transaction means you can only
> > stream out the data to other nodes *after* fully reassembling the
> > transaction. That's a pity, especially if we go for a design where the
> > decoding happens in a proxy instance.
>
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure. In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.
>
> It seems as though the point in time at which you need to know the
> origin ID is the moment at which you're deciding to read data from the
> WAL files, and knowing which stream you are reading from is an
> assertion that might be satisfied by looking at configuration that
> doesn't need to be in the WAL stream itself. It might be *nice* for
> the WAL stream to be self-identifying, but that doesn't seem to be
> forcibly necessary.
>
> The case where it *would* be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored downstream
> - That INSERT was replicated from node #2, so should be ignored downstream
> - This UPDATE came from the local node, so needs to be passed to
>   downstream users

Exactly, that is the point. And you want to do that in an efficient manner
without too much logic; that's why something simple like the record header is
so appealing.

> > I also have to admit that I am very hesitant to start developing some
> > generic "transaction metadata" framework atm. That seems to be a good
> > way to spend a good part of the time in discussion and disagreeing. Imo
> > that's something for later.
>
> Well, I see there being a use in there being at least 3 sorts of LCR
> records:
> a) Capturing literal SQL that is to be replayed downstream
> b) Capturing tuple updates in a binary form that can be turned readily
>    into heap updates on a replica.
> c) Capturing tuple data in some reasonably portable and readily
>    re-writable form

I think we should provide the utilities to do all of those. a) is a
consequence of being able to do c). That doesn't really have anything to do
with this subthread though? The part you quoted above was my response to the
suggestion to add some generic framework to attach metadata to individual
transactions on the generating side. We quite possibly will end up needing
that, but I personally don't think we should be designing that part atm.

> b) Capturing tuple updates in a binary form that can be turned readily
>    into heap updates on a replica.
>    Unfortunately, this form is likely not to play well when
>    replicating across platforms or Postgres versions, so I suspect that
>    this performance optimization should be implemented as a *last*
>    resort, rather than first. Michael Jackson had some "rules of
>    optimization" that said "don't do it", and, for the expert, "don't do
>    it YET..."

Well, apply is a bottleneck. Besides field experience, I/we have benchmarked
it, and it is rather plausible that it is. And I don't think we can magically
make that faster in pg in general, so my plan is to remove the biggest cost
factor I can see. And yes, it will have restrictions...

Regards,

Andres Freund
--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
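Chris's three LCR forms can be thought of as one stream with several payload
representations. A purely illustrative sketch of that distinction (none of
these types exist in the patch set):

    #include <stddef.h>

    /* Hypothetical payload representations for a logical change record. */
    typedef enum LCRPayloadKind
    {
        LCR_PAYLOAD_SQL,        /* (a) literal SQL text to replay downstream    */
        LCR_PAYLOAD_BINARY,     /* (b) binary tuple; same platform/version only */
        LCR_PAYLOAD_PORTABLE    /* (c) typed datums in a portable text form     */
    } LCRPayloadKind;

    typedef struct LCRPayload
    {
        LCRPayloadKind kind;    /* how 'data' is to be interpreted */
        size_t         len;     /* payload length in bytes */
        const char    *data;    /* payload itself */
    } LCRPayload;

The apply-speed argument in the thread is essentially that (b) is the fast
path while (a) and (c) are the portable fallbacks.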
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 21 June 2012 01:40, Andres Freund wrote:
> > I think extraction is a very sensible place to start; actually, I
> > think it's the best possible place to start. But this particular
> > thread is about adding origin_ids to WAL, which I think is definitely
> > not the best place to start.
>
> Yep. I think the reason everyone started at it is that the patch was
> actually really simple ;).
> Note that the wal enrichment & decoding patches were before the origin_id
> patch in the patch series ;)

Actually, I thought it would be non-contentious... A format change, so early
in the release cycle. Already openly discussed with no comment. Using wasted
space for efficiency, for information already in use by Slony for similar
reasons. Oh well.

--
 Simon Riggs                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 1:40 PM, Andres Freund wrote:
> > I realized a problem with that idea this morning: it might work for
> > reading things, but if anyone attempts to write data you've got big
> > problems. Maybe we could get away with forbidding that, not sure.
>
> Hm, why is writing a problem? You mean io conversion routines writing
> data? Yes, that will be a problem. I am fine with simply forbidding that,
> we should be able to catch that and provide a sensible error message,
> since SSI we have the support for that.

I think we could do something a little more vigorous than that, like maybe
error out if anyone tries to do anything that would write WAL or acquire an
XID. Of course, then the question becomes: then what? We probably need to
think about what happens after that - we don't want an error replicating one
row in one table to permanently break replication for the entire system.

> > Sorry. I don't think you're planning to do something evil, but before
> > I thought you said you did NOT want to write the code to extract
> > changes as text or something similar.
>
> Hm. I might have been a bit ambiguous when saying that I do not want to
> provide everything for that use-case.
> Once we have a callpoint that has a correct catalog snapshot for exactly
> the tuple in question, text conversion is damn near trivial. The point
> where you get passed all that information (action, tuple, table,
> snapshot) is the one I think the patch should mainly provide.

This is actually a very interesting list. We could rephrase the high-level
question about the design of this feature as "what is the best way to make
sure that you have these things available?". Action and tuple are trivial to
get, and table isn't too hard either. It's really the snapshot - and all the
downstream information that can only be obtained via using that snapshot -
that is the hard part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
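A rough idea of the guard Robert describes, using existing backend
facilities. Whether these are the right checks, and where they would be
installed, is exactly what is being discussed; the function itself is
invented for illustration.

    #include "postgres.h"
    #include "access/xact.h"
    #include "access/transam.h"

    /*
     * Sketch: called from a decoding context before (or after) running
     * user-supplied code such as output/type conversion functions.  If the
     * transaction has acquired an XID, something attempted a write, so bail
     * out with a clear error instead of corrupting the decoding state.
     */
    static void
    ensure_decoding_is_read_only(void)
    {
        if (TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot write data while decoding WAL changes")));
    }

The harder question raised above remains what to do afterwards, so that a
single failing row does not permanently wedge replication for the whole
system.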
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 23:56, Robert Haas wrote:
> On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs wrote:
> > > > > But I think getting even single-master logical replication working
> > > > > well in a single release cycle is going to be a job and a half.
> > > >
> > > > OK, so your estimate is 1.5 people to do that. And if we have more
> > > > people, should they sit around doing nothing?
> > >
> > > Oh, give me a break. You're willfully missing my point. And to quote
> > > Fred Brooks, nine women can't make a baby in one month.
> >
> > No, I'm not. The question is not how quickly can N people achieve a
> > single thing, but how long will it take a few skilled people working
> > on carefully selected tasks that have few dependencies between them to
> > achieve something.
>
> The bottleneck is getting the design right, not writing the code.
> Selecting tasks for people to work on without an agreement on the
> design will not advance the process, unless we just accept whatever
> code you choose to write based on whatever design you happen to pick.

Right. Which is why there are multiple people doing design on different
aspects, and much work going into prototyping to give some useful information
to debates that would otherwise be bikeshedding.

> > How exactly did you arrive at your conclusion? Why is yours right and
> > mine wrong?
>
> I estimated the amount of work that would be required to do this right and
> compared it to other large projects that have been successfully done in
> the past. I think you are looking at something on the order of magnitude
> of the Windows port, which took about four releases to become stable, or
> the SE-Linux project, which still isn't feature-complete. Even if it's
> only a HS-sized project, that took two releases, as did streaming
> replication. SSI got committed within one release cycle, but there were
> several years of design and advocacy work before any code was written, so
> that, too, was really a multi-year project.

The current BDR project uses concepts that have been in discussion for many
years, something like 6-7 years. I first wrote multi-master for Postgres in
userspace in 2007, and the current designs build upon that. My detailed
design for WAL-based logical replication was produced in 2009. I'm not really
breaking new ground in the design, and we're reusing significant parts of
existing code. You don't know that because you didn't think to ask, which is
a little surprising.

On top of that, Andres has refined many of the basic ideas and what he
presents here is a level above my initial design work. I have every
confidence that he'll work independently and produce useful code while other
aspects are considered.

> I'll confine my comments on the second part of the question to the
> observation that it is a bit early to know who is right and who is wrong,
> but the question could just as easily be turned on its head.
>
> > No, I have four people who had initial objections and who have not
> > commented on the fact that the points made are regrettably incorrect.
>
> I think Kevin addressed this point better than I can. Asserting something
> doesn't make it true, and you haven't offered any rational argument
> against the points that have been made, probably because there isn't one.
> We *cannot* justify stealing 100% of the available bit space for a feature
> that many people won't use and may not be enough to address the real
> requirement anyway.

You have some assertions above that aren't correct, so it's good that you
recognise that assertions may not be true, on both sides.

I ask that you listen a little before casting judgements. We won't get
anywhere otherwise.

The current wasted space on 64bit systems is 6 bytes per record. The current
suggestion is to use 2 of those, in a flexible manner that allows future
expansion if it is required, and only if it is required. That flexibility
covers anything else we need. Asserting that "100% of the available space"
is being used is wrong, and to continue to assert that even when told
otherwise multiple times is something else.

> > Since at least 3 of the people making such comments did not attend the
> > full briefing meeting in Ottawa, I am not particularly surprised.
> > However, I do expect people that didn't come to the meeting to
> > recognise that they are likely to be missing information and to listen
> > closely, as I listen to them.
>
> Participation in the community development process is not contingent on
> having flown to Ottawa in May, or on having decided to spend that evening
> at your briefing meeting. Attributing to ignorance what is adequately
> explained by honest disagreement is impolite.

I don't think people need to have attended that briefing. But if they did
not, then they do need to be aware that they missed hours of community
discussion and explanation, built on top of months of careful investigation,
all of which was aimed at helping everybody understand. Jumping straight
into a discussion is what the community is about
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 07:17:57 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund wrote: > > I would prefer the event trigger way because that seems to be the most > > extensible/reusable. It would allow a fully replicated databases and > > catalog only instances. > > I think we need to design event triggers in a way you cannot simply > > circumvent them. We already have the case that if users try to screw > > around system triggers we give back wrong answers with the planner > > relying on foreign keys btw. > > If the problem is having user trigger after system triggers: Lets make > > that impossible. Forbidding DDL on the other instances once we have that > > isn't that hard. > > So, this is interesting. I think something like this could solve the > problem, but then why not just make it built-in code that runs from > the same place as the event trigger rather than using the trigger > mechanism per se? Presumably the "trigger" code's purpose is going to > be to inject additional data into the WAL stream (am I wrong?) which > is not something you're going to be able to do from PL/pgsql anyway, > so you don't really need a trigger, just a call to some C function - > which not only has the advantage of being not bypassable, but is also > faster. I would be totally fine with that. As long as event triggers provide the infrastructure that shouldn't be a big problem. > > Perhaps all that will get simpler if we can make reading the catalog via > > custom built snapshots work as you proposed otherwhere in this thread. > > That would make checking errors way much easier even if you just want to > > apply to a database with exactly the same schema. Thats the next thing I > > plan to work on. > I realized a problem with that idea this morning: it might work for > reading things, but if anyone attempts to write data you've got big > problems. Maybe we could get away with forbidding that, not sure. Hm, why is writing a problem? You mean io conversion routines writing data? Yes, that will be a problem. I am fine with simply forbidding that, we should be able to catch that and provide a sensible error message, since SSI we have the support for that. > Would be nice to get some input from other hackers on this. Oh, yes! > > I agree that the focus isn't 100% optimal and that there are *loads* of > > issues we haven't event started to look at. But you need a point to > > start and extraction & apply seems to be a good one because you can > > actually test it without the other issues solved which is not really the > > case the other way round. > > Also its possible to plug in the newly built changeset extraction into > > existing solutions to make them more efficient while retaining most of > > their respective framework. > > > > So I disagree that thats the wrong part to start with. > > I think extraction is a very sensible place to start; actually, I > think it's the best possible place to start. But this particular > thread is about adding origin_ids to WAL, which I think is definitely > not the best place to start. Yep. I think the reason everyone started at it is that the patch was actually really simple ;). Note that the wal enrichement & decoding patches were before the origin_id patch in the patchseries ;) > > I definitely do want to provide code that generates a textual > > representation of the changes. As you say, even if its not used for > > anything its needed for debugging. Not sure if it should be sql or maybe > > the new slony representation. 
If thats provided and reusable it should > > make sure that ontop of that other solutions can be built. > Oh, yeah. If we can get that, I will throw a party. Good ;) > > I find your supposition that I/we just want to get MMR without regard for > > anything else a bit offensive. I wrote at least three times in this > > thread that I do think its likely that we will not get more than the > > minimal basis for implementing MMR into 9.3. I wrote multiple times that > > I want to provide the basis for multiple solutions. The prototype - > > while obviously being incomplete - tried hard to be modular. > > You cannot blame us that we want the work we do to be *also* usable for > > what one of our major aims? > > What can I do to convince you/others that I am not planning to do > > something "evil" but that I try to reach as many goals at once as > > possible? > Sorry. I don't think you're planning to do something evil, but before > I thought you said you did NOT want to write the code to extract > changes as text or something similar. Hm. I might have been a bit ambiguous when saying that I do not want to provide everything for that use-case. Once we have a callpoint that has a correct catalog snapshot for exactly the tuple in question text conversion is damn near trivial. The point where you get passed all that information (action, tuple, table, snapshot) is the one I think the patch should mainly provide. > I think that
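A sketch of what that callpoint could look like, purely for illustration; the callback name and exact argument list are invented here and are not taken from the posted patches:

    /* Hypothetical per-change callback an output layer could implement. */
    typedef enum { LCR_INSERT, LCR_UPDATE, LCR_DELETE } LCRAction;

    typedef void (*lcr_change_cb) (LCRAction action,
                                   Relation rel,        /* target table */
                                   HeapTuple oldtuple,  /* NULL for INSERT */
                                   HeapTuple newtuple,  /* NULL for DELETE */
                                   Snapshot snapshot);  /* catalog snapshot
                                                         * valid for the tuple */

With the catalog snapshot supplied alongside the tuple, a trivial consumer can render each change as text (for debugging, an SQL-ish form, or the newer Slony format) without knowing anything about how the change was pulled out of WAL.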
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 12:53 PM, Andres Freund wrote: > I would prefer the event trigger way because that seems to be the most > extensible/reusable. It would allow a fully replicated databases and catalog > only instances. > I think we need to design event triggers in a way you cannot simply circumvent > them. We already have the case that if users try to screw around system > triggers we give back wrong answers with the planner relying on foreign keys > btw. > If the problem is having user trigger after system triggers: Lets make that > impossible. Forbidding DDL on the other instances once we have that isn't that > hard. So, this is interesting. I think something like this could solve the problem, but then why not just make it built-in code that runs from the same place as the event trigger rather than using the trigger mechanism per se? Presumably the "trigger" code's purpose is going to be to inject additional data into the WAL stream (am I wrong?) which is not something you're going to be able to do from PL/pgsql anyway, so you don't really need a trigger, just a call to some C function - which not only has the advantage of being not bypassable, but is also faster. > Perhaps all that will get simpler if we can make reading the catalog via > custom built snapshots work as you proposed otherwhere in this thread. That > would make checking errors way much easier even if you just want to apply to > a database with exactly the same schema. Thats the next thing I plan to work > on. I realized a problem with that idea this morning: it might work for reading things, but if anyone attempts to write data you've got big problems. Maybe we could get away with forbidding that, not sure. Would be nice to get some input from other hackers on this. >> They could also modify the catalogs directly, although it's possible we >> don't care quite as much about that case (but on the other hand people >> do sometimes need to do it to solve real problems). > With that you already can crash the database perfectly fine today. I think > trying to care for that is a waste of time. You're probably right. > I agree that the focus isn't 100% optimal and that there are *loads* of issues > we haven't event started to look at. But you need a point to start and > extraction & apply seems to be a good one because you can actually test it > without the other issues solved which is not really the case the other way > round. > Also its possible to plug in the newly built changeset extraction into > existing solutions to make them more efficient while retaining most of their > respective framework. > > So I disagree that thats the wrong part to start with. I think extraction is a very sensible place to start; actually, I think it's the best possible place to start. But this particular thread is about adding origin_ids to WAL, which I think is definitely not the best place to start. > I definitely do want to provide code that generates a textual representation > of the changes. As you say, even if its not used for anything its needed for > debugging. Not sure if it should be sql or maybe the new slony representation. > If thats provided and reusable it should make sure that ontop of that other > solutions can be built. Oh, yeah. If we can get that, I will throw a party. > I find your supposition that I/we just want to get MMR without regard for > anything else a bit offensive. 
I wrote at least three times in this thread > that I do think its likely that we will not get more than the minimal basis > for implementing MMR into 9.3. I wrote multiple times that I want to provide > the basis for multiple solutions. The prototype - while obviously being > incomplete - tried hard to be modular. > You cannot blame us that we want the work we do to be *also* usable for what > one of our major aims? > What can I do to convince you/others that I am not planning to do something > "evil" but that I try to reach as many goals at once as possible? Sorry. I don't think you're planning to do something evil, but before I thought you said you did NOT want to write the code to extract changes as text or something similar. I think that would be a really bad thing to skip for all kinds of reasons. I think we need that as a foundational technology before we do much else. Now, once we have that, if we can safely detect cases where it's OK to bypass decoding to text and skip it in just those cases, I think that's great (although possibly difficult to implement correctly). I basically feel that without decode-to-text, this can't possibly be a basis for multiple solutions; it will be a basis only for itself, and extremely difficult to debug, too. No other replication solution can even theoretically have any use for the raw on-disk tuple, at least not without horrible kludgery. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To ma
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund wrote: > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: >> Simon Riggs wrote: >> > This is not transaction metadata, it is WAL record metadata >> > required for multi-master replication, see later point. > >> > We need to add information to every WAL record that is used as the >> > source for generating LCRs. >> If the origin ID of a transaction doesn't count as transaction >> metadata (i.e., data about the transaction), what does? It may be a >> metadata element about which you have special concerns, but it is >> transaction metadata. You don't plan on supporting individual WAL >> records within a transaction containing different values for origin >> ID, do you? If not, why is it something to store in every WAL >> record rather than once per transaction? That's not intended to be >> a rhetorical question. > Its definitely possible to store it per transaction (see the discussion around > http://archives.postgresql.org/message- > id/201206201605.43634.and...@2ndquadrant.com) it just makes the filtering via > the originating node a considerably more complex thing. With our proposal you > can do it without any complexity involved, on a low level. Storing it per > transaction means you can only stream out the data to other nodes *after* > fully reassembling the transaction. Thats a pitty, especially if we go for a > design where the decoding happens in a proxy instance. I guess I'm not seeing the purpose to having the origin node id in the WAL stream either. We have it in the Slony sl_log_* stream, however there is a crucial difference, in that sl_log_* is expressly a shared structure. In contrast, WAL isn't directly sharable; you don't mix together multiple WAL streams. It seems as though the point in time at which you need to know the origin ID is the moment at which you're deciding to read data from the WAL files, and knowing which stream you are reading from is an assertion that might be satisfied by looking at configuration that doesn't need to be in the WAL stream itself. It might be *nice* for the WAL stream to be self-identifying, but that doesn't seem to be forcibly necessary. The case where it *would* be needful is if you are in the process of assembling together updates coming in from multiple masters, and need to know: - This INSERT was replicated from node #1, so should be ignored downstream - That INSERT was replicated from node #2, so should be ignored downstream - This UPDATE came from the local node, so needs to be passed to downstream users Or perhaps something else is behind the node id being deeply embedded into the stream that I'm not seeing altogether. > Other metadata will not be needed on such a low level. > > I also have to admit that I am very hesitant to start developing some generic > "transaction metadata" framework atm. That seems to be a good way to spend a > good part of time in discussion and disagreeing. Imo thats something for > later. Well, I see there being a use in there being at least 3 sorts of LCR records: a) Capturing literal SQL that is to replayed downstream. This parallels two use cases existing in existing replication systems: i) In pre-2.2 versions of Slony, statements are replayed literally. So there's a stream of INSERT/UPDATE/DELETE statements. ii) DDL capture and replay. In existing replication systems, DDL isn't captured implicitly, the way Dimitri's Event Triggers are to do, but rather is captured explicitly. 
There should be a function to allow injecting such SQL explicitly; that is sure to be a useful sort of thing to be able to do. b) Capturing tuple updates in a binary form that can be turned readily into heap updates on a replica. Unfortunately, this form is likely not to play well when replicating across platforms or Postgres versions, so I suspect that this performance optimization should be implemented as a *last* resort, rather than first. Michael Jackson had some "rules of optimization" that said "don't do it", and, for the expert, "don't do it YET..." c) Capturing tuple data in some reasonably portable and readily re-writable form. Slony 2.2 changes from "SQL fragments" (of a) i) above) to storing updates as an array of text values indicating: - relation name - attribute names - attribute values, serialized into strings I don't know that this provably represents the *BEST* representation, but it definitely will be portable where b) would not be, and lends itself to being able to reuse query plans, where a) requires extraordinary amounts of parsing work, today. So I'm pretty sure it's better than a) and b) for a sizable set of cases. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
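For anyone unfamiliar with the Slony 2.2 format being referenced, option c) amounts to something like the following; the struct is hypothetical, shown only to make the shape concrete (the real sl_log_* layout differs in detail):

    /* Hypothetical portable change representation, per option c) above. */
    typedef struct PortableChange
    {
        char    action;      /* 'I', 'U' or 'D' */
        char   *relname;     /* e.g. "public.accounts" */
        int     natts;
        char  **attnames;    /* e.g. {"id", "balance"} */
        char  **attvalues;   /* values serialized to text,
                              * e.g. {"42", "100.00"}; NULL for SQL NULL */
    } PortableChange;

Because everything is text, the same record applies across platforms and Postgres versions, and the apply side can build one parameterized plan per (relname, attnames) combination and reuse it, which is where the plan-reuse advantage over option a) comes from.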
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 23:34, Kevin Grittner wrote: > Simon Riggs wrote: >> Kevin Grittner wrote: Heikki Linnakangas wrote: >>> I don't like the idea of adding the origin id to the record header. It's only required in some occasions, and on some record types. >>> >>> Right. >> >> Wrong, as explained. > > The point is not wrong; you are simply not responding to what is > being said. Heikki said that the origin ID was not required for all MMR configs/scenarios. IMHO that is wrong, with explanation given. By agreeing with him, I assumed you were sharing that assertion, rather than saying something else. > You have not explained why an origin ID is required when there is no > replication, or if there is master/slave logical replication, or ... You're right; I never claimed it was needed. Origin Id is only needed for multi-master replication and that is the only context I've discussed it. >> This is not transaction metadata, it is WAL record metadata >> required for multi-master replication, see later point. >> >> We need to add information to every WAL record that is used as the >> source for generating LCRs. > > If the origin ID of a transaction doesn't count as transaction > metadata (i.e., data about the transaction), what does? It may be a > metadata element about which you have special concerns, but it is > transaction metadata. You don't plan on supporting individual WAL > records within a transaction containing different values for origin > ID, do you? If not, why is it something to store in every WAL > record rather than once per transaction? That's not intended to be > a rhetorical question. I think it's because you're still thinking > of the WAL stream as *the medium* for logical replication data > rather than *the source* of logical replication data. > As long as the WAL stream is the medium, options are very > constrained. You can code a very fast engine to handle a single > type of configuration that way, and perhaps that should be a > supported feature, but it's not a configuration I've needed yet. > (Well, on reflection, if it had been available and easy to use, I > can think of *one* time I *might* have used it for a pair of nodes.) > It seems to me that you are so focused on this one use case that you > are not considering how design choices which facilitate fast > development of that use case paint us into a corner in terms of > expanding to other use cases. >>> I think removing origin ID from this patch and submitting a >>> separate patch for a generalized transaction metadata system is >>> the sensible way to go. >> >> We already have a very flexible WAL system for recording data of >> interest to various resource managers. If you wish to annotate a >> transaction, you can either generate a new kind of WAL record or >> you can enhance a commit record. > > Right. Like many of us are suggesting should be done for origin ID. > >> XLOG_NOOP records can already be generated by your application if >> you wish to inject additional metadata to the WAL stream. So no >> changes are required for you to implement the generalised >> transaction metadata scheme you say you require. > > I'm glad it's that easy. Are there SQL functions to for that yet? Yes, another possible design is to generate a new kind of WAL record for the origin id. Doing it that way will slow down multi-master by a measurable amount, and slightly bloat the WAL stream. The proposed way uses space that is currently wasted and likely to remain so. 
Only 2 bytes of 6 bytes available are proposed for use, with a flag design that allows future extension if required. When MMR is not in use, the WAL records would look completely identical to the way they look now, in size, settings and speed of writing them. Putting the origin id onto each WAL record allows very fast and simple stateless filtering. I suggest using it because those bytes have been sitting there unused for close to 10 years now and no better use springs to mind. The proposed design is the fastest way of implementing MMR, without any loss for non-users. As I noted before, slowing down MMR by a small amount causes geometric losses in performance across the whole cluster. >> Not sure how or why that relates to requirements for multi-master. > > That depends on whether you want to leave the door open to other > logical replication than the one use case on which you are currently > focused. I even consider some of those other cases multi-master, > especially when multiple databases are replicating to a single table > on another server. I'm not clear on your definition -- it seems to > be rather more narrow. Maybe we need to define some terms somewhere > to facilitate discussion. Is there a Wiki page where that would > make sense? The project is called BiDirectional Replication to ensure that people understood this is not just multi-master. But that doesn't mean that multi-master can't have its own specific requirements. Adding origin
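The "fast and simple stateless filtering" being claimed would look roughly like the fragment below, assuming the hypothetical two-byte origin field in the record header and a sender-side list of origins the peer does not want over this connection:

    /*
     * Sketch only: decide per record whether to forward it, with no need
     * to reassemble transactions or keep any per-transaction state.
     * xl_origin_id is the proposed (not committed) header field.
     */
    static bool
    record_wanted_by_peer(XLogRecord *record,
                          const uint16 *suppressed_origins, int n)
    {
        int     i;

        for (i = 0; i < n; i++)
        {
            /* peer already gets this origin over a cheaper path */
            if (record->xl_origin_id == suppressed_origins[i])
                return false;
        }
        return true;
    }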
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Wednesday, June 20, 2012 05:44:09 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund wrote: > > Were not the only ones here that are performing scope creep though... I > > think about all people who have posted in the whole thread except maybe > > Tom and Marko are guilty of doing so. > > > > I still think its rather sensible to focus on exactly duplicated schemas > > in a very first version just because that leaves out some of the > > complexity while paving the road for other nice things. > > Well, I guess what I want to know is: what does focusing on exactly > duplicated schemas mean? If it means we'll disable DDL for tables > when we turn on replication, that's basically the Slony approach: when > you want to make a DDL change, you have to quiesce replication, do it, > and then resume replication. I would possibly be OK with that > approach. If it means that we'll hope that the schemas are duplicated > and start spewing garbage data when they're not, then I'm not > definitely not OK with that approach. If it means using event > triggers to keep the catalogs synchronized, then I don't think I don't > think that's adequately robust. The user could add more event > triggers that run before or after the ones the replication system > adds, and then you are back to garbage decoding (or crashes). I would prefer the event trigger way because that seems to be the most extensible/reusable. It would allow a fully replicated databases and catalog only instances. I think we need to design event triggers in a way you cannot simply circumvent them. We already have the case that if users try to screw around system triggers we give back wrong answers with the planner relying on foreign keys btw. If the problem is having user trigger after system triggers: Lets make that impossible. Forbidding DDL on the other instances once we have that isn't that hard. Perhaps all that will get simpler if we can make reading the catalog via custom built snapshots work as you proposed otherwhere in this thread. That would make checking errors way much easier even if you just want to apply to a database with exactly the same schema. Thats the next thing I plan to work on. > They could also modify the catalogs directly, although it's possible we > don't care quite as much about that case (but on the other hand people > do sometimes need to do it to solve real problems). With that you already can crash the database perfectly fine today. I think trying to care for that is a waste of time. > Although I am > 100% OK with pairing back the initial feature set - indeed, I strongly > recommend it - I think that robustness is not a feature which can be > left out in v1 and added in later. All the robustness has to be > designed in at the start, or we will never have it. I definitely don't intend to cut down on robustness. > On the whole, I think we're spending far too much time talking about > code and far too little time talking about what the overall design > should look like. Agreed. > We are having a discussion about whether or not MMR > should be supported by sticking a 16-bit node ID into every WAL record > without having first decided whether we should support MMR, whether > that requires node IDs, whether they should be integers, whether those > integers should be 16 bits in size, whether they should be present in > WAL, and whether or not the record header is the right place to put > them. There's a right order in which to resolve those questions, and > this isn't it. 
More generally, I think there is a ton of complexity > that we're probably overlooking here in focusing in on specific coding > details. I think the most interesting comment made to date is Steve > Singer's observation that very little of Slony is concerned with > changeset extraction or apply. Now, on the flip side, all of these > patches seem to be concerned with changeset extraction and apply. > That suggests that we're missing some pretty significant pieces > somewhere in this design. I think those pieces are things like error > recovery, fault tolerance, user interface design, and control logic. > Slony has spent years trying to get those things right. Whether or > not they actually have gotten them right is of course an arguable > point, but we're unlikely to do better by ignoring all of those issues > and implementing whatever is most technically expedient. I agree that the focus isn't 100% optimal and that there are *loads* of issues we haven't event started to look at. But you need a point to start and extraction & apply seems to be a good one because you can actually test it without the other issues solved which is not really the case the other way round. Also its possible to plug in the newly built changeset extraction into existing solutions to make them more efficient while retaining most of their respective framework. So I disagree that thats the wrong part to start with. > >> You've got f
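One way to read the "just a call to some C function" idea is a fixed hook invoked from the utility-command path itself, at the same point an event trigger would fire, so that it cannot be reordered around or dropped from SQL. A minimal sketch with entirely invented names:

    /* Hypothetical hook, set by a replication plugin at load time. */
    typedef void (*ddl_capture_hook_type) (Node *parsetree,
                                           const char *command_tag);

    extern ddl_capture_hook_type ddl_capture_hook;

    /*
     * Conceptually, standard_ProcessUtility() would do:
     *
     *     if (ddl_capture_hook)
     *         ddl_capture_hook(parsetree, CreateCommandTag(parsetree));
     *
     * Being C code rather than a PL function, the hook can also write
     * whatever additional WAL it needs, which PL/pgSQL could not.
     */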
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:08 AM, Simon Riggs wrote: But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. >>> >>> OK, so your estimate is 1.5 people to do that. And if we have more >>> people, should they sit around doing nothing? >> >> Oh, give me a break. You're willfully missing my point. And to quote >> Fred Brooks, nine women can't make a baby in one month. > > No, I'm not. The question is not how quickly can N people achieve a > single thing, but how long will it take a few skilled people working > on carefully selected tasks that have few dependencies between them to > achieve something. The bottleneck is getting the design right, not writing the code. Selecting tasks for people to work on without an agreement on the design will not advance the process, unless we just accept whatever code you choose to right based on whatever design you happen to pick. > How exactly did you arrive at your conclusion? Why is yours right and > mine wrong? I estimated the amount of work that would be required to do this right and compared it to other large projects that have been successfully done in the past. I think you are looking at something on the order of magnitude of the Windows port, which took about four releases to become stable, or the SE-Linux project, which still isn't feature-complete. Even if it's only a HS-sized project, that took two releases, as did streaming replication. SSI got committed within one release cycle, but there were several years of design and advocacy work before any code was written, so that, too, was really a multi-year project. I'll confine my comments on the second part of the question to the observation that it is a bit early to know who is right and who is wrong, but the question could just as easily be turned on its head. > No, I have four people who had initial objections and who have not > commented on the fact that the points made are regrettably incorrect. I think Kevin addressed this point better than I can. Asserting something doesn't make it true, and you haven't offered any rational argument against the points that have been made, probably because there isn't one. We *cannot* justify steeling 100% of the available bit space for a feature that many people won't use and may not be enough to address the real requirement anyway. > Since at least 3 of the people making such comments did not attend the > full briefing meeting in Ottawa, I am not particularly surprised. > However, I do expect people that didn't come to the meeting to > recognise that they are likely to be missing information and to listen > closely, as I listen to them. Participation in the community development process is not contingent on having flown to Ottawa in May, or on having decided to spend that evening at your briefing meeting. Attributing to ignorance what is adequately explained by honest disagreement is impolite. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote: > Simon Riggs wrote: > > This is not transaction metadata, it is WAL record metadata > > required for multi-master replication, see later point. > > We need to add information to every WAL record that is used as the > > source for generating LCRs. > If the origin ID of a transaction doesn't count as transaction > metadata (i.e., data about the transaction), what does? It may be a > metadata element about which you have special concerns, but it is > transaction metadata. You don't plan on supporting individual WAL > records within a transaction containing different values for origin > ID, do you? If not, why is it something to store in every WAL > record rather than once per transaction? That's not intended to be > a rhetorical question. It's definitely possible to store it per transaction (see the discussion around http://archives.postgresql.org/message-id/201206201605.43634.and...@2ndquadrant.com); it just makes the filtering via the originating node a considerably more complex thing. With our proposal you can do it without any complexity involved, on a low level. Storing it per transaction means you can only stream out the data to other nodes *after* fully reassembling the transaction. That's a pity, especially if we go for a design where the decoding happens in a proxy instance. Other metadata will not be needed on such a low level. I also have to admit that I am very hesitant to start developing some generic "transaction metadata" framework atm. That seems to be a good way to spend a good part of time in discussion and disagreeing. Imo that's something for later. > I think it's because you're still thinking > of the WAL stream as *the medium* for logical replication data > rather than *the source* of logical replication data. I don't think that's true. See the above referenced subthread for reasons why I think the origin id is important. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
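To make the trade-off concrete: if the origin were recorded only once per transaction (say, on the commit record), a sender that filters by origin has to buffer every record by xid until the commit arrives. The helpers below are hypothetical, and the sketch exists only to show where the extra state appears compared with the per-record field:

    /* Sketch: filtering when the origin is only known at commit time. */
    static void
    handle_record(XLogRecord *record)
    {
        if (!is_commit_record(record))          /* hypothetical test */
        {
            /* origin unknown yet: stash the record under its xid */
            buffer_append(record->xl_xid, record);
            return;
        }

        /* commit record carries the origin (hypothetical accessor) */
        if (commit_origin_id(record) != suppressed_origin)
            send_buffered_records(record->xl_xid);  /* stream only now */

        discard_buffer(record->xl_xid);
    }

With the origin on every record, none of that buffering is needed and records can be streamed out as soon as they are read.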
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 10:02 AM, Andres Freund wrote: > Were not the only ones here that are performing scope creep though... I think > about all people who have posted in the whole thread except maybe Tom and > Marko are guilty of doing so. > > I still think its rather sensible to focus on exactly duplicated schemas in a > very first version just because that leaves out some of the complexity while > paving the road for other nice things. Well, I guess what I want to know is: what does focusing on exactly duplicated schemas mean? If it means we'll disable DDL for tables when we turn on replication, that's basically the Slony approach: when you want to make a DDL change, you have to quiesce replication, do it, and then resume replication. I would possibly be OK with that approach. If it means that we'll hope that the schemas are duplicated and start spewing garbage data when they're not, then I'm not definitely not OK with that approach. If it means using event triggers to keep the catalogs synchronized, then I don't think I don't think that's adequately robust. The user could add more event triggers that run before or after the ones the replication system adds, and then you are back to garbage decoding (or crashes). They could also modify the catalogs directly, although it's possible we don't care quite as much about that case (but on the other hand people do sometimes need to do it to solve real problems). Although I am 100% OK with pairing back the initial feature set - indeed, I strongly recommend it - I think that robustness is not a feature which can be left out in v1 and added in later. All the robustness has to be designed in at the start, or we will never have it. On the whole, I think we're spending far too much time talking about code and far too little time talking about what the overall design should look like. We are having a discussion about whether or not MMR should be supported by sticking a 16-bit node ID into every WAL record without having first decided whether we should support MMR, whether that requires node IDs, whether they should be integers, whether those integers should be 16 bits in size, whether they should be present in WAL, and whether or not the record header is the right place to put them. There's a right order in which to resolve those questions, and this isn't it. More generally, I think there is a ton of complexity that we're probably overlooking here in focusing in on specific coding details. I think the most interesting comment made to date is Steve Singer's observation that very little of Slony is concerned with changeset extraction or apply. Now, on the flip side, all of these patches seem to be concerned with changeset extraction and apply. That suggests that we're missing some pretty significant pieces somewhere in this design. I think those pieces are things like error recovery, fault tolerance, user interface design, and control logic. Slony has spent years trying to get those things right. Whether or not they actually have gotten them right is of course an arguable point, but we're unlikely to do better by ignoring all of those issues and implementing whatever is most technically expedient. >> You've got four people objecting to this patch now, all of whom happen >> to be committers. Whether or not MMR goes into core, who knows, but >> it doesn't seem that this patch is going to fly. > I find that a bit too early to say. Sure it won't fly exactly as proposed, but > hell, who cares? 
What I want to get in is a solution to the specific problem > the patch targets. At least you have, not sure about others, accepted that the > problem needs a solution. > We do not agree yet how that solution looks should like but thats not exactly > surprising as we started discussing the problem only a good day ago. Oh, no argument with any of that. I strongly object to the idea of shoving this patch through as-is, but I don't object to solving the problem in some other, more appropriate way. I think that won't look much like this patch, though; it will be some new patch. > If people agree that your proposed way of just one flag bit is the way to go > we will have to live with that. But thats different from saying the whole > thing is dead. I think you've convinced me that a single flag-bit is not enough, but I don't think you've convinced anyone that it belongs in the record header. >> As I would rather see this project >> succeed, I recommend that you don't do that. Both you and Andres seem >> to believe that MMR is a reasonable first target to shoot at, but I >> don't think anyone else - including the Slony developers who have >> commented on this issue - endorses that position. > I don't think we get full MMR into 9.3. What I am proposing is that we build > in the few pieces that are required to implement MMR *ontop* of whats > hopefully in 9.3. > And I think thats a realistic goal. I can't quite follow that sentence, but my general sense is that, wh
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs wrote: > Kevin Grittner wrote: >>> Heikki Linnakangas wrote: >> >>> I don't like the idea of adding the origin id to the record >>> header. It's only required in some occasions, and on some record >>> types. >> >> Right. > > Wrong, as explained. The point is not wrong; you are simply not responding to what is being said. You have not explained why an origin ID is required when there is no replication, or if there is master/slave logical replication, or there are multiple masters with non-overlapping primary keys replicating to a single table in a consolidated database, or each master replicates to all other masters directly, or any of various other scenarios raised on this thread. You've only explained why it's necessary for certain configurations of multi-master replication where all rows in a table can be updated on any of the masters. I understand that this is the configuration you find most interesting, at least for initial implementation. That does not mean that the other situations don't exist as use cases or should be not be considered in the overall design. I don't think there is anyone here who would not love to see this effort succeed, all the way to multi-master replication in the configuration you are emphasizing. What is happening is that people are expressing concerns about parts of the design which they feel are problematic, and brainstorming about possible alternatives. As I'm sure you know, fixing a design problem at this stage in development is a lot less expensive than letting the problem slide and trying to deal with it later. > It isn't true that this is needed only for some configurations of > multi-master, per discussion. I didn't get that out of the discussion; I saw a lot of cases mentioned as not needing it to which you simply did not respond. > This is not transaction metadata, it is WAL record metadata > required for multi-master replication, see later point. > > We need to add information to every WAL record that is used as the > source for generating LCRs. If the origin ID of a transaction doesn't count as transaction metadata (i.e., data about the transaction), what does? It may be a metadata element about which you have special concerns, but it is transaction metadata. You don't plan on supporting individual WAL records within a transaction containing different values for origin ID, do you? If not, why is it something to store in every WAL record rather than once per transaction? That's not intended to be a rhetorical question. I think it's because you're still thinking of the WAL stream as *the medium* for logical replication data rather than *the source* of logical replication data. As long as the WAL stream is the medium, options are very constrained. You can code a very fast engine to handle a single type of configuration that way, and perhaps that should be a supported feature, but it's not a configuration I've needed yet. (Well, on reflection, if it had been available and easy to use, I can think of *one* time I *might* have used it for a pair of nodes.) It seems to me that you are so focused on this one use case that you are not considering how design choices which facilitate fast development of that use case paint us into a corner in terms of expanding to other use cases. >> I think removing origin ID from this patch and submitting a >> separate patch for a generalized transaction metadata system is >> the sensible way to go. > > We already have a very flexible WAL system for recording data of > interest to various resource managers. 
If you wish to annotate a > transaction, you can either generate a new kind of WAL record or > you can enhance a commit record. Right. Like many of us are suggesting should be done for origin ID. > XLOG_NOOP records can already be generated by your application if > you wish to inject additional metadata to the WAL stream. So no > changes are required for you to implement the generalised > transaction metadata scheme you say you require. I'm glad it's that easy. Are there SQL functions to for that yet? > Not sure how or why that relates to requirements for multi-master. That depends on whether you want to leave the door open to other logical replication than the one use case on which you are currently focused. I even consider some of those other cases multi-master, especially when multiple databases are replicating to a single table on another server. I'm not clear on your definition -- it seems to be rather more narrow. Maybe we need to define some terms somewhere to facilitate discussion. Is there a Wiki page where that would make sense? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
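For what it is worth, XLOG_NOOP is a real record type, but nothing currently exposes it at SQL level; injecting metadata that way means C code along these lines. This is an untested 9.2-era sketch, the helper name is made up, and the exact headers and XLogInsert() signature vary between releases:

    /* Sketch: attach arbitrary transaction metadata to the WAL stream
     * as an XLOG_NOOP record.  Replay ignores it; a logical decoder
     * could associate it with the surrounding transaction. */
    static XLogRecPtr
    log_transaction_metadata(const char *payload, int len)
    {
        XLogRecData rdata;

        rdata.data = (char *) payload;
        rdata.len = len;
        rdata.buffer = InvalidBuffer;   /* not tied to any data page */
        rdata.buffer_std = false;
        rdata.next = NULL;

        return XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);
    }

Exposing something like this as an SQL-callable function would be a separate, small patch; nothing of the sort exists today.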
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas wrote: > On 20.06.2012 11:17, Simon Riggs wrote: >> >> On 20 June 2012 15:45, Heikki Linnakangas >> wrote: >>> >>> On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas > > > And I'm worried > it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. >>> >>> >>> So, if the origin id is not sufficient for some conflict resolution >>> mechanisms, what extra information do you need for those, and where do >>> you >>> put it? >> >> >> As explained elsewhere, wal_level = logical (or similar) would be used >> to provide any additional logical information required. >> >> Update and Delete WAL records already need to be different in that >> mode, so additional info would be placed there, if there were any. >> >> In the case of reflexive updates you raised, a typical response in >> other DBMS would be to represent the query >> UPDATE SET counter = counter + 1 >> by sending just the "+1" part, not the current value of counter, as >> would be the case with the non-reflexive update >> UPDATE SET counter = 1 >> >> Handling such things in Postgres would require some subtlety, which >> would not be resolved in first release but is pretty certain not to >> require any changes to the WAL record header as a way of resolving it. >> Having already thought about it, I'd estimate that is a very long >> discussion and not relevant to the OT, but if you wish to have it >> here, I won't stop you. > > > Yeah, I'd like to hear briefly how you would handle that without any further > changes to the WAL record header. I already did: >> Update and Delete WAL records already need to be different in that >> mode, so additional info would be placed there, if there were any. The case you mentioned relates to UPDATEs only, so I would suggest that we add that information to a new form of update record only. That has nothing to do with the WAL record header. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
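To make the counter example concrete: under a hypothetical wal_level = logical, the extra information carried by a logical UPDATE record could distinguish "set this value" from "apply this delta", roughly as below. Nothing here reflects actual patch contents; it only illustrates why the record body, not the record header, is where such information belongs:

    /* Hypothetical per-column payload in a logical UPDATE record. */
    typedef enum
    {
        COL_SET_ABSOLUTE,   /* UPDATE ... SET counter = 1           -> "1"  */
        COL_APPLY_DELTA     /* UPDATE ... SET counter = counter + 1 -> "+1" */
    } ColChangeKind;

    typedef struct ColChange
    {
        AttrNumber    attno;
        ColChangeKind kind;
        Datum         value;    /* the absolute value, or the delta */
        bool          isnull;
    } ColChange;

Replaying COL_APPLY_DELTA against the replica's current value, instead of overwriting it, is what lets commutative updates from different origins merge without a conflict.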
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:42, Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: >> On 20 June 2012 21:19, Robert Haas wrote: >>> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. >>> >>> If you're saying that single-master logical replication isn't useful, >>> I disagree. Of course, having both single-master and multi-master >>> replication together is even more useful. >> >>> But I think getting even >>> single-master logical replication working well in a single release >>> cycle is going to be a job and a half. >> >> OK, so your estimate is 1.5 people to do that. And if we have more >> people, should they sit around doing nothing? > > Oh, give me a break. You're willfully missing my point. And to quote > Fred Brooks, nine women can't make a baby in one month. No, I'm not. The question is not how quickly can N people achieve a single thing, but how long will it take a few skilled people working on carefully selected tasks that have few dependencies between them to achieve something. We have significantly more preparation, development time and resources than any other project previously performed for PostgreSQL, that I am aware of. Stating that it is impossible to perform a task in a certain period of time without even considering those points is clearly rubbish. I've arrived at my thinking based upon detailed project planning of what was possible in the time. How exactly did you arrive at your conclusion? Why is yours right and mine wrong? >>> Thinking that we're going to >>> get MMR in one release is not realistic. >> >> If you block it, then the above becomes true, whether or not it starts true. > > If I had no rational basis for my objections, that would be true. > You've got four people objecting to this patch now, all of whom happen > to be committers. Whether or not MMR goes into core, who knows, but > it doesn't seem that this patch is going to fly. No, I have four people who had initial objections and who have not commented on the fact that the points made are regrettably incorrect. I don't expect everybody commenting on the design to have perfect knowledge of the whole design, so I expect people to make errors in their comments. I also expect people to take note of what has been said before making further objections or drawing conclusions. Since at least 3 of the people making such comments did not attend the full briefing meeting in Ottawa, I am not particularly surprised. However, I do expect people that didn't come to the meeting to recognise that they are likely to be missing information and to listen closely, as I listen to them. "When the facts change, I change my mind. What do you do, sir?" > My main point in bringing this up is that if you pick a project that > is too large, you will fail. As I would rather see this project > succeed, I recommend that you don't do that. Both you and Andres seem > to believe that MMR is a reasonable first target to shoot at, but I > don't think anyone else - including the Slony developers who have > commented on this issue - endorses that position. At PGCon, you were > talking about getting a new set of features into PG over the next 3-5 > years. Now, it seems like you want to compress that timeline to a > year. I don't think that's going to work. 
You also requested that > people tell you sooner when large patches were in danger of not making > the release. Now I'm doing that, VERY early, and you're apparently > angry about it. If the only satisfactory outcome of this conversation > is that everyone agrees with the design pattern you've already decided > on, then you haven't left yourself very much room to walk away > satisfied. Note that I have already myself given review feedback to Andres and that change has visibly occurred during this thread via public debate. Claiming that I only stick to what has already been decided is patently false, with me at least. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:54:43 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund wrote: > >> If you do that, then, yes, > >> everything that you need to disentangle various network topologies > >> must be present in WAL. But what I'm saying is: don't do it like > >> that. Generate the LCRs just ONCE, at the origin node, and then pass > >> them around the network, applying them at every node. Then, the > >> information that is needed in WAL is confined to one bit: the > >> knowledge of whether or not a particular transaction is local (and > >> thus LCRs should be generated) or non-local (and thus they shouldn't, > >> because the origin already generated them and thus we're just handing > >> them around to apply everywhere). > > > > Sure, you can do it that way, but I don't think its a good idea. If you > > do it my way you *guarantee* that when replaying changes from node B on > > node C you have replayed changes from A at least as far as B has. Thats > > a really nice property for MM. > > You *can* get same with your solution but it starts to get complicated > > rather fast. While my/our proposed solution is trivial to implement. > > That's an interesting point. I agree that's a useful property, and > might be a reason not to just use a single-bit flag, but I still think > you'd be better off handling that requirement via some other method, > like logging the node ID once per transaction or something. That lets > you have as much metadata as you end up needing, which is a lot more > flexible than a 16-bit field, as Kevin, Heikki, and Tom have also > said. If it comes down to that I can definitely live with that. I still think making the filtering trivial so it can be done without any logic on a low level is a very desirable property but if not, so be it. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:42:39 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: > > On 20 June 2012 21:19, Robert Haas wrote: > >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > >>> The idea that logical rep is some kind of useful end goal in itself is > >>> slightly misleading. If the thought is to block multi-master > >>> completely on that basis, that would be a shame. Logical rep is the > >>> mechanism for implementing multi-master. > >> > >> If you're saying that single-master logical replication isn't useful, > >> I disagree. Of course, having both single-master and multi-master > >> replication together is even more useful. > >> > >> But I think getting even > >> single-master logical replication working well in a single release > >> cycle is going to be a job and a half. > >> Thinking that we're going to > >> get MMR in one release is not realistic. > > If you block it, then the above becomes true, whether or not it starts > > true. > My main point in bringing this up is that if you pick a project that > is too large, you will fail. Were not the only ones here that are performing scope creep though... I think about all people who have posted in the whole thread except maybe Tom and Marko are guilty of doing so. I still think its rather sensible to focus on exactly duplicated schemas in a very first version just because that leaves out some of the complexity while paving the road for other nice things. > You've got four people objecting to this patch now, all of whom happen > to be committers. Whether or not MMR goes into core, who knows, but > it doesn't seem that this patch is going to fly. I find that a bit too early to say. Sure it won't fly exactly as proposed, but hell, who cares? What I want to get in is a solution to the specific problem the patch targets. At least you have, not sure about others, accepted that the problem needs a solution. We do not agree yet how that solution looks should like but thats not exactly surprising as we started discussing the problem only a good day ago. If people agree that your proposed way of just one flag bit is the way to go we will have to live with that. But thats different from saying the whole thing is dead. > As I would rather see this project > succeed, I recommend that you don't do that. Both you and Andres seem > to believe that MMR is a reasonable first target to shoot at, but I > don't think anyone else - including the Slony developers who have > commented on this issue - endorses that position. I don't think we get full MMR into 9.3. What I am proposing is that we build in the few pieces that are required to implement MMR *ontop* of whats hopefully in 9.3. And I think thats a realistic goal. > At PGCon, you were > talking about getting a new set of features into PG over the next 3-5 > years. Now, it seems like you want to compress that timeline to a > year. Well, I obviously would like to be everything be done in a release, but I also would like to go hiking for a year, have a restored sailing boat and some more. That doesn't make it reality... To make it absolutely clear: I definitely don't think its realistic to have everything in 9.3. And I don't think Simon does so either. What I want is to have the basic building blocks in 9.3. The difference probably just is whats determined as a building block. 
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:43 AM, Andres Freund wrote: >> If you do that, then, yes, >> everything that you need to disentangle various network topologies >> must be present in WAL. But what I'm saying is: don't do it like >> that. Generate the LCRs just ONCE, at the origin node, and then pass >> them around the network, applying them at every node. Then, the >> information that is needed in WAL is confined to one bit: the >> knowledge of whether or not a particular transaction is local (and >> thus LCRs should be generated) or non-local (and thus they shouldn't, >> because the origin already generated them and thus we're just handing >> them around to apply everywhere). > Sure, you can do it that way, but I don't think its a good idea. If you do it > my way you *guarantee* that when replaying changes from node B on node C you > have replayed changes from A at least as far as B has. Thats a really nice > property for MM. > You *can* get same with your solution but it starts to get complicated rather > fast. While my/our proposed solution is trivial to implement. That's an interesting point. I agree that's a useful property, and might be a reason not to just use a single-bit flag, but I still think you'd be better off handling that requirement via some other method, like logging the node ID once per transaction or something. That lets you have as much metadata as you end up needing, which is a lot more flexible than a 16-bit field, as Kevin, Heikki, and Tom have also said. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 06/19/2012 01:47 AM, Christopher Browne wrote: > That numbering scheme gets pretty anti-intuitive fairly quickly, > from whence we took the approach of having a couple digits > indicating data centre followed by a digit indicating which node in > that data centre. > > If that all sounds incoherent, well, the more nodes you have > around, the more difficult it becomes to make sure you *do* have a > coherent picture of your cluster. > > I recall the Slony-II project having a notion of attaching a > permanent UUID-based node ID to each node. As long as there is > somewhere decent to find a symbolically significant node "name," I > like the idea of the ID *not* being in a tiny range, and being > UUID/OID-like... Just as a sidenote, MySQL's new global transaction ids use a UUID for the serverid bit of it. [1] -- Hannu Valtonen [1] http://d2-systems.blogspot.fi/2012/04/global-transaction-identifiers-are-in.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
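The MySQL scheme referenced above pairs a permanent per-server UUID with a monotonically increasing sequence number. A comparable shape for identifying a change globally, independent of any small integer node id, might look like this (purely illustrative, not a proposal from the thread):

    /* Hypothetical globally unique change identifier, GTID-like. */
    typedef struct GlobalChangeId
    {
        unsigned char node_uuid[16];  /* permanent identity of the origin node */
        XLogRecPtr    origin_lsn;     /* position in that node's own WAL */
    } GlobalChangeId;

The obvious cost is size: 24 bytes per change is far too large for a record header, which is one reason the thread keeps returning to a small local id plus a separate mapping to whatever permanent identifier a cluster uses.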
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 20:37, Kevin Grittner wrote: >> Heikki Linnakangas wrote: > >> I don't like the idea of adding the origin id to the record header. >> It's only required in some occasions, and on some record types. > > Right. Wrong, as explained. >> And I'm worried it might not even be enough in more complicated >> scenarios. >> >> Perhaps we need a more generic WAL record annotation system, where >> a plugin can tack arbitrary information to WAL records. The extra >> information could be stored in the WAL record after the rmgr >> payload, similar to how backup blocks are stored. WAL replay could >> just ignore the annotations, but a replication system could use it >> to store the origin id or whatever extra information it needs. > > Not only would that handle absolute versus relative updates and > origin id, but application frameworks could take advantage of such a > system for passing transaction metadata. I've held back on one > concern so far that I'll bring up now because this suggestion would > address it nicely. > > Our current trigger-driven logical replication includes a summary > which includes transaction run time, commit time, the transaction > type identifier, the source code line from which that transaction was > invoked, the user ID with which the user connected to the application > (which isn't the same as the database login), etc. Being able to > "decorate" a database transaction with arbitrary (from the DBMS POV) > metadata would be very valuable. In fact, our shop can't maintain > the current level of capabilities without *some* way to associate > such information with a transaction. > I think that using up the only unused space in the fixed header to > capture one piece of the transaction metadata needed for logical > replication, and that only in some configurations, is short-sighted. > If we solve the general problem of transaction metadata, this one > specific case will fall out of that. The proposal now includes flag bits that would allow the addition of a variable length header, should that ever become necessary. So the unused space in the fixed header is not being "used up" as you say. In any case, the fixed header still has 4 wasted bytes on 64bit systems even after the patch is applied. So this claim of short sightedness is just plain wrong. It isn't true that this is needed only for some configurations of multi-master, per discussion. This is not transaction metadata, it is WAL record metadata required for multi-master replication, see later point. We need to add information to every WAL record that is used as the source for generating LCRs. It is also possible to add this to HEAP and HEAP2 records, but doing that *will* bloat the WAL stream, whereas using the *currently wasted* bytes on a WAL record header does *not* bloat the WAL stream. > I think removing origin ID from this patch and submitting a separate > patch for a generalized transaction metadata system is the sensible > way to go. We already have a very flexible WAL system for recording data of interest to various resource managers. If you wish to annotate a transaction, you can either generate a new kind of WAL record or you can enhance a commit record. There are already unused flag bits on commit records for just such a purpose. XLOG_NOOP records can already be generated by your application if you wish to inject additional metadata to the WAL stream. So no changes are required for you to implement the generalised transaction metadata scheme you say you require. 
Not sure how or why that relates to requirements for multi-master. Please note that I've suggested review changes to Andres' work myself. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:02:28 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund wrote: > > One bit is fine if you have only very simple replication topologies. Once > > you think about globally distributed databases its a bit different. You > > describe some of that below, but just to reiterate: > > Imagine having 6 nodes, 3 on one of two continents (ABC in north america, > > DEF in europe). You may only want to have full intercontinental > > interconnect between two of those (say A and D). If you only have one > > bit to represent the origin thats not going to work because you won't be > > able discern the changes from BC on A from the changes from those > > originating on DEF. > > I don't see the problem. A certainly knows via which link the LCRs > arrived. > So: change happens on A. A sends the change to B, C, and D. B and C > apply the change. One bit is enough to keep them from regenerating > new LCRs that get sent back to A. So they're fine. D also receives > the changes (from A) and applies them, but it also does not need to > regenerate LCRs. Instead, it can take the LCRs that it has already > got (from A) and send those to E and F. > Or: change happens on B. B sends the changes to A. Since A knows the > network topology, it sends the changes to C and D. D sends them to E > and F. Nobody except B needs to *generate* LCRs. All any other node > needs to do is suppress *redundant* LCR generation. > > > Another topology which is interesting is circular replications (i.e. > > changes get shipped A->B, B->C, C->A) which is a sensible topology if > > you only have a low change rate and a relatively high number of nodes > > because you don't need the full combinatorial amount of connections. > > I think this one is OK too. You just generate LCRs on the origin node > and then pass them around the ring at every step. When the next hop > would be the origin node then you're done. > > I think you may be imagining that A generates LCRs and sends them to > B. B applies them, and then from the WAL just generated, it produces > new LCRs which then get sent to C. Yes, thats what I am proposing. > If you do that, then, yes, > everything that you need to disentangle various network topologies > must be present in WAL. But what I'm saying is: don't do it like > that. Generate the LCRs just ONCE, at the origin node, and then pass > them around the network, applying them at every node. Then, the > information that is needed in WAL is confined to one bit: the > knowledge of whether or not a particular transaction is local (and > thus LCRs should be generated) or non-local (and thus they shouldn't, > because the origin already generated them and thus we're just handing > them around to apply everywhere). Sure, you can do it that way, but I don't think its a good idea. If you do it my way you *guarantee* that when replaying changes from node B on node C you have replayed changes from A at least as far as B has. Thats a really nice property for MM. You *can* get same with your solution but it starts to get complicated rather fast. While my/our proposed solution is trivial to implement. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 9:25 AM, Simon Riggs wrote: > On 20 June 2012 21:19, Robert Haas wrote: >> On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: >>> The idea that logical rep is some kind of useful end goal in itself is >>> slightly misleading. If the thought is to block multi-master >>> completely on that basis, that would be a shame. Logical rep is the >>> mechanism for implementing multi-master. >> >> If you're saying that single-master logical replication isn't useful, >> I disagree. Of course, having both single-master and multi-master >> replication together is even more useful. > >> But I think getting even >> single-master logical replication working well in a single release >> cycle is going to be a job and a half. > > OK, so your estimate is 1.5 people to do that. And if we have more > people, should they sit around doing nothing? Oh, give me a break. You're willfully missing my point. And to quote Fred Brooks, nine women can't make a baby in one month. >> Thinking that we're going to >> get MMR in one release is not realistic. > > If you block it, then the above becomes true, whether or not it starts true. If I had no rational basis for my objections, that would be true. You've got four people objecting to this patch now, all of whom happen to be committers. Whether or not MMR goes into core, who knows, but it doesn't seem that this patch is going to fly. My main point in bringing this up is that if you pick a project that is too large, you will fail. As I would rather see this project succeed, I recommend that you don't do that. Both you and Andres seem to believe that MMR is a reasonable first target to shoot at, but I don't think anyone else - including the Slony developers who have commented on this issue - endorses that position. At PGCon, you were talking about getting a new set of features into PG over the next 3-5 years. Now, it seems like you want to compress that timeline to a year. I don't think that's going to work. You also requested that people tell you sooner when large patches were in danger of not making the release. Now I'm doing that, VERY early, and you're apparently angry about it. If the only satisfactory outcome of this conversation is that everyone agrees with the design pattern you've already decided on, then you haven't left yourself very much room to walk away satisfied. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 21:19, Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: >> The idea that logical rep is some kind of useful end goal in itself is >> slightly misleading. If the thought is to block multi-master >> completely on that basis, that would be a shame. Logical rep is the >> mechanism for implementing multi-master. > > If you're saying that single-master logical replication isn't useful, > I disagree. Of course, having both single-master and multi-master > replication together is even more useful. > But I think getting even > single-master logical replication working well in a single release > cycle is going to be a job and a half. OK, so your estimate is 1.5 people to do that. And if we have more people, should they sit around doing nothing? > Thinking that we're going to > get MMR in one release is not realistic. If you block it, then the above becomes true, whether or not it starts true. You may not want MMR, but others do. I see no reason to prevent people from having it, which is what you suggest. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 03:19:55 PM Robert Haas wrote: > On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > > The idea that logical rep is some kind of useful end goal in itself is > > slightly misleading. If the thought is to block multi-master > > completely on that basis, that would be a shame. Logical rep is the > > mechanism for implementing multi-master. > > If you're saying that single-master logical replication isn't useful, > I disagree. Of course, having both single-master and multi-master > replication together is even more useful. But I think getting even > single-master logical replication working well in a single release > cycle is going to be a job and a half. Thinking that we're going to > get MMR in one release is not realistic. The only way to make it > realistic is to put MMR ahead of every other goal that people have for > logical replication, including robustness and stability. It's > entirely premature to be designing features for MMR when we don't even > have the design for SMR nailed down yet. And that's even assuming we > EVER want MMR in core, which has not even really been argued, let > alone agreed. I agree it has not been agreed upon, but I would certainly consider submitting a prototype that implements it to be an argument for doing it ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:47 AM, Simon Riggs wrote: > The idea that logical rep is some kind of useful end goal in itself is > slightly misleading. If the thought is to block multi-master > completely on that basis, that would be a shame. Logical rep is the > mechanism for implementing multi-master. If you're saying that single-master logical replication isn't useful, I disagree. Of course, having both single-master and multi-master replication together is even more useful. But I think getting even single-master logical replication working well in a single release cycle is going to be a job and a half. Thinking that we're going to get MMR in one release is not realistic. The only way to make it realistic is to put MMR ahead of every other goal that people have for logical replication, including robustness and stability. It's entirely premature to be designing features for MMR when we don't even have the design for SMR nailed down yet. And that's even assuming we EVER want MMR in core, which has not even really been argued, let alone agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wed, Jun 20, 2012 at 5:15 AM, Andres Freund wrote: > As I said before, I definitely agree that we want to have a separate transport > format once we have decoding nailed down. We still need to ship wal around if > the decoding happens in a different instance, but *after* that it can be > shipped in something more convenient/appropriate. Right, OK, we agree on this then. > One bit is fine if you have only very simple replication topologies. Once you > think about globally distributed databases its a bit different. You describe > some of that below, but just to reiterate: > Imagine having 6 nodes, 3 on one of two continents (ABC in north america, DEF > in europe). You may only want to have full intercontinental interconnect > between two of those (say A and D). If you only have one bit to represent the > origin thats not going to work because you won't be able discern the changes > from BC on A from the changes from those originating on DEF. I don't see the problem. A certainly knows via which link the LCRs arrived. So: change happens on A. A sends the change to B, C, and D. B and C apply the change. One bit is enough to keep them from regenerating new LCRs that get sent back to A. So they're fine. D also receives the changes (from A) and applies them, but it also does not need to regenerate LCRs. Instead, it can take the LCRs that it has already got (from A) and send those to E and F. Or: change happens on B. B sends the changes to A. Since A knows the network topology, it sends the changes to C and D. D sends them to E and F. Nobody except B needs to *generate* LCRs. All any other node needs to do is suppress *redundant* LCR generation. > Another topology which is interesting is circular replications (i.e. changes > get shipped A->B, B->C, C->A) which is a sensible topology if you only have a > low change rate and a relatively high number of nodes because you don't need > the full combinatorial amount of connections. I think this one is OK too. You just generate LCRs on the origin node and then pass them around the ring at every step. When the next hop would be the origin node then you're done. I think you may be imagining that A generates LCRs and sends them to B. B applies them, and then from the WAL just generated, it produces new LCRs which then get sent to C. If you do that, then, yes, everything that you need to disentangle various network topologies must be present in WAL. But what I'm saying is: don't do it like that. Generate the LCRs just ONCE, at the origin node, and then pass them around the network, applying them at every node. Then, the information that is needed in WAL is confined to one bit: the knowledge of whether or not a particular transaction is local (and thus LCRs should be generated) or non-local (and thus they shouldn't, because the origin already generated them and thus we're just handing them around to apply everywhere). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
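A rough sketch of the one-bit scheme described above, with made-up types; this is not the patch, just an illustration of the intended control flow. Applying remote LCRs flags the transaction as replication work, the LCR generator skips flagged transactions, and the node forwards the LCRs it already received instead of regenerating them from its own WAL.

/* Sketch of the one-bit scheme, with invented types: a node regenerates
 * LCRs only for transactions not flagged as replication apply, and
 * forwards already-received LCRs onward unchanged. */
#include <stdbool.h>
#include <stdio.h>

typedef struct Lcr { char origin; const char *change; } Lcr;

/* Apply an LCR received from a peer; the apply txn is flagged in WAL. */
static void apply_remote(Lcr lcr, bool *wal_is_replication_txn)
{
    *wal_is_replication_txn = true;       /* the single bit in WAL */
    printf("apply change \"%s\" (origin %c)\n", lcr.change, lcr.origin);
}

/* Decide whether local WAL should be turned into new LCRs. */
static bool should_generate_lcrs(bool wal_is_replication_txn)
{
    return !wal_is_replication_txn;       /* only local work produces LCRs */
}

int main(void)
{
    bool replication_txn = false;
    Lcr from_a = { 'A', "UPDATE t SET x = 1" };

    apply_remote(from_a, &replication_txn);
    if (!should_generate_lcrs(replication_txn))
        printf("do not regenerate; forward the LCR from A as-is to the next node\n");
    return 0;
}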
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
> Heikki Linnakangas wrote: > I don't like the idea of adding the origin id to the record header. > It's only required in some occasions, and on some record types. Right. > And I'm worried it might not even be enough in more complicated > scenarios. > > Perhaps we need a more generic WAL record annotation system, where > a plugin can tack arbitrary information to WAL records. The extra > information could be stored in the WAL record after the rmgr > payload, similar to how backup blocks are stored. WAL replay could > just ignore the annotations, but a replication system could use it > to store the origin id or whatever extra information it needs. Not only would that handle absolute versus relative updates and origin id, but application frameworks could take advantage of such a system for passing transaction metadata. I've held back on one concern so far that I'll bring up now because this suggestion would address it nicely. Our current trigger-driven logical replication includes a summary which includes transaction run time, commit time, the transaction type identifier, the source code line from which that transaction was invoked, the user ID with which the user connected to the application (which isn't the same as the database login), etc. Being able to "decorate" a database transaction with arbitrary (from the DBMS POV) metadata would be very valuable. In fact, our shop can't maintain the current level of capabilities without *some* way to associate such information with a transaction. I think that using up the only unused space in the fixed header to capture one piece of the transaction metadata needed for logical replication, and that only in some configurations, is short-sighted. If we solve the general problem of transaction metadata, this one specific case will fall out of that. I think removing origin ID from this patch and submitting a separate patch for a generalized transaction metadata system is the sensible way to go. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
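To make the annotation idea quoted above a bit more tangible, here is a purely hypothetical sketch in C. None of this matches the real XLogRecord definition; it only illustrates the notion of optional tagged data after the rmgr payload, which plain replay can skip and which a replication system or application framework could interpret.

/* Thought experiment: the fixed header stays unchanged, and optional
 * annotations (origin id, application metadata, ...) follow the payload,
 * each with a small tag/length prefix so replay can skip them.
 * Invented layout, not PostgreSQL code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct AnnotationHeader
{
    uint8_t  tag;      /* e.g. 1 = origin node id, 2 = application metadata */
    uint16_t length;   /* length of the annotation body in bytes */
} AnnotationHeader;

/* Conceptual layout of one record:
 *   [fixed record header][rmgr payload][annotation][annotation]...
 * Plain WAL replay reads header + payload and skips the rest; a logical
 * replication plugin walks the annotations it understands. */

static void skip_annotations(const uint8_t *p, const uint8_t *end)
{
    while (p + sizeof(AnnotationHeader) <= end)
    {
        AnnotationHeader ah;

        memcpy(&ah, p, sizeof(ah));
        printf("skipping annotation tag %u (%u bytes)\n",
               (unsigned) ah.tag, (unsigned) ah.length);
        p += sizeof(AnnotationHeader) + ah.length;
    }
}

int main(void)
{
    /* one fake annotation: tag 1 (origin id) with a two-byte body */
    AnnotationHeader ah = { 1, 2 };
    uint8_t buf[sizeof(AnnotationHeader) + 2] = { 0 };

    memcpy(buf, &ah, sizeof(ah));
    skip_annotations(buf, buf + sizeof(buf));
    return 0;
}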
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:44, Heikki Linnakangas wrote: > On 20.06.2012 11:34, Simon Riggs wrote: >> >> On 20 June 2012 16:23, Heikki Linnakangas >> wrote: >> >>> It's only needed for multi-master replication, where the same table can >>> be >>> updated from multiple nodes. Just leave that out for now. There's plenty >>> of >>> functionality and issues left even without that. >> >> >> Huh? Multi-master replication is what is being built here and many >> people want that. > > > Sure, but presumably you're going to implement master-slave first, and build > multi-master on top of that. What I'm saying is that we can leave out the > origin-id for now, since we don't have agreement on it, and revisit it after > master-slave replication is working. I am comfortable with the idea of deferring applying the patch, but I don't see any need to defer agreeing the patch is OK, so it can be applied easily later. It does beg the question of when exactly we would defer it to though. When would that be? If you have a reason for disagreement, please raise it now, having seen explanations/comments on various concerns. Of course, people have made initial objections, which is fine but its not reasonable to assume that such complaints continue to exist after. Perhaps there are other thoughts? The idea that logical rep is some kind of useful end goal in itself is slightly misleading. If the thought is to block multi-master completely on that basis, that would be a shame. Logical rep is the mechanism for implementing multi-master. Deferring this could easily end up with a huge patch in last CF, and then it will be rejected/deferred. Patch submission here is following the requested process - as early as possible, production ready, small meaningful patches that build towards a final goal. This is especially true for format changes, which is why this patch is here now. Doing it differently just makes patch wrangling and review more difficult, which reduces overall quality and slows down development. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 02:35:59 AM Robert Haas wrote: > On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote: > > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: > >>> Btw, what do you mean with "conflating" the stream? I don't really see > >>> that being proposed. > >> > >> It seems to me that you are intent on using the WAL stream as the > >> logical change stream. I think that's a bad design. Instead, you > >> should extract changes from WAL and then ship them around in a format > >> that is specific to logical replication. > > > > Yeah, that seems worth elaborating on. > > > > What has been said several times is that it's pretty necessary to > > capture the logical changes into WAL. That seems pretty needful, in > > order that the replication data gets fsync()ed avidly, and so that we > > don't add in the race condition of needing to fsync() something *else* > > almost exactly as avidly as is the case for WAL today.. > > Check. > > > But it's undesirable to pull *all* the bulk of contents of WAL around > > if it's only part of the data that is going to get applied. On a > > "physical streaming" replica, any logical data that gets captured will > > be useless. And on a "logical replica," they "physical" bits of WAL > > will be useless. > > > > What I *want* you to mean is that there would be: > > a) WAL readers that pull the "physical bits", and > > b) WAL readers that just pull "logical bits." > > > > I expect it would be fine to have a tool that pulls LCRs out of WAL to > > prepare that to be sent to remote locations. Is that what you have in > > mind? > Yes. I think it should be possible to generate LCRs from WAL, but I > think that the on-the-wire format for LCRs should be different from > the WAL format. Trying to use the same format for both things seems > like an unpleasant straightjacket. This discussion illustrates why: > we're talking about consuming scarce bit-space in WAL records for a > feature that only a tiny minority of users will use, and it's still > not really enough bit space. That stinks. If LCR transmission is a > separate protocol, this problem can be engineered away at a higher > level. As I said before, I definitely agree that we want to have a separate transport format once we have decoding nailed down. We still need to ship wal around if the decoding happens in a different instance, but *after* that it can be shipped in something more convenient/appropriate. > Suppose we have three servers, A, B, and C, that are doing > multi-master replication in a loop. A sends LCRs to B, B sends them > to C, and C sends them back to A. Obviously, we need to make sure > that each server applies each set of changes just once, but it > suffices to have enough information in WAL to distinguish between > replication transactions and non-replication transactions - that is, > one bit. So suppose a change is made on server A. A generates LCRs > from WAL, and tags each LCR with node_id = A. It then sends those > LCRs to B. B applies them, flagging the apply transaction in WAL as a > replication transaction, AND ALSO sends the LCRs to C. The LCR > generator on B sees the WAL from apply, but because it's flagged as a > replication transaction, it does not generate LCRs. So C receives > LCRs from B just once, without any need for the node_id to to be known > in WAL. C can now also apply those LCRs (again flagging the apply > transaction as replication) and it can also skip sending them to A, > because it seems that they originated at A. 
One bit is fine if you have only very simple replication topologies. Once you think about globally distributed databases it's a bit different. You describe some of that below, but just to reiterate: Imagine having 6 nodes, 3 on each of two continents (ABC in North America, DEF in Europe). You may only want to have full intercontinental interconnect between two of those (say A and D). If you only have one bit to represent the origin that's not going to work because you won't be able to discern the changes from BC on A from the changes originating on DEF. Another topology which is interesting is circular replication (i.e. changes get shipped A->B, B->C, C->A), which is a sensible topology if you only have a low change rate and a relatively high number of nodes because you don't need the full combinatorial amount of connections. The origin_ids still only need to be meaningful in the local context, though. As described before, in the communication between the different nodes you can simply replace the 16-bit node id with some fancy UUID or such, and do the reverse when replaying LCRs. > Now suppose we have a more complex topology. Suppose we have a > cluster of four servers A .. D which, for improved tolerance against > network outages, are all connected pairwise. Normally all the links > are up, so each server sends all the LCRs it generates directly to all > other servers. But how do we prevent cycles?
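To make the last point concrete, here is a small, self-contained C sketch (not part of the patch; all names are invented) of how a node could keep its compact 16-bit origin_id purely local, translating it to a globally unique identifier such as a UUID when LCRs leave the node and back again when LCRs arrive:

/* Per-node mapping between the local 16-bit origin id and a global
 * identifier used on the wire.  Toy code, no overflow or persistence
 * handling. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_LOCAL_ORIGINS 16

typedef struct OriginMap
{
    uint16_t local_id;
    char     global_uuid[37];
} OriginMap;

static OriginMap origin_map[MAX_LOCAL_ORIGINS];
static int       n_origins;

static const char *to_global(uint16_t local_id)
{
    for (int i = 0; i < n_origins; i++)
        if (origin_map[i].local_id == local_id)
            return origin_map[i].global_uuid;
    return NULL;
}

static uint16_t to_local(const char *uuid)
{
    for (int i = 0; i < n_origins; i++)
        if (strcmp(origin_map[i].global_uuid, uuid) == 0)
            return origin_map[i].local_id;

    /* first time we hear from this node: assign the next free local id */
    OriginMap *m = &origin_map[n_origins];
    m->local_id = (uint16_t) (n_origins + 1);
    strncpy(m->global_uuid, uuid, 36);
    m->global_uuid[36] = '\0';
    n_origins++;
    return m->local_id;
}

int main(void)
{
    uint16_t d = to_local("6f1b7a3c-0000-4000-8000-00000000000d");  /* node D */

    printf("node D is local origin id %u\n", (unsigned) d);
    printf("shipping its changes out again as %s\n", to_global(d));
    return 0;
}

In a real system the mapping table would of course have to be persistent; the point is only that the wire format and the WAL format need not use the same identifier width.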
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:34, Simon Riggs wrote: On 20 June 2012 16:23, Heikki Linnakangas wrote: It's only needed for multi-master replication, where the same table can be updated from multiple nodes. Just leave that out for now. There's plenty of functionality and issues left even without that. Huh? Multi-master replication is what is being built here and many people want that. Sure, but presumably you're going to implement master-slave first, and build multi-master on top of that. What I'm saying is that we can leave out the origin-id for now, since we don't have agreement on it, and revisit it after master-slave replication is working. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 16:23, Heikki Linnakangas wrote: > It's only needed for multi-master replication, where the same table can be > updated from multiple nodes. Just leave that out for now. There's plenty of > functionality and issues left even without that. Huh? Multi-master replication is what is being built here and many people want that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 11:17, Simon Riggs wrote: On 20 June 2012 15:45, Heikki Linnakangas wrote: On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas And I'm worried it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it? As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. In the case of reflexive updates you raised, a typical response in other DBMS would be to represent the query UPDATE SET counter = counter + 1 by sending just the "+1" part, not the current value of counter, as would be the case with the non-reflexive update UPDATE SET counter = 1 Handling such things in Postgres would require some subtlety, which would not be resolved in first release but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you. Yeah, I'd like to hear briefly how you would handle that without any further changes to the WAL record header. Additional information required by logical information will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. If that's all we can discuss here is, and all other options are off the table, then I'll have to just outright object to this patch. Let's implement what we can without the origin id, and revisit this later. As explained, we can do nothing without the origin id. It is not optional or avoidable in the way you've described. It's only needed for multi-master replication, where the same table can be updated from multiple nodes. Just leave that out for now. There's plenty of functionality and issues left even without that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 15:45, Heikki Linnakangas wrote: > On 20.06.2012 10:32, Simon Riggs wrote: >> >> On 20 June 2012 14:40, Heikki Linnakangas >>> >>> And I'm worried >>> >>> it might not even be enough in more complicated scenarios. >> >> >> It is not the only required conflict mechanism, and has never been >> claimed to be so. It is simply one piece of information needed, at >> various times. > > > So, if the origin id is not sufficient for some conflict resolution > mechanisms, what extra information do you need for those, and where do you > put it? As explained elsewhere, wal_level = logical (or similar) would be used to provide any additional logical information required. Update and Delete WAL records already need to be different in that mode, so additional info would be placed there, if there were any. In the case of reflexive updates you raised, a typical response in other DBMS would be to represent the query UPDATE SET counter = counter + 1 by sending just the "+1" part, not the current value of counter, as would be the case with the non-reflexive update UPDATE SET counter = 1 Handling such things in Postgres would require some subtlety, which would not be resolved in first release but is pretty certain not to require any changes to the WAL record header as a way of resolving it. Having already thought about it, I'd estimate that is a very long discussion and not relevant to the OT, but if you wish to have it here, I won't stop you. >>> Perhaps we need a more generic WAL record annotation system, where a >>> plugin >>> can tack arbitrary information to WAL records. The extra information >>> could >>> be stored in the WAL record after the rmgr payload, similar to how backup >>> blocks are stored. WAL replay could just ignore the annotations, but a >>> replication system could use it to store the origin id or whatever extra >>> information it needs. >> >> >> Additional information required by logical information will be handled >> by a new wal_level. >> >> The discussion here is about adding origin_node_id *only*, which needs >> to be added on each WAL record. > > > If that's all we can discuss here is, and all other options are off the > table, then I'll have to just outright object to this patch. Let's implement > what we can without the origin id, and revisit this later. As explained, we can do nothing without the origin id. It is not optional or avoidable in the way you've described. We have the choice to add the required information as a static or as a variable length addition to the WAL record header. Since there is no additional requirement for expansion of the header at this point and no measurable harm in doing so, I suggest we avoid the complexity and lack of robustness that a variable length header would cause. Of course, we can go there later if needed, but there is no current need. If a variable length header had been suggested, it is certain somebody would say that was overcooked and just do it as static. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
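For what it's worth, the "+1 versus =1" distinction can be modelled very simply; the following C fragment is only meant to illustrate why a delta-style change accumulates under conflict while an absolute assignment should be newest-update-wins. The types are invented and nothing here is a proposed WAL or LCR format.

/* Delta changes accumulate on conflict; absolute changes overwrite. */
#include <stdio.h>

typedef enum { CHANGE_DELTA, CHANGE_ABSOLUTE } ChangeKind;

typedef struct CounterChange
{
    ChangeKind kind;
    long       value;        /* the "+1" or the "10" */
} CounterChange;

static long apply_change(long current, CounterChange c)
{
    return (c.kind == CHANGE_DELTA) ? current + c.value : c.value;
}

int main(void)
{
    long counter = 5;

    /* Two nodes concurrently run "UPDATE ... SET counter = counter + 1":
     * shipping deltas lets both increments survive. */
    counter = apply_change(counter, (CounterChange){ CHANGE_DELTA, +1 });
    counter = apply_change(counter, (CounterChange){ CHANGE_DELTA, +1 });
    printf("after two accumulated increments: %ld\n", counter);   /* 7 */

    /* Taking inventory ("SET counter = 10") must not accumulate. */
    counter = apply_change(counter, (CounterChange){ CHANGE_ABSOLUTE, 10 });
    printf("after absolute reset: %ld\n", counter);                /* 10 */
    return 0;
}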
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 10:32, Simon Riggs wrote: On 20 June 2012 14:40, Heikki Linnakangas And I'm worried it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. It is simply one piece of information needed, at various times. So, if the origin id is not sufficient for some conflict resolution mechanisms, what extra information do you need for those, and where do you put it? Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. Additional information required by logical information will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. If that's all we can discuss here is, and all other options are off the table, then I'll have to just outright object to this patch. Let's implement what we can without the origin id, and revisit this later. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 14:40, Heikki Linnakangas wrote: > The reason we need an origin id in this scenario is that otherwise this will > happen: > > 1. A row is updated on node A > 2. Node B receives the WAL record from A, and updates the corresponding row > in B. This generates a new WAL record. > 3. Node A receives the WAL record from B, and updates the rows again. This > again generates a new WAL record, which is replicated to A, and you loop > indefinitely. > > If each WAL record carries an origin id, node A can use it to refrain from > applying the WAL record it receives from B, which breaks the loop. > > However, note that in this simple scenario, if the logical log replay / > conflict resolution is smart enough to recognize that the row has already > been updated, because the old and the new rows are identical, the loop is > broken at step 3 even without the origin id. That works for the > newest-update-wins and similar strategies. So the origin id is not > absolutely necessary in this case. Including the origin id in the WAL allows us to filter out WAL records when we generate LCRs, so we can completely avoid step 3, including all of the CPU, disk and network overhead that implies. Simply put, we know the change came from A, so no need to send the change to A again. If we do allow step 3 to exist, we still need to send the origin id. This is because there is no apriori way to know the origin id is not required, such as the case where we have concurrent updates which is effectively a race condition between actions on separate nodes. Not sending the origin id because it is not required in some cases is equivalent to saying we can skip locking because a race condition does not happen in all cases. Making a case that the race condition is rare is still not a case for skipping locking. Same here: we need the information to avoid making errors in the general case. > Another interesting scenario is that you maintain a global counter, like in > an inventory system, and conflicts are resolved by accumulating the updates. > For example, if you do "UPDATE SET counter = counter + 1" simultaneously on > two nodes, the result is that the counter is incremented by two. The > avoid-update-if-already-identical optimization doesn't help in this case, > the origin id is necessary. > > Now, let's take the inventory system example further. There are actually two > ways to update a counter. One is when an item is checked in or out of the > warehouse, ie. "UPDATE counter = counter + 1". Those updates should > accumulate. But another operation resets the counter to a specific value, > "UPDATE counter = 10", like when taking an inventory. That should not > accumulate with other changes, but should be newest-update-wins. The origin > id is not enough for that, because by looking at the WAL record and the > origin id, you don't know which type of an update it was. Yes, of course. Conflict handling in the general case requires much additional work. This thread is one minor change of many related changes. The patches are being submitted in smaller chunks to ease review, so sometimes there are cross links between things. > So, I don't like the idea of adding the origin id to the record header. It's > only required in some occasions, and on some record types. That conclusion doesn't follow from your stated arguments. > And I'm worried > it might not even be enough in more complicated scenarios. It is not the only required conflict mechanism, and has never been claimed to be so. 
It is simply one piece of information needed, at various times. > Perhaps we need a more generic WAL record annotation system, where a plugin > can tack arbitrary information to WAL records. The extra information could > be stored in the WAL record after the rmgr payload, similar to how backup > blocks are stored. WAL replay could just ignore the annotations, but a > replication system could use it to store the origin id or whatever extra > information it needs. Additional information required by logical replication will be handled by a new wal_level. The discussion here is about adding origin_node_id *only*, which needs to be added on each WAL record. One question I raised in my review was whether this extra information should be added by a variable length header, so I already asked this very question. So far there is no evidence that the additional code complexity would be warranted. If it became so in the future, it can be modified again. At this stage there is no need, so the proposal is to add the field to every WAL record without regard to the setting of wal_level because there is no measurable downside to doing so. The downsides of additional complexity are clear and real however, so I wish to avoid them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
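The filtering described above - do not ship a change back to the node it originally came from - is what the origin id in the record is meant to enable. A toy C sketch, with invented names, of what that filter could look like on the sending side when the LCR stream for a particular peer is generated:

/* Build the outgoing stream for one peer, skipping records whose origin
 * already is that peer.  Illustrative only; field and function names are
 * invented. */
#include <stdint.h>
#include <stdio.h>

typedef struct WalRecord
{
    uint16_t    origin_id;    /* node the change originally came from */
    const char *payload;
} WalRecord;

static void stream_to_peer(const WalRecord *wal, int n, uint16_t peer_id)
{
    for (int i = 0; i < n; i++)
    {
        if (wal[i].origin_id == peer_id)
            continue;                  /* peer already has this change */
        printf("send to node %u: %s (origin %u)\n",
               (unsigned) peer_id, wal[i].payload, (unsigned) wal[i].origin_id);
    }
}

int main(void)
{
    enum { NODE_A = 1, NODE_B = 2 };
    WalRecord wal_on_b[] = {
        { NODE_A, "UPDATE applied from A" },   /* filtered out below */
        { NODE_B, "local INSERT on B"     },
    };

    stream_to_peer(wal_on_b, 2, NODE_A);
    return 0;
}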
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 08:35, Robert Haas wrote: >> I expect it would be fine to have a tool that pulls LCRs out of WAL to >> prepare that to be sent to remote locations. Is that what you have in >> mind? > > Yes. I think it should be possible to generate LCRs from WAL, but I > think that the on-the-wire format for LCRs should be different from > the WAL format. Trying to use the same format for both things seems > like an unpleasant straightjacket. You're confusing things here. Yes, we can use a different on-the-wire format. No problem. As I've already said, the information needs to be in WAL first before we can put it into LCRs, so the "don't use same format" argument is not relevant as to why the info must be on the WAL record in the first place. And as said elsewhere, doing that does not cause problems in any of these areas: wal bloat, performance, long term restrictions on numbers of nodeids, as has so far been claimed on this thread. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20.06.2012 01:27, Kevin Grittner wrote: Andres Freund wrote: Yes, that's definitely a valid use-case. But that doesn't preclude the other - also not uncommon - use-case where you want to have different masters which all contain up-to-date data. I agree. I was just saying that while one requires an origin_id, the other doesn't. And those not doing MM replication definitely don't need it. I think it would be helpful to list a few concrete examples of this. The stereotypical multi-master scenario is that you have a single table that's replicated to two servers, and you can insert/update/delete on either server. Conflict resolution strategies vary. The reason we need an origin id in this scenario is that otherwise this will happen: 1. A row is updated on node A 2. Node B receives the WAL record from A, and updates the corresponding row in B. This generates a new WAL record. 3. Node A receives the WAL record from B, and updates the rows again. This again generates a new WAL record, which is replicated to B, and you loop indefinitely. If each WAL record carries an origin id, node A can use it to refrain from applying the WAL record it receives from B, which breaks the loop. However, note that in this simple scenario, if the logical log replay / conflict resolution is smart enough to recognize that the row has already been updated, because the old and the new rows are identical, the loop is broken at step 3 even without the origin id. That works for the newest-update-wins and similar strategies. So the origin id is not absolutely necessary in this case. Another interesting scenario is that you maintain a global counter, like in an inventory system, and conflicts are resolved by accumulating the updates. For example, if you do "UPDATE SET counter = counter + 1" simultaneously on two nodes, the result is that the counter is incremented by two. The avoid-update-if-already-identical optimization doesn't help in this case, so the origin id is necessary. Now, let's take the inventory system example further. There are actually two ways to update a counter. One is when an item is checked in or out of the warehouse, i.e. "UPDATE counter = counter + 1". Those updates should accumulate. But another operation resets the counter to a specific value, "UPDATE counter = 10", like when taking an inventory. That should not accumulate with other changes, but should be newest-update-wins. The origin id is not enough for that, because by looking at the WAL record and the origin id, you don't know which type of update it was. So, I don't like the idea of adding the origin id to the record header. It's only required on some occasions, and on some record types. And I'm worried it might not even be enough in more complicated scenarios. Perhaps we need a more generic WAL record annotation system, where a plugin can tack arbitrary information to WAL records. The extra information could be stored in the WAL record after the rmgr payload, similar to how backup blocks are stored. WAL replay could just ignore the annotations, but a replication system could use it to store the origin id or whatever extra information it needs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
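A compact C model of steps 1-3 above, showing how an origin id breaks the loop on apply. This is a toy with invented names, not the actual replay code.

/* Each node applies an incoming change only if it did not originate
 * locally; a change that comes back to its origin is dropped. */
#include <stdint.h>
#include <stdio.h>

typedef struct Change
{
    uint16_t origin_id;
    int      new_value;
} Change;

/* Returns 1 if the change was applied and should be forwarded onward. */
static int maybe_apply(uint16_t local_node, int *row, Change c)
{
    if (c.origin_id == local_node)
        return 0;              /* came back to its origin: stop the loop */
    *row = c.new_value;
    return 1;
}

int main(void)
{
    enum { NODE_A = 1, NODE_B = 2 };
    int row_on_a = 0, row_on_b = 0;

    /* Step 1: the row is updated on A. */
    Change c = { NODE_A, 42 };
    row_on_a = c.new_value;

    /* Step 2: B receives the change from A and applies it. */
    if (maybe_apply(NODE_B, &row_on_b, c))
        printf("B applied the change, row = %d\n", row_on_b);

    /* Step 3: A receives the same change back from B and skips it. */
    if (!maybe_apply(NODE_A, &row_on_a, c))
        printf("A skipped its own change, no infinite loop\n");
    return 0;
}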
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 11:26, Tom Lane wrote: > "Kevin Grittner" writes: >> Simon Riggs wrote: >>> The proposal is to use WAL to generate the logical change stream. >>> That has been shown in testing to be around x4 faster than having >>> a separate change stream, which must also be WAL logged (as Jan >>> noted). > >> Sure, that's why I want it. > > I think this argument is basically circular. The reason it's 4x faster > is that the WAL stream doesn't actually contain all the information > needed to generate LCRs (thus all the angst about maintaining catalogs > in sync, what to do about unfriendly datatypes, etc). By the time the > dust has settled and you have a workable system, you will have bloated > WAL and given back a large chunk of that multiple, thereby invalidating > the design premise. Or at least that's my prediction. The tests were conducted with the additional field added, so your prediction is not verified. The additional fields do not bloat WAL records - they take up exactly the same space as before. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
"Kevin Grittner" writes: > Simon Riggs wrote: >> The proposal is to use WAL to generate the logical change stream. >> That has been shown in testing to be around x4 faster than having >> a separate change stream, which must also be WAL logged (as Jan >> noted). > Sure, that's why I want it. I think this argument is basically circular. The reason it's 4x faster is that the WAL stream doesn't actually contain all the information needed to generate LCRs (thus all the angst about maintaining catalogs in sync, what to do about unfriendly datatypes, etc). By the time the dust has settled and you have a workable system, you will have bloated WAL and given back a large chunk of that multiple, thereby invalidating the design premise. Or at least that's my prediction. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund wrote: > I definitely agree that low-level apply is possible as a module. Sure change > extraction needs core support but I was talking about what you need to > implement it reusing the "plain" logical support... > > What I do not understand is how you want to prevent loops in a simple manner > without in core support: > > A generates a HEAP_INSERT record. Gets decoded into the lcr stream as a INSERT > action. > B reads the lcr stream from A and applies the changes. A new HEAP_INSERT > record. Gets decoded into the lcr stream as a INSERT action. > A reads the lcr stream from B and ??? > > At this point you need to prevent a loop. If you have the information where a > change originally happened (xl_origin_id = A in this case) you can have the > simple filter on A which ignores change records if lcr_origin_id == > local_replication_origin_id). See my email to Chris Browne, which I think covers this. It needs a bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but not a whole node ID. >> You need a backend-local hash table inside the wal reader process, and >> that hash table needs to map XIDs to node IDs. And you occasionally >> need to prune it, so that it doesn't eat too much memory. None of >> that sounds very hard. > Its not very hard. Its just more complex than what I propose(d). True, but not a whole lot more complex, and a moderate amount of complexity to save bit-space is a good trade. Especially when Tom has come down against eating up the bit space. And I agree with him. If we've only got 16 bits of padding to work with, we should surely be judicious in burning them when it can be avoided at the expense of a few hundred lines of code. >> > Btw, what do you mean with "conflating" the stream? I don't really see >> > that being proposed. >> It seems to me that you are intent on using the WAL stream as the >> logical change stream. I think that's a bad design. Instead, you >> should extract changes from WAL and then ship them around in a format >> that is specific to logical replication. > No, I don't want that. I think we will need some different format once we have > agreed how changeset extraction works. I think you are saying that you agree with me that the formats should be different, but that the LCR format is undecided as yet. If that is in fact what you are saying, great. We'll need to decide that, of course, but I think there is a lot of cool stuff that can be done that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
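For illustration, the backend-local map Robert describes could be as simple as the following C sketch. The names are invented; a real implementation inside the WAL reader would presumably use the existing hash table infrastructure and prune entries on commit/abort or once per checkpoint cycle.

/* Remember, per XID, which node a replayed transaction came from, and
 * forget the entry once the transaction is known to be complete.  The
 * single-slot-per-bucket scheme here is a toy, not a real hash table. */
#include <stdint.h>
#include <stdio.h>

#define MAP_SLOTS 256

typedef uint32_t TransactionId;

typedef struct XidOriginEntry
{
    TransactionId xid;           /* 0 means the slot is free */
    uint16_t      origin_node;
} XidOriginEntry;

static XidOriginEntry xid_origin_map[MAP_SLOTS];

static void remember_origin(TransactionId xid, uint16_t node)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    e->xid = xid;
    e->origin_node = node;
}

static int lookup_origin(TransactionId xid, uint16_t *node)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    if (e->xid != xid)
        return 0;
    *node = e->origin_node;
    return 1;
}

/* Prune when the transaction finishes, to bound memory use. */
static void forget_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_map[xid % MAP_SLOTS];
    if (e->xid == xid)
        e->xid = 0;
}

int main(void)
{
    uint16_t node;

    remember_origin(724, 42);              /* seen in a per-txn origin record */
    if (lookup_origin(724, &node))         /* later records of xid 724 */
        printf("xid 724 replays node %u, do not regenerate LCRs\n", (unsigned) node);
    forget_origin(724);                    /* transaction finished */
    return 0;
}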
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne wrote: > On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >>> Btw, what do you mean with "conflating" the stream? I don't really see that >>> being proposed. >> >> It seems to me that you are intent on using the WAL stream as the >> logical change stream. I think that's a bad design. Instead, you >> should extract changes from WAL and then ship them around in a format >> that is specific to logical replication. > > Yeah, that seems worth elaborating on. > > What has been said several times is that it's pretty necessary to > capture the logical changes into WAL. That seems pretty needful, in > order that the replication data gets fsync()ed avidly, and so that we > don't add in the race condition of needing to fsync() something *else* > almost exactly as avidly as is the case for WAL today.. Check. > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," they "physical" bits of WAL > will be useless. > > What I *want* you to mean is that there would be: > a) WAL readers that pull the "physical bits", and > b) WAL readers that just pull "logical bits." > > I expect it would be fine to have a tool that pulls LCRs out of WAL to > prepare that to be sent to remote locations. Is that what you have in > mind? Yes. I think it should be possible to generate LCRs from WAL, but I think that the on-the-wire format for LCRs should be different from the WAL format. Trying to use the same format for both things seems like an unpleasant straightjacket. This discussion illustrates why: we're talking about consuming scarce bit-space in WAL records for a feature that only a tiny minority of users will use, and it's still not really enough bit space. That stinks. If LCR transmission is a separate protocol, this problem can be engineered away at a higher level. Suppose we have three servers, A, B, and C, that are doing multi-master replication in a loop. A sends LCRs to B, B sends them to C, and C sends them back to A. Obviously, we need to make sure that each server applies each set of changes just once, but it suffices to have enough information in WAL to distinguish between replication transactions and non-replication transactions - that is, one bit. So suppose a change is made on server A. A generates LCRs from WAL, and tags each LCR with node_id = A. It then sends those LCRs to B. B applies them, flagging the apply transaction in WAL as a replication transaction, AND ALSO sends the LCRs to C. The LCR generator on B sees the WAL from apply, but because it's flagged as a replication transaction, it does not generate LCRs. So C receives LCRs from B just once, without any need for the node_id to to be known in WAL. C can now also apply those LCRs (again flagging the apply transaction as replication) and it can also skip sending them to A, because it seems that they originated at A. Now suppose we have a more complex topology. Suppose we have a cluster of four servers A .. D which, for improved tolerance against network outages, are all connected pairwise. Normally all the links are up, so each server sends all the LCRs it generates directly to all other servers. But how do we prevent cycles? A generates a change and sends it to B, C, and D. B then sees that the change came from A so it sends it to C and D. 
C, receiving that change, sees that it came from A via B, so it sends it to D again, whereupon D, which got it from C and knows that the origin is A, sends it to B, who will then send it right back over to D. Obviously, we've got an infinite loop here, so this topology will not work. However, there are several obvious ways to patch it by changing the LCR protocol. Most obviously, and somewhat stupidly, you could add a TTL. A bit smarter, you could have each LCR carry a LIST of node_ids that it had already visited, refusing to send it to any node it had already been to, instead of a single node_id. Smarter still, you could send handshaking messages around the cluster so that each node can build up a spanning tree and prefix each LCR it sends with the list of additional nodes to which the recipient must deliver it. So, normally, A would send a message to each of B, C, and D destined only for that node; but if the A-C link went down, A would choose either B or D and send each LCR to that node destined for that node *and C*; then, that node would forward the message on to C. Or perhaps you think this is too complex and not worth supporting anyway, and that might be true, but the point is that if you insist that all of the identifying information must be carried in WAL, you've pretty much ruled it out, because we are not going to put TTL fields, or lists of node IDs, or lists of destinations, in WAL. But there is no reason they can't be attached to LCRs.
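Of the protocol-level options listed above, the visited-node list is the easiest to picture. A small C sketch follows, purely illustrative and not a proposed wire format: each LCR accumulates the nodes it has passed through and is never forwarded to one of them again. A receiver would still need to discard duplicate deliveries (for example by origin node plus origin LSN), but the cycle is broken.

/* Each LCR carries the list of nodes it has already visited; a node never
 * forwards it to a node on that list.  Toy code with invented types. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

typedef struct Lcr
{
    char origin;
    char visited[MAX_NODES];
    int  n_visited;
    const char *change;
} Lcr;

static bool has_visited(const Lcr *lcr, char node)
{
    for (int i = 0; i < lcr->n_visited; i++)
        if (lcr->visited[i] == node)
            return true;
    return false;
}

static void forward(Lcr *lcr, char from, const char *peers)
{
    if (!has_visited(lcr, from))
        lcr->visited[lcr->n_visited++] = from;
    for (const char *p = peers; *p; p++)
    {
        if (has_visited(lcr, *p))
            continue;                      /* would create a cycle: skip */
        printf("%c forwards \"%s\" to %c\n", from, lcr->change, *p);
    }
}

int main(void)
{
    Lcr lcr = { 'A', { 'A' }, 1, "UPDATE t SET x = 1" };

    forward(&lcr, 'A', "BCD");   /* A sends to B, C and D              */
    forward(&lcr, 'B', "ACD");   /* B never sends back to A            */
    forward(&lcr, 'C', "ABD");   /* C only forwards to D, and so on    */
    return 0;
}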
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Yes, thats definitely a valid use-case. But that doesn't preclude > the other - also not uncommon - use-case where you want to have > different master which all contain up2date data. I agree. I was just saying that while one requires an origin_id, the other doesn't. And those not doing MM replication definitely don't need it. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote: > Simon Riggs wrote: > > If we use WAL in this way, multi-master implies that the data will > > *always* be in a loop. So in any configuration we must be able to > > tell difference between changes made by one node and another. > > Only if you assume that multi-master means identical databases all > replicating the same data to all the others. If I have 72 master > replicating non-conflicting data to one consolidated database, I > consider that to be multi-master, too. > ... > Of course, none of these databases have the same OID for any given > object, and there are numerous different schemas among the > replicating databases, so I need to get to table and column names > before the data is of any use to me. Yes, thats definitely a valid use-case. But that doesn't preclude the other - also not uncommon - use-case where you want to have different master which all contain up2date data. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Simon Riggs wrote: > The proposal is to use WAL to generate the logical change stream. > That has been shown in testing to be around x4 faster than having > a separate change stream, which must also be WAL logged (as Jan > noted). Sure, that's why I want it. > If we use WAL in this way, multi-master implies that the data will > *always* be in a loop. So in any configuration we must be able to > tell difference between changes made by one node and another. Only if you assume that multi-master means identical databases all replicating the same data to all the others. If I have 72 master replicating non-conflicting data to one consolidated database, I consider that to be multi-master, too. Especially if I have other types of databases replicating disjoint data to the same consolidated database, and the 72 sources have several databases replicating disjoint sets of data to them. We have about 1000 replications paths, none of which create a loop which can send data back to the originator or cause conflicts. Of course, none of these databases have the same OID for any given object, and there are numerous different schemas among the replicating databases, so I need to get to table and column names before the data is of any use to me. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote:
> > More seriously: Even if we don't put MM in core I think putting the basis
> > for it in core so that somebody can build such a solution reusing the
> > existing infrastructure is a sensible idea. Imo the only thing that
> > requires explicit support which is hard to add outside of core is
> > prevention of loops (aka this patch). Everything else should be doable
> > reusing the hopefully modular pieces.
> I don't think prevention of loops is hard to do outside of core
> either, unless you insist on tying "logical" replication so closely to
> WAL that anyone doing MMR is necessarily getting the change stream
> from WAL. In fact, I'd go so far as to say that the ONLY part of this
> that's hard to do outside of core is change extraction. Even
> low-level apply can be implemented as a loadable module.
I definitely agree that low-level apply is possible as a module. Sure, change extraction needs core support, but I was talking about what you need to implement it reusing the "plain" logical support...

What I do not understand is how you want to prevent loops in a simple manner without in-core support:

A: generates a HEAP_INSERT record; it gets decoded into the LCR stream as an INSERT action.
B: reads the LCR stream from A and applies the changes; the apply generates a new HEAP_INSERT record, which gets decoded into B's LCR stream as an INSERT action.
A: reads the LCR stream from B and ???

At this point you need to prevent a loop. If you have the information where a change originally happened (xl_origin_id = A in this case) you can have a simple filter on A which ignores change records if (lcr_origin_id == local_replication_origin_id).

> >> Right. If we decide we need this, and if we did decide to conflate
> >> the WAL stream, both of which I disagree with as noted above, then we
> >> still don't need it on every record. It would probably be sufficient
> >> for local transactions to do nothing at all (and we can implicitly
> >> assume that they have master node ID = local node ID) and transactions
> >> which are replaying remote changes to emit one record per XID per
> >> checkpoint cycle containing the remote node ID.
> >
> > Youve gone from a pretty trivial 150 line patch without any runtime/space
> > overhead to something *considerably* more complex in that case though.
> >
> > You suddently need to have relatively complex logic to remember which
> > information you got for a certain xid (and forget that information
> > afterwards) and whether you already logged that xid and you need to have
> > to decide about logging that information at multiple places.
> You need a backend-local hash table inside the wal reader process, and
> that hash table needs to map XIDs to node IDs. And you occasionally
> need to prune it, so that it doesn't eat too much memory. None of
> that sounds very hard.
It's not very hard. It's just more complex than what I propose(d).

> > Btw, what do you mean with "conflating" the stream? I don't really see
> > that being proposed.
> It seems to me that you are intent on using the WAL stream as the
> logical change stream. I think that's a bad design. Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.
No, I don't want that. I think we will need some different format once we have agreed how changeset extraction works.
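To spell out how little the origin-based filter above involves, a minimal sketch (illustrative names only, not the actual patch code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: when node A applies the LCR stream coming back
 * from B, it simply skips every change whose origin is A itself.
 */
static bool
should_apply_lcr(uint16_t lcr_origin_id,
                 uint16_t local_replication_origin_id)
{
    /* ignore changes that originally happened on this node */
    return lcr_origin_id != local_replication_origin_id;
}

With the origin travelling in the record itself, that comparison is essentially the whole loop-prevention logic on the apply side.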
Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 00:11, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. The evidence from prototypes shown at PgCon was that using the WAL in this way was very efficient and that this level of efficiency is necessary to make it work in a practical manner. Postgres would not be the first database to follow this broad design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Agreed. In this case, the origin node id information needs to be available on the WAL record so we can generate the logical changes (LCRs). > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. Restricting the number of node ids is being done so that there is no impact on anybody not using this feature. In later implementations, it should be possible to support greater numbers of node ids by having a variable length header. But that is an unnecessary complication for a first release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:59, Christopher Browne wrote: > But it's undesirable to pull *all* the bulk of contents of WAL around > if it's only part of the data that is going to get applied. On a > "physical streaming" replica, any logical data that gets captured will > be useless. And on a "logical replica," they "physical" bits of WAL > will be useless. Completely agree. What we're discussing here is the basic information content of a WAL record to enable the construction of suitable LCRs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:31, Kevin Grittner wrote: > I've done a lot of MM replication, > and so far have not had to use a topology which allowed loops. The proposal is to use WAL to generate the logical change stream. That has been shown in testing to be around x4 faster than having a separate change stream, which must also be WAL logged (as Jan noted). If we use WAL in this way, multi-master implies that the data will *always* be in a loop. So in any configuration we must be able to tell difference between changes made by one node and another. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > > On 13 June 2012 19:28, Andres Freund wrote: > >> This adds a new configuration parameter multimaster_node_id which > >> determines the id used for wal originating in one cluster. > > > > Looks good and it seems this aspect at least is commitable in this CF. > > > > Design decisions I think we need to review are > > > > * Naming of field. I think origin is the right term, borrowing from > > Slony. > > I have not read too deeply here, so maybe I am missing > some important detail here, but idea that users need > to coordinate a integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? That should be possible. I don't want to go there till we have the base stuff in but it shouldn't be too hard. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
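Purely to illustrate the kind of mapping being suggested (the names and the config format are invented; nothing like this exists in the patch): the UI could accept symbolic node names and translate them once into the small integer that is stored and transported.

#include <stdint.h>
#include <string.h>

/* Hypothetical: filled in from a config list such as "paris=1,london=2" */
typedef struct NodeNameMapping
{
    const char *name;
    uint16_t    node_id;
} NodeNameMapping;

static const NodeNameMapping node_map[] = {
    {"paris", 1},
    {"london", 2},
    {"tokyo", 3},
};

/* Resolve a symbolic node name to the internal 16-bit id; 0 = unknown. */
static uint16_t
resolve_node_name(const char *name)
{
    size_t      i;

    for (i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++)
    {
        if (strcmp(node_map[i].name, name) == 0)
            return node_map[i].node_id;
    }
    return 0;
}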
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 05:46, Robert Haas wrote: > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. The proposal is to read the WAL in order to generate LCRs. The information needs to be in the original data in order to allow it to be passed onwards. In a multi-master config there will be WAL records with many origin node ids at any time, so this information must be held at WAL record level not page level. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas wrote: >> Btw, what do you mean with "conflating" the stream? I don't really see that >> being proposed. > > It seems to me that you are intent on using the WAL stream as the > logical change stream. I think that's a bad design. Instead, you > should extract changes from WAL and then ship them around in a format > that is specific to logical replication. Yeah, that seems worth elaborating on. What has been said several times is that it's pretty necessary to capture the logical changes into WAL. That seems pretty needful, in order that the replication data gets fsync()ed avidly, and so that we don't add in the race condition of needing to fsync() something *else* almost exactly as avidly as is the case for WAL today.. But it's undesirable to pull *all* the bulk of contents of WAL around if it's only part of the data that is going to get applied. On a "physical streaming" replica, any logical data that gets captured will be useless. And on a "logical replica," they "physical" bits of WAL will be useless. What I *want* you to mean is that there would be: a) WAL readers that pull the "physical bits", and b) WAL readers that just pull "logical bits." I expect it would be fine to have a tool that pulls LCRs out of WAL to prepare that to be sent to remote locations. Is that what you have in mind? Or are you feeling that the "logical bits" shouldn't get captured in WAL altogether, so we need to fsync() them into a different stream of files? -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 19 June 2012 14:03, Tom Lane wrote: > "Every WAL record"? Why in heck would you attach it to every record? > Surely putting it in WAL page headers would be sufficient. We could > easily afford to burn a page switch (if not a whole segment switch) > when changing masters. This does appear to be a reasonable idea at first glance, since it seems that each node has just a single node id, but that is not the case. As we pass changes around we maintain the same origin id for a change, so there is a mix of origin node ids at the WAL record level, not the page level. The concept of originating node id is essentially same as that used in Slony. > I'm against the idea of eating any spare space we have in WAL record > headers for this purpose, anyway; there are likely to be more pressing > needs in future. Not sure what those pressing needs are, but I can't see any. What we are doing here is fairly important, just not as important as crash recovery. But then that has worked pretty much unchanged for some time now. I raised the possibility of having variable length headers, but there is no requirement for that yet. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote: > More seriously: Even if we don't put MM in core I think putting the basis for > it in core so that somebody can build such a solution reusing the existing > infrastructure is a sensible idea. Imo the only thing that requires explicit > support which is hard to add outside of core is prevention of loops (aka this > patch). Everything else should be doable reusing the hopefully modular pieces. I don't think prevention of loops is hard to do outside of core either, unless you insist on tying "logical" replication so closely to WAL that anyone doing MMR is necessarily getting the change stream from WAL. In fact, I'd go so far as to say that the ONLY part of this that's hard to do outside of core is change extraction. Even low-level apply can be implemented as a loadable module. >> Right. If we decide we need this, and if we did decide to conflate >> the WAL stream, both of which I disagree with as noted above, then we >> still don't need it on every record. It would probably be sufficient >> for local transactions to do nothing at all (and we can implicitly >> assume that they have master node ID = local node ID) and transactions >> which are replaying remote changes to emit one record per XID per >> checkpoint cycle containing the remote node ID. > Youve gone from a pretty trivial 150 line patch without any runtime/space > overhead to something *considerably* more complex in that case though. > > You suddently need to have relatively complex logic to remember which > information you got for a certain xid (and forget that information afterwards) > and whether you already logged that xid and you need to have to decide about > logging that information at multiple places. You need a backend-local hash table inside the wal reader process, and that hash table needs to map XIDs to node IDs. And you occasionally need to prune it, so that it doesn't eat too much memory. None of that sounds very hard. > Btw, what do you mean with "conflating" the stream? I don't really see that > being proposed. It seems to me that you are intent on using the WAL stream as the logical change stream. I think that's a bad design. Instead, you should extract changes from WAL and then ship them around in a format that is specific to logical replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
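For illustration, the XID-to-origin bookkeeping described above could look roughly like this (plain C sketch with invented names; a real version inside the WAL reader would presumably use the backend's existing hash table machinery instead):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* stand-in for the real typedef */

typedef struct XidOriginEntry
{
    TransactionId xid;              /* xid of the apply transaction */
    uint16_t      origin_node;      /* node the applied changes came from */
    bool          valid;
} XidOriginEntry;

#define XID_ORIGIN_TABLE_SIZE 1024  /* illustrative fixed size */

static XidOriginEntry xid_origin_table[XID_ORIGIN_TABLE_SIZE];

/* Called when the reader sees the per-xid "remote node" record. */
static void
remember_xid_origin(TransactionId xid, uint16_t origin_node)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    e->xid = xid;
    e->origin_node = origin_node;
    e->valid = true;
}

/* Origin for a decoded record's xid; 0 means "local transaction". */
static uint16_t
lookup_xid_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    return (e->valid && e->xid == xid) ? e->origin_node : 0;
}

/* Called at commit/abort or a checkpoint boundary so entries get pruned. */
static void
forget_xid_origin(TransactionId xid)
{
    XidOriginEntry *e = &xid_origin_table[xid % XID_ORIGIN_TABLE_SIZE];

    if (e->xid == xid)
        e->valid = false;
}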
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 20 June 2012 04:58, Marko Kreen wrote: > On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: >> On 13 June 2012 19:28, Andres Freund wrote: >>> This adds a new configuration parameter multimaster_node_id which determines >>> the id used for wal originating in one cluster. >> >> Looks good and it seems this aspect at least is commitable in this CF. >> >> Design decisions I think we need to review are >> >> * Naming of field. I think origin is the right term, borrowing from Slony. > > I have not read too deeply here, so maybe I am missing > some important detail here, but idea that users need > to coordinate a integer config parameter globally does not > sound too attractive to me. > > Why not limit integers to local storage only and map > them to string idents on config, UI and transport? Yes, that can be done. This is simply how the numbers are used internally, similar to oids. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs wrote: > On 13 June 2012 19:28, Andres Freund wrote: >> This adds a new configuration parameter multimaster_node_id which determines >> the id used for wal originating in one cluster. > > Looks good and it seems this aspect at least is commitable in this CF. > > Design decisions I think we need to review are > > * Naming of field. I think origin is the right term, borrowing from Slony. I have not read too deeply here, so maybe I am missing some important detail here, but idea that users need to coordinate a integer config parameter globally does not sound too attractive to me. Why not limit integers to local storage only and map them to string idents on config, UI and transport? -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund wrote: > Robert Haas wrote: >> Tom Lane wrote: >>> However, if we're dead set on doing it that way, let us put >>> information that is only relevant to logical replication records >>> into only the logical replication records. >> Right. If we decide we need this, and if we did decide to >> conflate the WAL stream, both of which I disagree with as noted >> above, then we still don't need it on every record. It would >> probably be sufficient for local transactions to do nothing at >> all (and we can implicitly assume that they have master node ID = >> local node ID) and transactions which are replaying remote >> changes to emit one record per XID per checkpoint cycle >> containing the remote node ID. > Youve gone from a pretty trivial 150 line patch without any > runtime/space overhead to something *considerably* more complex in > that case though. I think it might be worth it. I've done a lot of MM replication, and so far have not had to use a topology which allowed loops. Not only would you be reserving space in the WAL stream which was not useful for those not using MM replication, you would be reserving it when even many MM configurations would not need it. Now you could argue that the 16 bits you want to use are already there and are not yet used for anything; but there are two counter-arguments to that: you lose the opportunity to use them for something else, and you might want more than 16 bits. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote: > On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > > Andres Freund writes: > >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > ... (If you are thinking > of something sufficiently high-level that merging could possibly work, > then it's not WAL, and we shouldn't be trying to make the WAL > representation cater for it.) > >> > >> Do you really see this as such a big problem? > > > > It looks suspiciously like "I have a hammer, therefore every problem > > must be a nail". I don't like the design concept of cramming logical > > replication records into WAL in the first place. > > Me, neither. I think it's necessary to try to find a way of > generating logical replication records from WAL. But once generated, > I think those records should form their own stream, independent of > WAL. If you take the contrary position that they should be included > in WAL, then when you filter the WAL stream down to just the records > of interest to logical replication, you end up with a WAL stream with > holes in it, which is one of the things that Andres listed as an > unresolved design problem in his original email. Yes. > Moreover, this isn't necessary at all for single-master replication, > or even multi-source replication where each table has a single master. > It's only necessary for full multi-master replication, which we have > no consensus to include in core, and even if we did have a consensus > to include it in core, it certainly shouldn't be the first feature we > design. Well, you can't blame a patch/prototype trying to implement what it claims to implement ;) More seriously: Even if we don't put MM in core I think putting the basis for it in core so that somebody can build such a solution reusing the existing infrastructure is a sensible idea. Imo the only thing that requires explicit support which is hard to add outside of core is prevention of loops (aka this patch). Everything else should be doable reusing the hopefully modular pieces. > > However, if we're dead set on doing it that way, let us put information > > that is only relevant to logical replication records into only the > > logical replication records. > Right. If we decide we need this, and if we did decide to conflate > the WAL stream, both of which I disagree with as noted above, then we > still don't need it on every record. It would probably be sufficient > for local transactions to do nothing at all (and we can implicitly > assume that they have master node ID = local node ID) and transactions > which are replaying remote changes to emit one record per XID per > checkpoint cycle containing the remote node ID. Youve gone from a pretty trivial 150 line patch without any runtime/space overhead to something *considerably* more complex in that case though. You suddently need to have relatively complex logic to remember which information you got for a certain xid (and forget that information afterwards) and whether you already logged that xid and you need to have to decide about logging that information at multiple places. Btw, what do you mean with "conflating" the stream? I don't really see that being proposed. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane wrote: > Andres Freund writes: >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: ... (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) > >> Do you really see this as such a big problem? > > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. Me, neither. I think it's necessary to try to find a way of generating logical replication records from WAL. But once generated, I think those records should form their own stream, independent of WAL. If you take the contrary position that they should be included in WAL, then when you filter the WAL stream down to just the records of interest to logical replication, you end up with a WAL stream with holes in it, which is one of the things that Andres listed as an unresolved design problem in his original email. Moreover, this isn't necessary at all for single-master replication, or even multi-source replication where each table has a single master. It's only necessary for full multi-master replication, which we have no consensus to include in core, and even if we did have a consensus to include it in core, it certainly shouldn't be the first feature we design. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. Right. If we decide we need this, and if we did decide to conflate the WAL stream, both of which I disagree with as noted above, then we still don't need it on every record. It would probably be sufficient for local transactions to do nothing at all (and we can implicitly assume that they have master node ID = local node ID) and transactions which are replaying remote changes to emit one record per XID per checkpoint cycle containing the remote node ID. > Saving a couple bytes in each such record > is penny-wise and pound-foolish, I'm afraid; especially when you're > nailing down hard, unexpansible limits at the very beginning of the > development process in order to save those bytes. I completely agree. I think that, as Dan said upthread, having a 64 or 128 bit ID so that it can be generated automatically rather than configured by an administrator who must be careful not to duplicate node IDs in any pair of systems that could ever end up talking to each other would be a vast usability improvement. Perhaps systems A, B, and C are replicating to each other today, as are systems D and E. But now suppose that someone decides they want to replicate one table between A and D. Suddenly the node IDs have to be distinct where they didn't before, and now there's potentially a problem to hassle with that wouldn't have been an issue if the node IDs had been wide enough to begin with. It is not unusual for people to decide after-the-fact to begin replicating between machines where this wasn't originally anticipated and which may even be even be under different administrative control. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
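As a rough illustration of why a wide identifier space removes the coordination problem: with 64 bits a node can simply pick its own id at setup time with negligible collision risk. This is a sketch only; how such an id would actually be generated and stored is not part of the patch under discussion, and it assumes a platform with /dev/urandom.

#include <stdint.h>
#include <stdio.h>

/* Pick a node identifier at random instead of having an administrator
 * coordinate small integers across clusters; returns 0 on failure. */
static uint64_t
generate_node_id(void)
{
    uint64_t    id = 0;
    FILE       *f = fopen("/dev/urandom", "rb");

    if (f != NULL)
    {
        if (fread(&id, sizeof(id), 1, f) != 1)
            id = 0;
        fclose(f);
    }
    return id;
}

The cost, as noted elsewhere in the thread, is that carrying a wide id in every WAL record would enlarge the records, which is why the discussion keeps coming back to the 16 spare bits.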
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi, On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote: > Andres Freund writes: > > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: > >>> ... (If you are thinking > >>> of something sufficiently high-level that merging could possibly work, > >>> then it's not WAL, and we shouldn't be trying to make the WAL > >>> representation cater for it.) > > > > Do you really see this as such a big problem? > It looks suspiciously like "I have a hammer, therefore every problem > must be a nail". I don't like the design concept of cramming logical > replication records into WAL in the first place. There are - so far - no specific "logical replication records". Its a relatively minor amount of additional data under wal_level=logical for existing records. HEAP_UPDATE gets the old primary key on updates changing the pkey and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE| DELETE,HEAP2_MULTI_INSERT put their information in another XLogRecData block than the page to handle full page writes. Thats it. I can definitely understand hesitation about that, but I simply see no realistic way to solve the issues of existing replication solutions otherwise. Do you have a better idea to solve those than the above? Without significant complications of the backend code and without loads of additional writes going on? I *really* would like to hear them if you do. > However, if we're dead set on doing it that way, let us put information > that is only relevant to logical replication records into only the > logical replication records. I found, and still do, the idea of having the origin_id in there rather elegant. If people prefer adding the same block to all of the above xlog records: I can live with that and will then do so. It makes some things more complicated, but its not too bad. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote: >>> ... (If you are thinking >>> of something sufficiently high-level that merging could possibly work, >>> then it's not WAL, and we shouldn't be trying to make the WAL >>> representation cater for it.) > Do you really see this as such a big problem? It looks suspiciously like "I have a hammer, therefore every problem must be a nail". I don't like the design concept of cramming logical replication records into WAL in the first place. However, if we're dead set on doing it that way, let us put information that is only relevant to logical replication records into only the logical replication records. Saving a couple bytes in each such record is penny-wise and pound-foolish, I'm afraid; especially when you're nailing down hard, unexpansible limits at the very beginning of the development process in order to save those bytes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> Andres Freund writes:
> > On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote:
> >> ... (If you are thinking
> >> of something sufficiently high-level that merging could possibly work,
> >> then it's not WAL, and we shouldn't be trying to make the WAL
> >> representation cater for it.)
> >
> > The idea is that if youre replaying changes on node A originating from
> > node B you set the origin to *B* in the wal records that are generated
> > during that. So when B, in a bidirectional setup, replays the changes
> > that A has made it can simply ignore all changes which originated on
> > itself.
> This is most certainly not possible at the level of WAL.
Huh? This isn't used during normal crash-recovery replay. The information is used when decoding the wal into logical changes and applying those. It's just a common piece of information that's needed for a large number of records. Alternatively it could be added to all the records that need it, but that would smear the necessary logic - which is currently trivial - over more of the backend. And it would increase the actual size of wal, which this one did not.

> As I said above, we shouldn't be trying to shoehorn high level logical-
> replication commands into WAL streams. No good can come of confusing those
> concepts.
It's not doing anything high-level in there? All that patch does is embed one single piece of information in previously unused space.

I can follow the argument that you do not want *any* logical information in the wal. But as I said in the patchset-introducing email: I don't really see an alternative. Otherwise we would just duplicate all the locking/scalability issues of xlog as well as the amount of writes. Besides logging some more information when wal_level = logical in some particular records (HEAP_UPDATE|DELETE), and ensuring fpw's don't remove the record data in HEAP_(INSERT|UPDATE|DELETE) (patch 07/16), this is the only change that I really foresee being needed for doing the logical stuff.

Do you really see this as such a big problem?

Andres
--
Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote: >> ... (If you are thinking >> of something sufficiently high-level that merging could possibly work, >> then it's not WAL, and we shouldn't be trying to make the WAL >> representation cater for it.) > The idea is that if youre replaying changes on node A originating from node B > you set the origin to *B* in the wal records that are generated during that. > So when B, in a bidirectional setup, replays the changes that A has made it > can simply ignore all changes which originated on itself. This is most certainly not possible at the level of WAL. As I said above, we shouldn't be trying to shoehorn high level logical-replication commands into WAL streams. No good can come of confusing those concepts. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:17:01 PM Tom Lane wrote: > Andres Freund writes: > > On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote: > >> "Every WAL record"? Why in heck would you attach it to every record? > >> Surely putting it in WAL page headers would be sufficient. > > > > The idea is that you can have cascading, circular and whatever > > replication topologies if you include the "logical origin" of a wal > > causing action into it. > > That is, if you have nodes A(1) and B(2) and a insert happens on A the > > wal records generated by that will get an xl_origin_id = 1 and when it > > will be decoded and replayed on B it will *also* get the id 1. Only when > > a change originally is generated on Bit will get xl_origin_id = 2. > > None of this explains to me why each individual WAL record would need to > carry a separate copy of the indicator. We don't have multiple masters > putting records into the same WAL file, nor does it seem to me that it > could possibly be workable to merge WAL streams. (If you are thinking > of something sufficiently high-level that merging could possibly work, > then it's not WAL, and we shouldn't be trying to make the WAL > representation cater for it.) The idea is that if youre replaying changes on node A originating from node B you set the origin to *B* in the wal records that are generated during that. So when B, in a bidirectional setup, replays the changes that A has made it can simply ignore all changes which originated on itself. That works rather simple & performant if you have a conflict avoidance scheme. For many scenarios you need to be able to separate locally generated changes and changes that have been replayed from another node. Without support of something like this this is really hard to achieve. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote: >> "Every WAL record"? Why in heck would you attach it to every record? >> Surely putting it in WAL page headers would be sufficient. > The idea is that you can have cascading, circular and whatever replication > topologies if you include the "logical origin" of a wal causing action into > it. > That is, if you have nodes A(1) and B(2) and a insert happens on A the wal > records generated by that will get an xl_origin_id = 1 and when it will be > decoded and replayed on B it will *also* get the id 1. Only when a change > originally is generated on Bit will get xl_origin_id = 2. None of this explains to me why each individual WAL record would need to carry a separate copy of the indicator. We don't have multiple masters putting records into the same WAL file, nor does it seem to me that it could possibly be workable to merge WAL streams. (If you are thinking of something sufficiently high-level that merging could possibly work, then it's not WAL, and we shouldn't be trying to make the WAL representation cater for it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 04:12:47 AM Steve Singer wrote: > On 12-06-18 07:30 AM, Andres Freund wrote: > > Hrmpf #666. I will go through through the series commit-by-commit again > > to make sure everything compiles again. Reordinging this late definitely > > wasn't a good idea... > > > > I pushed a rebased version with all those fixups (and removal of the > > zeroRecPtr patch). > > Where did you push that rebased version to? I don't see an attachment, > or an updated patch in the commitfest app and your repo at > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summa > ry hasn't been updated in 5 days. To the 2ndquadrant internal repo. Which strangely doesn't help you. *Headdesk*. Pushed to the correct repo and manually verified. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi,

On Tuesday, June 19, 2012 08:03:04 AM Tom Lane wrote:
> Andres Freund writes:
> > On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
> >> What's the cost of going a lot higher? Because if one makes enough
> >> numerical space available, one can assign node identities without a
> >> coordinator, a massive decrease in complexity.
> >
> > It would increase the size of every wal record. We just have 16bit left
> > there by chance...
> "Every WAL record"? Why in heck would you attach it to every record?
> Surely putting it in WAL page headers would be sufficient. We could
> easily afford to burn a page switch (if not a whole segment switch)
> when changing masters.
The idea is that you can have cascading, circular and whatever replication topologies if you include the "logical origin" of a wal-causing action in it.

That is, if you have nodes A(1) and B(2) and an insert happens on A, the wal records generated by that will get an xl_origin_id = 1, and when it is decoded and replayed on B it will *also* get the id 1. Only when a change originally is generated on B will it get xl_origin_id = 2. That way you can easily have circular or hierarchical replication topologies including diamonds.

> I'm against the idea of eating any spare space we have in WAL record
> headers for this purpose, anyway; there are likely to be more pressing
> needs in future.
Every other solution to allowing this seems to be far more complicated than this, that's why I arrived at the conclusion that it's a good idea.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
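Schematically, the "16bit left there by chance" are alignment padding in the WAL record header, so the id can be carried without growing any record. The sketch below is a simplified illustration, not the real xlog.h declaration; field names and order are only indicative.

#include <stdint.h>

/*
 * Simplified sketch of a WAL record header. The last two bytes before the
 * aligned payload are currently unused padding; the patch's idea is to use
 * them as xl_origin_id, so the record size does not change at all.
 */
typedef struct SketchXLogRecord
{
    uint32_t    xl_crc;         /* CRC of the record */
    uint64_t    xl_prev;        /* pointer to the previous record */
    uint32_t    xl_xid;         /* transaction id */
    uint32_t    xl_tot_len;     /* total record length */
    uint32_t    xl_len;         /* length of rmgr data */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager */
    uint16_t    xl_origin_id;   /* node the change originated on;
                                 * previously unused padding */
} SketchXLogRecord;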
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Tuesday, June 19, 2012 12:47:54 AM Christopher Browne wrote: > On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund wrote: > > Hi Simon, > > > > On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: > >> On 13 June 2012 19:28, Andres Freund wrote: > >> > This adds a new configuration parameter multimaster_node_id which > >> > determines the id used for wal originating in one cluster. > >> > >> Looks good and it seems this aspect at least is commitable in this CF. > > > > I think we need to agree on the parameter name. It currently is > > 'multimaster_node_id'. In the discussion with Steve we got to > > "replication_node_id". I don't particularly like either. > > > > Other suggestions? > I wonder if it should be origin_node_id? That is the term Slony uses. I think in the slony context its clear that thats related to replication. Less so in a general postgres guc. So maybe replication_origin_node_id? > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is > >> quite a lot. Do we need that many? I think we may have need for a few > >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > >> 8 bits. Even if we don't need them in this release, I'd like to have > >> them. If they remain unused after a few releases, we may choose to > >> redeploy some of them as additional nodeids in future. I don't foresee > >> complaints that 256 master nodes is too few anytime soon, so we can > >> defer that decision. > > > > I wished we had some flag bits available before as well. I find 256 nodes > > a pretty low value to start with though, 4096 sounds better though, so I > > would be happy with 4 flag bits. I think for cascading setups and such > > you want to add node ids for every node, not only masters... > Even though the number of nodes that can reasonably participate in > replication is likely to be not too terribly large, it might be good > to allow larger values, in case someone is keen on encoding something > descriptive in the node number. Well, having a large number space makes the wal records bigger which is not something I can see us getting through with. People have gone through considerable length to avoid that. Perhaps we can have a mapping system catalog at some point that includes additional infromation to each node id like a name and where its at wrt replication... > I recall the Slony-II project having a notion of attaching a permanent > UUID-based node ID to each node. As long as there is somewhere decent > to find a symbolically significant node "name," I like the idea of the > ID *not* being in a tiny range, and being UUID/OID-like... I think adding 14 bytes (16bytes of an ooid - 2 bytes available) is out of question... > > Any opinions from others on this? > > > >> * Do we want origin_id as a parameter or as a setting in pgcontrol? > >> IIRC we go to a lot of trouble elsewhere to avoid problems with > >> changing on/off parameter values. I think we need some discussion to > >> validate where that should live. > > > > Hm. I don't really forsee any need to have it in pg_control. What do you > > want to protect against with that? > > It would need to be changeable anyway, because otherwise it would need to > > become a parameter for initdb which would suck for anybody migrating to > > use replication at some point. > > > > Do you want to protect against problems in replication setups after > > changing the value? > In Slony, changing the node ID is Not Something That Is Done. 
The ID > is captured in *way* too many places to be able to have any hope of > updating it in a coordinated way. I should be surprised if it wasn't > similarly troublesome here. If you update you will need to reset consuming nodes, yes. Imo thats still something else than disallowing changing the parameter entirely. Requiring initdb for that seems like it would make experimentation too hard. We need to allow at least changing the setting from no node id to an initial one. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Andres Freund writes: > On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote: >> What's the cost of going a lot higher? Because if one makes enough >> numerical space available, one can assign node identities without a >> coordinator, a massive decrease in complexity. > It would increase the size of every wal record. We just have 16bit left there > by chance... "Every WAL record"? Why in heck would you attach it to every record? Surely putting it in WAL page headers would be sufficient. We could easily afford to burn a page switch (if not a whole segment switch) when changing masters. I'm against the idea of eating any spare space we have in WAL record headers for this purpose, anyway; there are likely to be more pressing needs in future. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 11:50 AM, Andres Freund wrote: Hi Simon, I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to "replication_node_id". I don't particularly like either. Other suggestions? Other things that come to mind (for naming this parameter in the postgresql.conf) node_id origin_node_id local_node_id I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? 256 sounds a bit low to me as well. Sometimes the use case of a retail chain comes up where people want each store to have a postgresql instance and replicate back to a central office. I can think of many chains with more than 256 stores. Thanks, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 07:30 AM, Andres Freund wrote: Hrmpf #666. I will go through through the series commit-by-commit again to make sure everything compiles again. Reordinging this late definitely wasn't a good idea... I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch). Where did you push that rebased version to? I don't see an attachment, or an updated patch in the commitfest app and your repo at http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary hasn't been updated in 5 days. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund wrote: > Hi Simon, > > On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: >> On 13 June 2012 19:28, Andres Freund wrote: >> > This adds a new configuration parameter multimaster_node_id which >> > determines the id used for wal originating in one cluster. >> >> Looks good and it seems this aspect at least is commitable in this CF. > I think we need to agree on the parameter name. It currently is > 'multimaster_node_id'. In the discussion with Steve we got to > "replication_node_id". I don't particularly like either. > > Other suggestions? I wonder if it should be origin_node_id? That is the term Slony uses. >> Design decisions I think we need to review are >> >> * Naming of field. I think origin is the right term, borrowing from Slony. > I think it fits as well. > >> * Can we add the origin_id dynamically to each WAL record? Probably no >> need, but lets consider why and document that. > Not sure what you mean? Its already set in XLogInsert to > current_replication_origin_id which defaults to the value of the guc? > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is >> quite a lot. Do we need that many? I think we may have need for a few >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe >> 8 bits. Even if we don't need them in this release, I'd like to have >> them. If they remain unused after a few releases, we may choose to >> redeploy some of them as additional nodeids in future. I don't foresee >> complaints that 256 master nodes is too few anytime soon, so we can >> defer that decision. > I wished we had some flag bits available before as well. I find 256 nodes a > pretty low value to start with though, 4096 sounds better though, so I would > be happy with 4 flag bits. I think for cascading setups and such you want to > add node ids for every node, not only masters... Even though the number of nodes that can reasonably participate in replication is likely to be not too terribly large, it might be good to allow larger values, in case someone is keen on encoding something descriptive in the node number. If you restrict the number to a tiny range, then you'll be left wanting some other mapping. At one point, I did some work trying to get a notion of named nodes implemented in Slony; gave up on it, as the coordination process was wildly too bug-prone. In our environment, at Afilias, we have used quasi-symbolic node numbers that encoded something somewhat meaningful about the environment. That seems better to me than the risky "kludge" of saying: - The first node I created is node #1 - The second one is node #2. - The third and fourth are #3 and #4 - I dropped node #2 due to a problem, and thus the "new node 2" is #5. That numbering scheme gets pretty anti-intuitive fairly quickly, from whence we took the approach of having a couple digits indicating data centre followed by a digit indicating which node in that data centre. If that all sounds incoherent, well, the more nodes you have around, the more difficult it becomes to make sure you *do* have a coherent picture of your cluster. I recall the Slony-II project having a notion of attaching a permanent UUID-based node ID to each node. As long as there is somewhere decent to find a symbolically significant node "name," I like the idea of the ID *not* being in a tiny range, and being UUID/OID-like... > Any opinions from others on this? > >> * Do we want origin_id as a parameter or as a setting in pgcontrol? 
>> IIRC we go to a lot of trouble elsewhere to avoid problems with >> changing on/off parameter values. I think we need some discussion to >> validate where that should live. > Hm. I don't really forsee any need to have it in pg_control. What do you want > to protect against with that? > It would need to be changeable anyway, because otherwise it would need to > become a parameter for initdb which would suck for anybody migrating to use > replication at some point. > > Do you want to protect against problems in replication setups after changing > the value? In Slony, changing the node ID is Not Something That Is Done. The ID is captured in *way* too many places to be able to have any hope of updating it in a coordinated way. I should be surprised if it wasn't similarly troublesome here. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote: > On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund wrote: > >> * Size of field. 16 bits is enough for 32,000 master nodes, which is > >> quite a lot. Do we need that many? I think we may have need for a few > >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > >> 8 bits. Even if we don't need them in this release, I'd like to have > >> them. If they remain unused after a few releases, we may choose to > >> redeploy some of them as additional nodeids in future. I don't foresee > >> complaints that 256 master nodes is too few anytime soon, so we can > >> defer that decision. > > > > I wished we had some flag bits available before as well. I find 256 nodes > > a pretty low value to start with though, 4096 sounds better though, so I > > would be happy with 4 flag bits. I think for cascading setups and such > > you want to add node ids for every node, not only masters... > > > > Any opinions from others on this? > > What's the cost of going a lot higher? Because if one makes enough > numerical space available, one can assign node identities without a > coordinator, a massive decrease in complexity. It would increase the size of every wal record. We just have 16bit left there by chance... Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund wrote: >> * Size of field. 16 bits is enough for 32,000 master nodes, which is >> quite a lot. Do we need that many? I think we may have need for a few >> flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe >> 8 bits. Even if we don't need them in this release, I'd like to have >> them. If they remain unused after a few releases, we may choose to >> redeploy some of them as additional nodeids in future. I don't foresee >> complaints that 256 master nodes is too few anytime soon, so we can >> defer that decision. > I wished we had some flag bits available before as well. I find 256 nodes a > pretty low value to start with though, 4096 sounds better though, so I would > be happy with 4 flag bits. I think for cascading setups and such you want to > add node ids for every node, not only masters... > > Any opinions from others on this? What's the cost of going a lot higher? Because if one makes enough numerical space available, one can assign node identities without a coordinator, a massive decrease in complexity. -- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi Simon, On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: > On 13 June 2012 19:28, Andres Freund wrote: > > This adds a new configuration parameter multimaster_node_id which > > determines the id used for wal originating in one cluster. > > Looks good and it seems this aspect at least is committable in this CF. I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to "replication_node_id". I don't particularly like either. Other suggestions? > Design decisions I think we need to review are > > * Naming of field. I think origin is the right term, borrowing from Slony. I think it fits as well. > * Can we add the origin_id dynamically to each WAL record? Probably no > need, but let's consider why and document that. Not sure what you mean? It's already set in XLogInsert to current_replication_origin_id, which defaults to the value of the GUC? > * Size of field. 16 bits is enough for 32,000 master nodes, which is > quite a lot. Do we need that many? I think we may have need for a few > flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe > 8 bits. Even if we don't need them in this release, I'd like to have > them. If they remain unused after a few releases, we may choose to > redeploy some of them as additional nodeids in future. I don't foresee > complaints that 256 master nodes is too few anytime soon, so we can > defer that decision. I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? > * Do we want origin_id as a parameter or as a setting in pgcontrol? > IIRC we go to a lot of trouble elsewhere to avoid problems with > changing on/off parameter values. I think we need some discussion to > validate where that should live. Hm. I don't really foresee any need to have it in pg_control. What do you want to protect against with that? It would need to be changeable anyway, because otherwise it would need to become a parameter for initdb which would suck for anybody migrating to use replication at some point. Do you want to protect against problems in replication setups after changing the value? > * Is there any overhead from CRC of WAL record because of this? I'd > guess not, but just want to double check thinking. I cannot imagine that there is. The actual size of the record didn't change because of alignment padding (both on 32 and 64 bit systems). > Presumably there is no issue wrt Heikki's WAL changes? I assume not, > but ask since I know you're reviewing that also. It might clash minimally because of offset changes or such, but otherwise there shouldn't be much. Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
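The "alignment padding" point can be illustrated with a toy layout: the pre-patch record header ends its fixed-width fields on a 2-byte boundary, so the compiler inserts two padding bytes before the next 4-byte-aligned field, and a uint16 origin id can simply occupy that hole. The structs below only approximate the real XLogRecord of that era and are not copied from the patch.

#include <stdint.h>

typedef struct RecordHdrWithoutOrigin
{
    uint32_t xl_tot_len;    /* total length of entire record */
    uint32_t xl_xid;        /* xact id */
    uint32_t xl_len;        /* length of rmgr data */
    uint8_t  xl_info;       /* flag bits */
    uint8_t  xl_rmid;       /* resource manager id */
    /* two bytes of implicit padding end up here */
    uint32_t xl_prev_hi;    /* stand-ins for the two halves of XLogRecPtr */
    uint32_t xl_prev_lo;
    uint32_t xl_crc;        /* CRC of the record */
} RecordHdrWithoutOrigin;

typedef struct RecordHdrWithOrigin
{
    uint32_t xl_tot_len;
    uint32_t xl_xid;
    uint32_t xl_len;
    uint8_t  xl_info;
    uint8_t  xl_rmid;
    uint16_t xl_origin_id;  /* reuses the former padding bytes */
    uint32_t xl_prev_hi;
    uint32_t xl_prev_lo;
    uint32_t xl_crc;
} RecordHdrWithOrigin;

/* both layouts come out at 28 bytes on common 32- and 64-bit ABIs */
_Static_assert(sizeof(RecordHdrWithOrigin) == sizeof(RecordHdrWithoutOrigin),
               "origin id fits into the existing alignment padding");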
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 13 June 2012 19:28, Andres Freund wrote: > This adds a new configuration parameter multimaster_node_id which determines > the id used for wal originating in one cluster. Looks good and it seems this aspect at least is committable in this CF. Design decisions I think we need to review are * Naming of field. I think origin is the right term, borrowing from Slony. * Can we add the origin_id dynamically to each WAL record? Probably no need, but let's consider why and document that. * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional nodeids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. * Do we want origin_id as a parameter or as a setting in pgcontrol? IIRC we go to a lot of trouble elsewhere to avoid problems with changing on/off parameter values. I think we need some discussion to validate where that should live. * Is there any overhead from CRC of WAL record because of this? I'd guess not, but just want to double check thinking. Presumably there is no issue wrt Heikki's WAL changes? I assume not, but ask since I know you're reviewing that also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
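On the "add the origin_id dynamically" question, the mechanism described in the thread is that the id is stamped into every record at insert time from a process-global variable, which apply code temporarily overrides while replaying changes that originated elsewhere. A hedged sketch of that flow follows; the identifiers current_replication_origin_id, RepNodeId, InvalidMultimasterNodeId and xl_origin_id come from the thread, while the surrounding helper code is invented for illustration.

#include <stdint.h>

typedef uint16_t RepNodeId;

#define InvalidMultimasterNodeId 0

/* normally the value of the node-id GUC; apply code overrides it while
 * replaying a remote transaction so the resulting WAL carries the
 * remote origin rather than the local one */
RepNodeId current_replication_origin_id = InvalidMultimasterNodeId;

typedef struct XLogRecordStub
{
    /* ... other header fields elided ... */
    RepNodeId xl_origin_id;
} XLogRecordStub;

/* in the real patch this assignment lives inside XLogInsert */
static void
stamp_origin(XLogRecordStub *rec)
{
    rec->xl_origin_id = current_replication_origin_id;
}

/* apply-side usage: wrap replay of a remote transaction */
static void
apply_remote_transaction(RepNodeId remote_origin, void (*replay)(void))
{
    RepNodeId saved = current_replication_origin_id;

    current_replication_origin_id = remote_origin;
    replay();                   /* WAL emitted here is tagged with remote_origin */
    current_replication_origin_id = saved;
}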
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 02:43:26 AM Steve Singer wrote: > On 12-06-13 01:27 PM, Andres Freund wrote: > > The previous mail contained a patch with a mismerge caused by reording > > commits. Corrected version attached. > > > > Thanks to Steve Singer for noticing this quickly. > > Attached is a more complete review of this patch. > > I agree that we will need to identify the node a change originated at. > We will not only want this for multi-master support but it might also be > very helpful once we introduce things like cascaded replicas. Using a 16 > bit integer for this purpose makes sense to me. Good. > This patch (with the previous numbered patches already applied), still > doesn't compile. > > gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith > -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute > -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include > -D_GNU_SOURCE -c -o xact.o xact.c > xact.c: In function 'xact_redo_commit': > xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn' > make[4]: *** [xact.o] Error 1 > > Your complete patch set did compile. origin_lsn gets added as part of > your 12th patch. Managing so many related patches is going to be a > pain, but it beats one big patch. I don't think this patch actually > requires the origin_lsn change. Hrmpf #666. I will go through the series commit-by-commit again to make sure everything compiles again. Reordering this late definitely wasn't a good idea... I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch). > > Code Review > - > src/backend/utils/misc/guc.c > @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] = > }, > > { > +{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER, > +gettext_noop("node id for multimaster."), > +NULL > +}, > + &guc_replication_origin_id, > +InvalidMultimasterNodeId, InvalidMultimasterNodeId, > MaxMultimasterNodeId, > +NULL, assign_replication_node_id, NULL > +}, > > I'd rather see us refer to this as the 'node id for logical replication' > over the multimaster node id. I think that terminology will be less > controversial. You're right. 'replication_node_id' or such should be ok? > BootStrapXLOG in xlog.c > creates an XLogRecord structure; should it set xl_origin_id to > InvalidMultimasterNodeId? > WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a > well-defined value. I think InvalidMultimasterNodeId should be safe > even for a no-op record from a node that actually has a node_id set > on real records. Good catches. > backend/replication/logical/logical.c: > XLogRecPtr current_replication_origin_lsn = {0, 0}; > > This variable isn't used/referenced by this patch; it probably belongs as > part of the later patch. Yea, just as the usage of origin_lsn in the above compile failure. -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
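Putting the rename and Steve's wording suggestion together, the ConfigureNamesInt[] entry quoted above would end up looking roughly like this. This is a sketch of a possible revision, not a committed change; the limits and the assign hook are carried over unchanged from the posted patch.

    {
        {"replication_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
            gettext_noop("Sets the node id for logical replication."),
            NULL
        },
        &guc_replication_origin_id,
        InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
        NULL, assign_replication_node_id, NULL
    },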
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-13 01:27 PM, Andres Freund wrote: The previous mail contained a patch with a mismerge caused by reording commits. Corrected version attached. Thanks to Steve Singer for noticing this quickly. Attached is a more complete review of this patch. I agree that we will need to identify the node a change originated at. We will not only want this for multi-master support but it might also be very helpful once we introduce things like cascaded replicas. Using a 16 bit integer for this purpose makes sense to me. This patch (with the previous numbered patches already applied), still doesn't compile. gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include -D_GNU_SOURCE -c -o xact.o xact.c xact.c: In function 'xact_redo_commit': xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn' make[4]: *** [xact.o] Error 1 Your complete patch set did compile. origin_lsn gets added as part of your 12th patch. Managing so many related patches is going to be a pain, but it beats one big patch. I don't think this patch actually requires the origin_lsn change. Code Review - src/backend/utils/misc/guc.c @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] = }, { +{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER, +gettext_noop("node id for multimaster."), +NULL +}, + &guc_replication_origin_id, +InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId, +NULL, assign_replication_node_id, NULL +}, I'd rather see us refer to this as the 'node id for logical replication' over the multimaster node id. I think that terminology will be less controversial. Multi-master means different things to different people and it is still unclear what forms of multi-master we will have in-core. For example, most people don't consider Slony to be multi-master replication. If a future version of Slony were to feed off logical replication (instead of triggers) then I think it would need this node id to determine which node a particular change has come from. The description inside the gettext call should probably be "Sets the node id for ." to be consistent with the description of the rest of the GUCs. BootStrapXLOG in xlog.c creates an XLogRecord structure; should it set xl_origin_id to InvalidMultimasterNodeId? WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a well-defined value. I think InvalidMultimasterNodeId should be safe even for a no-op record from a node that actually has a node_id set on real records. backend/replication/logical/logical.c: XLogRecPtr current_replication_origin_lsn = {0, 0}; This variable isn't used/referenced by this patch; it probably belongs as part of the later patch. Steve Andres
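To make the BootStrapXLOG/WriteEmptyXLOG point concrete: both places hand-build a record rather than going through XLogInsert, so the new field would have to be initialized explicitly. A rough sketch of the kind of change being asked for, with the surrounding initialization abbreviated and not copied from actual patch code:

    record = (XLogRecord *) ((char *) page + SizeOfXLogLongPHD);
    /* ... existing initialization of xl_prev, xl_xid, xl_tot_len, xl_len,
     * xl_info and xl_rmid stays as it is today ... */
    record->xl_origin_id = InvalidMultimasterNodeId;   /* new: don't leave it unset */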
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
The previous mail contained a patch with a mismerge caused by reording commits. Corrected version attached. Thanks to Steve Singer for noticing this quickly. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From efb28b1f931a30738faac83703810b598a82a6ee Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 9 Jun 2012 14:49:34 +0200 Subject: [PATCH] Introduce the concept that wal has a 'origin' node One solution to avoid loops when doing wal based logical replication in topologies which are more complex than one unidirectional transport is introducing the concept of a 'origin_id' into the wal stream. Luckily there is some padding in the XLogRecord struct that allows us to add that field without further bloating the struct. This solution was chosen because it allows for just about any topology and is inobstrusive. This adds a new configuration parameter multimaster_node_id which determines the id used for wal originating in one cluster. When applying changes from wal from another cluster code can set the variable current_replication_origin_id. This is a global variable because passing it through everything which can generate wal would be far to intrusive. --- src/backend/access/transam/xact.c | 54 + src/backend/access/transam/xlog.c |3 +- src/backend/access/transam/xlogreader.c |2 + src/backend/replication/logical/Makefile |2 +- src/backend/replication/logical/logical.c | 19 + src/backend/utils/misc/guc.c | 19 + src/backend/utils/misc/postgresql.conf.sample |3 ++ src/include/access/xlog.h |4 +- src/include/access/xlogdefs.h |2 + src/include/replication/logical.h | 22 ++ 10 files changed, 100 insertions(+), 30 deletions(-) create mode 100644 src/backend/replication/logical/logical.c create mode 100644 src/include/replication/logical.h diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 3cc2bfa..79ec19a 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -36,8 +36,9 @@ #include "libpq/be-fsstubs.h" #include "miscadmin.h" #include "pgstat.h" -#include "replication/walsender.h" +#include "replication/logical.h" #include "replication/syncrep.h" +#include "replication/walsender.h" #include "storage/lmgr.h" #include "storage/predicate.h" #include "storage/procarray.h" @@ -4545,12 +4546,13 @@ xactGetCommittedChildren(TransactionId **ptr) * actions for which the order of execution is critical. 
*/ static void -xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn, - TransactionId *sub_xids, int nsubxacts, - SharedInvalidationMessage *inval_msgs, int nmsgs, - RelFileNode *xnodes, int nrels, - Oid dbId, Oid tsId, - uint32 xinfo) +xact_redo_commit_internal(TransactionId xid, RepNodeId originating_node, + XLogRecPtr lsn, XLogRecPtr origin_lsn, + TransactionId *sub_xids, int nsubxacts, + SharedInvalidationMessage *inval_msgs, int nmsgs, + RelFileNode *xnodes, int nrels, + Oid dbId, Oid tsId, + uint32 xinfo) { TransactionId max_xid; int i; @@ -4659,8 +4661,8 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn, * Utility function to call xact_redo_commit_internal after breaking down xlrec */ static void -xact_redo_commit(xl_xact_commit *xlrec, - TransactionId xid, XLogRecPtr lsn) +xact_redo_commit(xl_xact_commit *xlrec, RepNodeId originating_node, + TransactionId xid, XLogRecPtr lsn) { TransactionId *subxacts; SharedInvalidationMessage *inval_msgs; @@ -4670,27 +4672,26 @@ xact_redo_commit(xl_xact_commit *xlrec, /* invalidation messages array follows subxids */ inval_msgs = (SharedInvalidationMessage *) &(subxacts[xlrec->nsubxacts]); - xact_redo_commit_internal(xid, lsn, subxacts, xlrec->nsubxacts, - inval_msgs, xlrec->nmsgs, - xlrec->xnodes, xlrec->nrels, - xlrec->dbId, - xlrec->tsId, - xlrec->xinfo); + xact_redo_commit_internal(xid, originating_node, lsn, xlrec->origin_lsn, + subxacts, xlrec->nsubxacts, inval_msgs, + xlrec->nmsgs, xlrec->xnodes, xlrec->nrels, + xlrec->dbId, xlrec->tsId, xlrec->xinfo); } /* * Utility function to call xact_redo_commit_internal for compact form of message. */ static void -xact_redo_commit_compact(xl_xact_commit_compact *xlrec, - TransactionId xid, XLogRecPtr lsn) +xact_redo_commit_compact(xl_xact_commit_compact *xlrec, RepNodeId originating_node, + TransactionId xid, XLogRecPtr lsn) { - xact_redo_commit_internal(xid, lsn, xlrec->subxacts, xlrec->nsubxact