Re: [HACKERS] Replication identifiers, take 4

2015-04-29 Thread David Rowley
On 25 April 2015 at 00:32, Andres Freund and...@anarazel.de wrote:


 Attached is a patch that does this, and some more, renaming. That was
 more work than I'd imagined.  I've also made the internal naming in
 origin.c more consistent/simpler and did a bunch of other cleanup.


Hi,

It looks like bowerbird is not too happy with this, neither is my build
environment.

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-04-29%2019%3A31%3A06

The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the
DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.

Regards

David Rowley


UINT16_MAX_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-29 Thread Petr Jelinek

On 29/04/15 22:12, David Rowley wrote:

On 25 April 2015 at 00:32, Andres Freund and...@anarazel.de
mailto:and...@anarazel.de wrote:


Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined.  I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.


Hi,

It looks like bowerbird is not too happy with this, neither is my build
environment.

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-04-29%2019%3A31%3A06

The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the
DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.



I think correct fix is using PG_UINT16_MAX.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-29 Thread Robert Haas
On Tue, Apr 28, 2015 at 10:00 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 4/24/15 4:29 PM, Andres Freund wrote:
 Shouldn't this be backed up by pg_dump(all?)?

 Given it deals with LSNs and is, quite fundamentally due to concurrency, non 
 transactional, I doubt it's worth it. The other side's slots also aren't 
 going to be backed up as pg dump obviously can't know about then. So the 
 represented data won't make much sense.

 I agree it might not be the best match.  But we should consider that we
 used to say, a backup by pg_dumpall plus configuration files is a
 complete backup.  Now we have replication slots and possibly replication
 identifiers and maybe similar features in the future that are not
 covered by this backup method.

That's true.  But if you did backup the replication slots with
pg_dump, and then you restored them as part of restoring the dump,
your replication setup would be just as broken as if you had never
backed up those replication slots at all.

I think the problem here is that replication slots are part of
*cluster* configuration, not individual node configuration.  If you
back up a set of nodes that make up a cluster, and then restore them,
you might hope that you will end up with working slots established
between the same pairs of machines that had working slots between them
before.  But I don't see a way to make that happen when you look at it
from the point of view of backing up and restoring just one node.  As
we get better clustering facilities into core, we may develop more
instances of this problem; the best way of solving it is not clear to
me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Petr Jelinek

On 24/04/15 14:32, Andres Freund wrote:

On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:

On 04/17/2015 11:54 AM, Andres Freund wrote:

I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for external 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.


Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

I just realized that it talks about replication identifier as the new
fundamental concept. The system table is called pg_replication_identifier.
But that's like talking about index identifiers, instead of just indexes,
and calling the system table pg_index_oid.

The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I think
we should roughly do a search-replace of replication identifier -
replication origin to the patch. Or even transaction origin.


Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined.  I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.



There are few oversights in the renaming:

doc/src/sgml/func.sgml:
+Return the replay position for the passed in replication
+identifier. The parameter parameterflush/parameter

src/include/replication/origin.h:
+ * replication_identifier.h

+extern PGDLLIMPORT RepOriginId replident_sesssion_origin;
+extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn;
+extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp;

(these are used then in multiple places in code afterwards and also 
mentioned in comment above replorigin_advance)


src/backend/replication/logical/origin.c:
+   ereport(ERROR,
+   (errcode(ERRCODE_OBJECT_IN_USE),
+errmsg(replication identiefer

+   default:
+   elog(PANIC, replident_redo: unknown op code

+ * This function may only be called if a origin was setup with
+ * replident_session_setup().

I also think the replident_checkpoint file should be renamed to 
replorigin_checkpoint.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Andres Freund
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin logging
 to the relevant XLogInsert() callsites for external 2 byte identifiers
 and removes the pad-reusing version in the interest of moving forward.
 
 Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
 
 I just realized that it talks about replication identifier as the new
 fundamental concept. The system table is called pg_replication_identifier.
 But that's like talking about index identifiers, instead of just indexes,
 and calling the system table pg_index_oid.
 
 The important concept this patch actually adds is the *origin* of each
 transaction. That term is already used in some parts of the patch. I think
 we should roughly do a search-replace of replication identifier -
 replication origin to the patch. Or even transaction origin.

Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined.  I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.

I'm pretty happy with this state.

Greetings,

Andres Freund
From fc406a87f2e2ac08a7ac112ef6b75be1e8256a16 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 24 Apr 2015 14:25:56 +0200
Subject: [PATCH] Introduce replication progress tracking infrastructure.
 (v2.0)

When implementing a replication solution ontop of logical decoding two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented in this commit, consist
out of three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies.

Most of this could also be implemented in userspace, e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all former capabilities, except
that there's only 2^16 different origins; but now they integrate with
logical decoding. Additionally more functionality is accessible via SQL.
Since the commit timestamp infrastructure has also been introduced in
9.5 that's not a problem.

For now the number of origins for which the replication progress can be
tracked is determined by the max_replication_slots GUC. That GUC is not
a perfect match to configure this, but there doesn't seem to be
sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Robert Haas, Steve Singer
Discussion: 20150216002155.gi15...@awork2.anarazel.de,
20140923182422.ga15...@alap3.anarazel.de,
20131114172632.ge7...@alap2.anarazel.de
---
 contrib/test_decoding/Makefile  |3 +-
 contrib/test_decoding/expected/replorigin.out   |  141 +++
 contrib/test_decoding/sql/replorigin.sql|   64 +
 contrib/test_decoding/test_decoding.c   |   28 +
 doc/src/sgml/catalogs.sgml  |  123 ++
 doc/src/sgml/filelist.sgml  |1 +
 doc/src/sgml/func.sgml  |  201 ++-
 doc/src/sgml/logicaldecoding.sgml   |   35 +-
 doc/src/sgml/postgres.sgml  |1 +
 doc/src/sgml/replication-origins.sgml   |   93 ++
 src/backend/access/heap/heapam.c|   19 +
 src/backend/access/rmgrdesc/Makefile|4 +-
 src/backend/access/rmgrdesc/replorigindesc.c|   61 +
 src/backend/access/rmgrdesc/xactdesc.c  |   24 +-
 src/backend/access/transam/commit_ts.c  |   53 +-
 src/backend/access/transam/rmgr.c   |1 +
 src/backend/access/transam/xact.c   |   72 +-
 src/backend/access/transam/xlog.c   |8 +
 src/backend/access/transam/xloginsert.c |   27 +-
 src/backend/access/transam/xlogreader.c |6 +
 src/backend/catalog/Makefile|2 +-
 src/backend/catalog/catalog.c   |8 +-
 src/backend/catalog/system_views.sql|7 +
 src/backend/replication/logical/Makefile|3 +-
 src/backend/replication/logical/decode.c|   49 +-
 

Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Andres Freund
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut pete...@gmx.net 
wrote:
On 4/24/15 8:32 AM, Andres Freund wrote:
 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin
logging
 to the relevant XLogInsert() callsites for external 2 byte
identifiers
 and removes the pad-reusing version in the interest of moving
forward.

 Putting aside the 2 vs. 4 byte identifier issue, let's discuss
naming:

 I just realized that it talks about replication identifier as the
new
 fundamental concept. The system table is called
pg_replication_identifier.
 But that's like talking about index identifiers, instead of just
indexes,
 and calling the system table pg_index_oid.

 The important concept this patch actually adds is the *origin* of
each
 transaction. That term is already used in some parts of the patch. I
think
 we should roughly do a search-replace of replication identifier -
 replication origin to the patch. Or even transaction origin.
 
 Attached is a patch that does this, and some more, renaming. That was
 more work than I'd imagined.  I've also made the internal naming in
 origin.c more consistent/simpler and did a bunch of other cleanup.
 
 I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?


Given it deals with LSNs and is, quite fundamentally due to concurrency, non 
transactional, I doubt it's worth it. The other side's slots also aren't going 
to be backed up as pg dump obviously can't know about then. So the represented 
data won't make much sense.


Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-24 Thread Peter Eisentraut
On 4/24/15 8:32 AM, Andres Freund wrote:
 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 On 04/17/2015 11:54 AM, Andres Freund wrote:
 I've attached a rebased patch, that adds decision about origin logging
 to the relevant XLogInsert() callsites for external 2 byte identifiers
 and removes the pad-reusing version in the interest of moving forward.

 Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

 I just realized that it talks about replication identifier as the new
 fundamental concept. The system table is called pg_replication_identifier.
 But that's like talking about index identifiers, instead of just indexes,
 and calling the system table pg_index_oid.

 The important concept this patch actually adds is the *origin* of each
 transaction. That term is already used in some parts of the patch. I think
 we should roughly do a search-replace of replication identifier -
 replication origin to the patch. Or even transaction origin.
 
 Attached is a patch that does this, and some more, renaming. That was
 more work than I'd imagined.  I've also made the internal naming in
 origin.c more consistent/simpler and did a bunch of other cleanup.
 
 I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-23 Thread Andres Freund
On 2015-03-24 22:22:29 +0100, Petr Jelinek wrote:
 Perhaps we should have some Logical replication developer documentation
 section and put all those three as subsections of that?

So I just played around with this and it didn't find it
worthwhile. Primarily because there's lots of uses of logical decoding
besides building a logical replication solution. I've reverted to
putting it into a separate chapter 'besides' logical decoding.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-22 Thread Petr Jelinek

On 21/04/15 22:36, Andres Freund wrote:

On 2015-04-21 16:26:08 -0400, Robert Haas wrote:

On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote:

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
   from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
   details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : manually set the replication
   progress to a value. Primarily useful for copying values from other
   systems and such.
* pg_replication_progress_get : How far did replay progress for a
   certain origin
* pg_get_replication_progress : SRF returning the replay progress for
   all origin.

Any comments?


Why are we using functions for this rather than DDL?


Unless I miss something the only two we really could use ddl for is
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.



I think the only value of having DDL for this would be consistency 
(catalog objects are created via DDL) as it looks like something that 
will be called only by extensions and not users during normal operation.



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Simon Riggs
On 20 April 2015 at 09:28, Andres Freund and...@anarazel.de wrote:

 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
  I just realized that it talks about replication identifier as the new
  fundamental concept. The system table is called
 pg_replication_identifier.
  But that's like talking about index identifiers, instead of just
 indexes,
  and calling the system table pg_index_oid.
 
  The important concept this patch actually adds is the *origin* of each
  transaction. That term is already used in some parts of the patch. I
 think
  we should roughly do a search-replace of replication identifier -
  replication origin to the patch. Or even transaction origin.

 Sounds good to me.


+1

-- 
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Alvaro Herrera
Andres Freund wrote:

 I'm working on changing this (I've implemented the missing WAL
 bits). I'd like to discuss the new terms for a sec, before I go and
 revise the docs.
 
 I'm now calling the feature 'replication progress tracking'. There's
 replication origins and there's progress tracking infrastructure that
 tracks how far data from a replication origin has replicated.

Sounds good to me.

 Catalog wise there's an actual table 'pg_replication_origin' that maps
 between 'roident' and 'roname'. There's a pg_replication_progress view
 (used to be named pg_replication_identifier_progress). I'm not sure if
 the latter name isn't too generic? Maybe
 pg_logical_replication_progress?

I think if we wanted pg_logical_replication_progress (and I don't
really agree that we do) then we would add the logical bit to the
names above as well.  This seems unnecessary.  pg_replication_progress
seems okay to me.

 I've now named the functions:
 
 * pg_replication_origin_create
 * pg_replication_origin_drop
 * pg_replication_origin_get (map from name to id)
 * pg_replication_progress_setup_origin : configure session to replicate
   from a specific origin
 * pg_replication_progress_reset_origin
 * pg_replication_progress_is_replaying : Is a origin configured for the 
 session
 * pg_replication_progress_advance : manually set the replication
   progress to a value. Primarily useful for copying values from other
   systems and such.

These all look acceptable to me.

 * pg_replication_progress_get : How far did replay progress for a
   certain origin
 * pg_get_replication_progress : SRF returning the replay progress for
   all origin.

This combination seems confusing.  In some other thread not too long ago
there was the argument that all functions 'get' something, so that verb
should not appear in the function name.  That would call for
pg_replication_progress on the singleton.  Maybe to distinguish the
SRF, add all as a suffix?

 * pg_replication_progress_setup_tx_details : configure per transaction
   details (LSN and timestamp currently)

Not sure about the tx here.  We use xact as an abbreviation for
transaction in most places.  If nowadays we don't like that term,
maybe just spell out transaction in full.  I assume this function
pairs up with pg_replication_progress_setup_origin, yes?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-20 10:28:02 +0200, Andres Freund wrote:
 On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
  I just realized that it talks about replication identifier as the new
  fundamental concept. The system table is called pg_replication_identifier.
  But that's like talking about index identifiers, instead of just indexes,
  and calling the system table pg_index_oid.
 
  The important concept this patch actually adds is the *origin* of each
  transaction. That term is already used in some parts of the patch. I think
  we should roughly do a search-replace of replication identifier -
  replication origin to the patch. Or even transaction origin.

 Sounds good to me.

I'm working on changing this (I've implemented the missing WAL
bits). I'd like to discuss the new terms for a sec, before I go and
revise the docs.

I'm now calling the feature 'replication progress tracking'. There's
replication origins and there's progress tracking infrastructure that
tracks how far data from a replication origin has replicated.

Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
  from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
  details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : manually set the replication
  progress to a value. Primarily useful for copying values from other
  systems and such.
* pg_replication_progress_get : How far did replay progress for a
  certain origin
* pg_get_replication_progress : SRF returning the replay progress for
  all origin.

Any comments?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote:
 Andres Freund wrote:
  Catalog wise there's an actual table 'pg_replication_origin' that maps
  between 'roident' and 'roname'. There's a pg_replication_progress view
  (used to be named pg_replication_identifier_progress). I'm not sure if
  the latter name isn't too generic? Maybe
  pg_logical_replication_progress?
 
 I think if we wanted pg_logical_replication_progress (and I don't
 really agree that we do) then we would add the logical bit to the
 names above as well.  This seems unnecessary.  pg_replication_progress
 seems okay to me.

Cool.

  * pg_replication_progress_get : How far did replay progress for a
certain origin
  * pg_get_replication_progress : SRF returning the replay progress for
all origin.
 
 This combination seems confusing.  In some other thread not too long ago
 there was the argument that all functions 'get' something, so that verb
 should not appear in the function name.

 That would call for pg_replication_progress on the singleton.

Hm. I don't like that. That'd e.g. clash with the above view. I think
it's good to distinguish between functions (that have a verb in the
name) and views/tables (that don't).

I agree that the above combination isn't optimal. Although pg_get (and
pg_stat_get) is what's used for a lot of other SRF backed views. Maybe
naming the SRF pg_get_all_replication_progress?

  * pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
 
 Not sure about the tx here.  We use xact as an abbreviation for
 transaction in most places.

Oh, yea. Xact is more consistent.

 If nowadays we don't like that term, maybe just spell out
 transaction in full.  I assume this function pairs up with
 pg_replication_progress_setup_origin, yes?

pg_replication_progress_setup_origin sets up the per session state,
setup_xact_details the per replayed transaction state.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Robert Haas
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote:
 I've now named the functions:

 * pg_replication_origin_create
 * pg_replication_origin_drop
 * pg_replication_origin_get (map from name to id)
 * pg_replication_progress_setup_origin : configure session to replicate
   from a specific origin
 * pg_replication_progress_reset_origin
 * pg_replication_progress_setup_tx_details : configure per transaction
   details (LSN and timestamp currently)
 * pg_replication_progress_is_replaying : Is a origin configured for the 
 session
 * pg_replication_progress_advance : manually set the replication
   progress to a value. Primarily useful for copying values from other
   systems and such.
 * pg_replication_progress_get : How far did replay progress for a
   certain origin
 * pg_get_replication_progress : SRF returning the replay progress for
   all origin.

 Any comments?

Why are we using functions for this rather than DDL?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-21 Thread Andres Freund
On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
 On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote:
  I've now named the functions:
 
  * pg_replication_origin_create
  * pg_replication_origin_drop
  * pg_replication_origin_get (map from name to id)
  * pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
  * pg_replication_progress_reset_origin
  * pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
  * pg_replication_progress_is_replaying : Is a origin configured for the 
  session
  * pg_replication_progress_advance : manually set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
  * pg_replication_progress_get : How far did replay progress for a
certain origin
  * pg_get_replication_progress : SRF returning the replay progress for
all origin.
 
  Any comments?
 
 Why are we using functions for this rather than DDL?

Unless I miss something the only two we really could use ddl for is
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Andres Freund
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
 I just realized that it talks about replication identifier as the new
 fundamental concept. The system table is called pg_replication_identifier.
 But that's like talking about index identifiers, instead of just indexes,
 and calling the system table pg_index_oid.
 
 The important concept this patch actually adds is the *origin* of each
 transaction. That term is already used in some parts of the patch. I think
 we should roughly do a search-replace of replication identifier -
 replication origin to the patch. Or even transaction origin.

Sounds good to me.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:45 PM, Petr Jelinek wrote:

The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend
would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
don't do that.

The change does nothing useful, since I doubt anyone will ever need
  32768 nodes in their cluster.


And if they did there would be other much bigger problems than
replication identifier being 16bit (it's actually 65534 as it's
unsigned btw).


Can you name some of the bigger problems you'd have?

Obviously, if you have 10 high-volume OLTP nodes connected to a 
single server, feeding transactions as a continous stream, you're going 
to choke the system. But you might have 10 tiny satellite databases 
that sync up with the master every few hours, and each of them do only a 
few updates per day.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Andres Freund
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote:
 Can you name some of the bigger problems you'd have?

Several parts of the system are O(#max_replication_slots). Having 100k
outgoing logical replication slots is going to be expensive.

Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be
the biggest problem.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:36 PM, Simon Riggs wrote:

On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote:

To be honest, I'm not entirely sure what we're arguing over.


When arguing over something you consider small, it is customary to allow
the author precedence. We can't do things our own way all the time.


Sure, I'm not going to throw a tantrum if Andres commits this as it is.


I said that IMO the difference in WAL size is so small that we should just
use 4-byte OIDs for the replication identifiers, instead of trying to make
do with 2 bytes. Not because I find it too likely that you'll run out of
IDs (although it could happen), but more to make replication IDs more like
all other system objects we have. Andreas did some pgbench benchmarking to
show that the difference in WAL size is about 10%. The WAL records
generated by pgbench happen to have just the right sizes so that the 2-3
extra bytes bump them over to the next alignment boundary. That's why there
is such a big difference - on average it'll be less. I think that's
acceptable, Andreas seems to think otherwise. But if the WAL size really is
so precious, we could remove the two padding bytes from XLogRecord, instead
of dedicating them for the replication ids. That would be an even better
use for them.


The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend would
be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
that.


That's a straw man argument. I'm not saying we should never use 2 byte 
values anywhere. OID is usually used as the primary key in system 
tables. There are exceptions, but that is nevertheless the norm. I'm 
saying that saving in WAL size is not worth making an exception here, 
and we should go with the simplest option of using OIDs.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-20 Thread Heikki Linnakangas

On 04/17/2015 11:54 AM, Andres Freund wrote:

I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for external 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward.


Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

I just realized that it talks about replication identifier as the new 
fundamental concept. The system table is called 
pg_replication_identifier. But that's like talking about index 
identifiers, instead of just indexes, and calling the system table 
pg_index_oid.


The important concept this patch actually adds is the *origin* of each 
transaction. That term is already used in some parts of the patch. I 
think we should roughly do a search-replace of replication identifier 
- replication origin to the patch. Or even transaction origin.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-17 Thread Simon Riggs
On 17 April 2015 at 09:54, Andres Freund and...@anarazel.de wrote:


 Hrmpf. Says the person that used a lot of padding, without much
 discussion, for the WAL level infrastructure making pg_rewind more
 maintainable.


Sounds bad. What padding are we talking about?

--
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


Re: [HACKERS] Replication identifiers, take 4

2015-04-17 Thread Simon Riggs
On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote:


 To be honest, I'm not entirely sure what we're arguing over.


When arguing over something you consider small, it is customary to allow
the author precedence. We can't do things our own way all the time.

I didn't much like pg_rewind, but it doesn't hurt and you like it, so I
didn't object. We've all got better things to do.


 I said that IMO the difference in WAL size is so small that we should just
 use 4-byte OIDs for the replication identifiers, instead of trying to make
 do with 2 bytes. Not because I find it too likely that you'll run out of
 IDs (although it could happen), but more to make replication IDs more like
 all other system objects we have. Andreas did some pgbench benchmarking to
 show that the difference in WAL size is about 10%. The WAL records
 generated by pgbench happen to have just the right sizes so that the 2-3
 extra bytes bump them over to the next alignment boundary. That's why there
 is such a big difference - on average it'll be less. I think that's
 acceptable, Andreas seems to think otherwise. But if the WAL size really is
 so precious, we could remove the two padding bytes from XLogRecord, instead
 of dedicating them for the replication ids. That would be an even better
 use for them.


The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend would
be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
that.

The change does nothing useful, since I doubt anyone will ever need 32768
nodes in their cluster.

Increasing WAL size for any non-zero amount is needlessly wasteful for a
change with only cosmetic value. But for a change that has significant
value for database resilience, it is a sensible use of bytes.

+1 to Andres' very reasonable suggestion. Lets commit this and go home.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
http://www.2ndquadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


Re: [HACKERS] Replication identifiers, take 4

2015-04-17 Thread Petr Jelinek

On 17/04/15 22:36, Simon Riggs wrote:


I said that IMO the difference in WAL size is so small that we
should just use 4-byte OIDs for the replication identifiers, instead
of trying to make do with 2 bytes. Not because I find it too likely
that you'll run out of IDs (although it could happen), but more to
make replication IDs more like all other system objects we have.
Andreas did some pgbench benchmarking to show that the difference in
WAL size is about 10%. The WAL records generated by pgbench happen
to have just the right sizes so that the 2-3 extra bytes bump them
over to the next alignment boundary. That's why there is such a big
difference - on average it'll be less. I think that's acceptable,
Andreas seems to think otherwise. But if the WAL size really is so
precious, we could remove the two padding bytes from XLogRecord,
instead of dedicating them for the replication ids. That would be an
even better use for them.


The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend
would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
don't do that.

The change does nothing useful, since I doubt anyone will ever need
 32768 nodes in their cluster.



And if they did there would be other much bigger problems than 
replication identifier being 16bit (it's actually 65534 as it's 
unsigned btw).


Considering the importance and widespread use of replication I think we 
should really make sure the related features have small overhead.



Increasing WAL size for any non-zero amount is needlessly wasteful for a
change with only cosmetic value. But for a change that has significant
value for database resilience, it is a sensible use of bytes.

+1 to Andres' very reasonable suggestion. Lets commit this and go home.



+1

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-12 Thread Heikki Linnakangas

On 04/12/2015 02:56 AM, Petr Jelinek wrote:

On 10/04/15 18:03, Andres Freund wrote:

On 2015-04-07 17:08:16 +0200, Andres Freund wrote:

I'm starting benchmarks now.


What I'm benchmarking here is the WAL overhead, since that's what we're
debating.

The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.

I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%

To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.



My opinion is that 10% of WAL size difference is quite high price to pay
so that we can keep the padding for some other, yet unknown feature that
hasn't come up in several years, which would need those 2 bytes.

But if we are willing to pay it then we can really go all the way and
just use Oids...


This needs to be weighed against removing the padding bytes altogether. 
See attached. That would reduce the WAL size further when you don't need 
replication IDs. It's very straightforward, but need to do some 
performance/scalability testing to make sure that using memcpy instead 
of a straight 32-bit assignment doesn't hurt performance, since it 
happens in very performance critical paths.


I'm surprised there's such a big difference between the extern and 
padding versions above. At a quick approximation, storing the ID as a 
separate fragment, along with XLogRecordDataHeaderShort and 
XLogRecordDataHeaderLong, should add one byte of overhead plus the ID 
itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 
bytes for 4-byte identifiers. Does that mean that the average record 
length is only about 30 bytes? That's what it seems like, if adding the 
extern 2 byte identifiers added about 10% of overhead compared to the 
padding 2 byte identifiers version. That doesn't sound right, 30 bytes 
is very little. Perhaps the size of the records created by pgbench 
happen to cross a 8-byte alignment boundary at that point, making a big 
difference. In another workload, there might be no difference at all, 
due to alignment.


Also, you don't need to tag every record type with the replication ID. 
All indexam records can skip it, for starters, since logical decoding 
doesn't care about them. That should remove a fair amount of bloat.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24cf520..09934f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -963,10 +963,10 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		 * Now that xl_prev has been filled in, calculate CRC of the record
 		 * header.
 		 */
-		rdata_crc = rechdr-xl_crc;
+		memcpy(rdata_crc, rechdr-xl_crc, sizeof(pg_crc32));
 		COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
 		FIN_CRC32C(rdata_crc);
-		rechdr-xl_crc = rdata_crc;
+		memcpy(rechdr-xl_crc, rdata_crc, sizeof(pg_crc32));
 
 		/*
 		 * All the record data, including the header, is now ready to be
@@ -4685,7 +4685,7 @@ BootStrapXLOG(void)
 	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record-xl_tot_len - SizeOfXLogRecord);
 	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
-	record-xl_crc = crc;
+	memcpy(record-xl_crc, crc, sizeof(pg_crc32));
 
 	/* Create first XLOG segment file */
 	use_existent = false;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 88209c3..fbe97b1 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -724,7 +724,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr-xl_info = info;
 	rechdr-xl_rmid = rmid;
 	rechdr-xl_prev = InvalidXLogRecPtr;
-	rechdr-xl_crc = rdata_crc;
+	memcpy(rechdr-xl_crc, rdata_crc, sizeof(pg_crc32));
 
 	return hdr_rdt;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a4124d9..80a48ba 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -666,6 +666,9 @@ static bool
 ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 {
 	pg_crc32	crc;
+	pg_crc32	rec_crc;
+
+	memcpy(rec_crc, record-xl_crc, sizeof(pg_crc32));
 
 	/* Calculate the CRC */
 	INIT_CRC32C(crc);
@@ -674,7 +677,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 	COMP_CRC32C(crc, (char *) record, 

Re: [HACKERS] Replication identifiers, take 4

2015-04-11 Thread Petr Jelinek

On 10/04/15 18:03, Andres Freund wrote:

On 2015-04-07 17:08:16 +0200, Andres Freund wrote:

I'm starting benchmarks now.


What I'm benchmarking here is the WAL overhead, since that's what we're
debating.

The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.

I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%

To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.



My opinion is that 10% of WAL size difference is quite high price to pay 
so that we can keep the padding for some other, yet unknown feature that 
hasn't come up in several years, which would need those 2 bytes.


But if we are willing to pay it then we can really go all the way and 
just use Oids...


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Andres Freund
On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:
 And finally I have issue with how the new identifiers are allocated.
 Currently, if you create identifier 'foo', remove identifier 'foo' and
 create identifier 'bar', the identifier 'bar' will have same id as the old
 'foo' identifier. This can be problem because the identifier id is used as
 origin of the data and the replication solution using the replication
 identifiers can end up thinking that data came from node 'bar' even though
 they came from the node 'foo' which no longer exists. This can have bad
 effects for example on conflict detection or debugging problems with
 replication.
 
 Maybe another reason to use standard Oids?

As the same reason exists for oids, just somewhat less likely, I don't
see it as a reason for much. It's really not that hard to get oid
conflicts once your server has lived for a while. As soon as the oid
counter has wrapped around once, it's not unlikely to have
conflicts. And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point. The only material
difference this makes is that it's much easier to notice the problem
during development.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Andres Freund
On 2015-03-24 23:11:26 -0400, Robert Haas wrote:
 On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund and...@2ndquadrant.com wrote:
  At a quick glance, this basic design seems workable. I would suggest
  expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
  small price to pay, to make it work more like everything else in the 
  system.
 
  I don't know. Growing from 3 to 5 byte overhead per relevant record (or
  even 0 to 5 in case the padding is reused) is rather noticeable. If we
  later find it to be a limit (I seriously doubt that), we can still
  increase it in a major release without anybody really noticing.
 
 You might notice that Heikki is making the same point here that I've
 attempted to make multiple times in the past: limiting to replication
 identifier to 2 bytes because that's how much padding space you happen
 to have available is optimizing for the wrong thing.  What we should
 be optimizing for is consistency and uniformity of design.  System
 catalogs have OIDs, so this one should, too.  You're not going to be
 able to paper over the fact that the column has some funky data type
 that is unlike every other column in the system.
 
 To the best of my knowledge, the statement that there is a noticeable
 performance cost for those 2 extra bytes is also completely
 unsupported by any actual benchmarking.

I'm starting benchmarks now.

But I have to say: I find the idea that you'd need more than 2^16
identifiers anytime soon not very credible. The likelihood that
replication identifiers are the limiting factor towards that seems
incredibly small. Just consider how you'd apply changes from so many
remotes; how to stream changes to them; how to even configure such a
complex setup. We can easily change the size limits in the next major
release without anybody being inconvenienced.

We've gone through quite some lengths reducing the overhead of WAL. I
don't understand why it's important that we do not make compromises
here; but why that doesn't matter elsewhere.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Andres Freund
On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
 And with temp tables (or much more extremely WITH OID tables)
 and such it's not that hard to reach that point.

Oh, and obviously toast data. A couple tables with toasted columns is
also a good way to rapidly consume oids.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Jim Nasby

On 4/7/15 9:30 AM, Andres Freund wrote:

On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:

And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the old
'foo' identifier. This can be problem because the identifier id is used as
origin of the data and the replication solution using the replication
identifiers can end up thinking that data came from node 'bar' even though
they came from the node 'foo' which no longer exists. This can have bad
effects for example on conflict detection or debugging problems with
replication.

Maybe another reason to use standard Oids?


As the same reason exists for oids, just somewhat less likely, I don't
see it as a reason for much. It's really not that hard to get oid
conflicts once your server has lived for a while. As soon as the oid
counter has wrapped around once, it's not unlikely to have
conflicts. And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point. The only material
difference this makes is that it's much easier to notice the problem
during development.


Why not just create a sequence? I suspect it may not be as fast to 
assign as an OID, but it's not like you'd be doing this all the time.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Andres Freund
 Why not just create a sequence? I suspect it may not be as fast to assign as
 an OID, but it's not like you'd be doing this all the time.

What does that have to do with the thread?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Jim Nasby

On 4/7/15 10:58 AM, Andres Freund wrote:

Why not just create a sequence? I suspect it may not be as fast to assign as
an OID, but it's not like you'd be doing this all the time.


What does that have to do with the thread?


The original bit was...


And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the old
'foo' identifier. This can be problem because the identifier id is used as
origin of the data and the replication solution using the replication
identifiers can end up thinking that data came from node 'bar' even though
they came from the node 'foo' which no longer exists. This can have bad
effects for example on conflict detection or debugging problems with
replication.

Maybe another reason to use standard Oids?


Wasn't the reason for using OIDs so that we're not doing the equivalent 
of max(identifier) + 1?


Perhaps I'm just confused...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-04-07 Thread Michael Paquier
On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
 And with temp tables (or much more extremely WITH OID tables)
 and such it's not that hard to reach that point.

 Oh, and obviously toast data. A couple tables with toasted columns is
 also a good way to rapidly consume oids.

You are forgetting as well large objects on the stack, when client
application does not assign an OID by itself.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-03-28 Thread Petr Jelinek

So I did some more in depth look, I do have couple of comments.

I would really like to have something like Logical Replication 
Infrastructure doc section that would have both decoding and 
identifiers (and possibly even CommitTs) underneath.


There is typo in docs:

+ para
+   The optional functionfilter_by_origin_cb/function callback
+   is called to determine wheter data that has been replayed


wheter - whether


And finally I have issue with how the new identifiers are allocated. 
Currently, if you create identifier 'foo', remove identifier 'foo' and 
create identifier 'bar', the identifier 'bar' will have same id as the 
old 'foo' identifier. This can be problem because the identifier id is 
used as origin of the data and the replication solution using the 
replication identifiers can end up thinking that data came from node 
'bar' even though they came from the node 'foo' which no longer exists. 
This can have bad effects for example on conflict detection or debugging 
problems with replication.


Maybe another reason to use standard Oids?

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-03-24 Thread Robert Haas
On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund and...@2ndquadrant.com wrote:
 At a quick glance, this basic design seems workable. I would suggest
 expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
 small price to pay, to make it work more like everything else in the system.

 I don't know. Growing from 3 to 5 byte overhead per relevant record (or
 even 0 to 5 in case the padding is reused) is rather noticeable. If we
 later find it to be a limit (I seriously doubt that), we can still
 increase it in a major release without anybody really noticing.

You might notice that Heikki is making the same point here that I've
attempted to make multiple times in the past: limiting to replication
identifier to 2 bytes because that's how much padding space you happen
to have available is optimizing for the wrong thing.  What we should
be optimizing for is consistency and uniformity of design.  System
catalogs have OIDs, so this one should, too.  You're not going to be
able to paper over the fact that the column has some funky data type
that is unlike every other column in the system.

To the best of my knowledge, the statement that there is a noticeable
performance cost for those 2 extra bytes is also completely
unsupported by any actual benchmarking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-03-24 Thread Andres Freund
Hi,

Here's the next version of this patch. I've tried to address the biggest
issue (documentation) and some more. Now that both the more flexible
commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
looks much cleaner.

Changes:
* Loads of documentation and comments
* Revamped locking strategy. There's now a LWLock protecting all the
  replication progress array and spinlock for the individual sltos.
* Simpler checkpoint format.
* A new pg_replication_identifier_progress() function returning a
  individual identifier's replication progress; previously there was
  only the view showing all of them.
* Lots of minor cleanup
* Some more tests

I'd greatly appreciate some feedback on the documentation. I'm not
entirely sure into how much detail to go; and where exactly in the docs
to place it. I do wonder if we shouldn't merge this with the logical
decoding section and whether we could also document commit timestamps
somewhere in there.

I've verified that this correctly works on a stanby; replication
progress is replicated correctly.  I think there's two holes though:
Changing the replication progress without replicating anything and
dropping a replication identifier with some replication progress might
not work correctly. That's fairly easily fixed and I intend to do so.

Other than that I'm not aware of outstanding issues.

Comments?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 32ff44f3f045b0805301af8d94d1f11336c6ce8f Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 15 Mar 2015 18:11:01 +0100
Subject: [PATCH] Introduce replication identifiers: v1.0

Replication identifiers are used to identify nodes in a replication
setup, identify changes that are created due to replication and to keep
track of replication progress.

Primarily this is useful because solving these in other ways is
possible, but ends up being much less efficient and more complicated. We
don't want to require replication solutions to reimplement logic for
this independently. The infrastructure is intended to be generic enough
to be reusable.

This infrastructure replaces the 'nodeid' infrastructure of commit
timestamps. Except that there's only 2^16 identifiers, the
infrastructure provided here integrates with logical replication and is
available via SQL. Since the commit timestamp infrastructure has also
been introduced in 9.5 that's not a problem.

For now the number of nodes whose replication progress can be tracked is
determined by the max_replication_slots GUC. It's not perfect to reuse
that GUC, but there doesn't seem to be sufficient reason to introduce a
separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Robert Haas, Heikki Linnakangas, Steve Singer
Discussion: 20150216002155.gi15...@awork2.anarazel.de,
20140923182422.ga15...@alap3.anarazel.de,
20131114172632.ge7...@alap2.anarazel.de
---
 contrib/test_decoding/Makefile |3 +-
 contrib/test_decoding/expected/replident.out   |  127 ++
 contrib/test_decoding/sql/replident.sql|   58 +
 contrib/test_decoding/test_decoding.c  |   28 +
 doc/src/sgml/catalogs.sgml |  124 ++
 doc/src/sgml/filelist.sgml |1 +
 doc/src/sgml/func.sgml |  162 ++-
 doc/src/sgml/logicaldecoding.sgml  |   35 +-
 doc/src/sgml/postgres.sgml |1 +
 doc/src/sgml/replication-identifiers.sgml  |   89 ++
 src/backend/access/rmgrdesc/xactdesc.c |   24 +-
 src/backend/access/transam/commit_ts.c |   53 +-
 src/backend/access/transam/xact.c  |   69 +-
 src/backend/access/transam/xlog.c  |8 +
 src/backend/access/transam/xloginsert.c|   22 +-
 src/backend/access/transam/xlogreader.c|   10 +
 src/backend/catalog/Makefile   |2 +-
 src/backend/catalog/catalog.c  |8 +-
 src/backend/catalog/system_views.sql   |7 +
 src/backend/replication/logical/Makefile   |3 +-
 src/backend/replication/logical/decode.c   |   48 +-
 src/backend/replication/logical/logical.c  |   33 +
 src/backend/replication/logical/reorderbuffer.c|5 +-
 .../replication/logical/replication_identifier.c   | 1300 
 src/backend/storage/ipc/ipci.c |3 +
 src/backend/utils/cache/syscache.c |   23 +
 src/bin/pg_resetxlog/pg_resetxlog.c|5 +
 src/include/access/commit_ts.h |   14 +-
 src/include/access/xact.h  |   11 +
 src/include/access/xlog.h  |1 +
 src/include/access/xlog_internal.h |2 

Re: [HACKERS] Replication identifiers, take 4

2015-03-24 Thread Petr Jelinek

On 24/03/15 16:33, Andres Freund wrote:

Hi,

Here's the next version of this patch. I've tried to address the biggest
issue (documentation) and some more. Now that both the more flexible
commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
looks much cleaner.



Nice, I see you also did the more close integration with CommitTs. I 
only skimmed the patch but so far and it looks quite good, I'll take 
closer look around end of the week.




I'd greatly appreciate some feedback on the documentation. I'm not
entirely sure into how much detail to go; and where exactly in the docs
to place it. I do wonder if we shouldn't merge this with the logical
decoding section and whether we could also document commit timestamps
somewhere in there.


Perhaps we should have some Logical replication developer documentation 
section and put all those three as subsections of that?



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-22 Thread Petr Jelinek

On 22/02/15 09:57, Andres Freund wrote:

On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:

On 16/02/15 10:46, Andres Freund wrote:

On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:


At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.


I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.



I agree that limiting the overhead is important.

But I have related though about this - I wonder if it's worth to try to map
this more directly to the CommitTsNodeId.


Maybe. I'd rather go the other way round though;
replication_identifier.c/h's stuff seems much more generic than
CommitTsNodeId.



Probably not more generic but definitely nicer as the nodes are named 
and yes it has richer API.



I mean it is currently using that
for the actual storage, but I am thinking of having the Replication
Identifiers be the public API and the CommitTsNodeId stuff be just hidden
implementation detail. This thought is based on the fact that CommitTsNodeId
provides somewhat meaningless number while Replication Identifiers give us
nice name for that number. And if we do that then the type should perhaps be
same for both?


I'm not sure. Given that this is included in a significant number of
recordsd I'd really rather not increase the overhead as described
above. Maybe we can just limit CommitTsNodeId to the same size for now?



That would make sense.

I also wonder about the default nodeid infrastructure the committs 
provides. I can't think of use-case which it can be used for and which 
isn't solved by replication identifiers - in the end the main reason I 
added that infra was to make it possible to write something like 
replication identifiers as part of an extension. In any case I don't 
think the default nodeid can be used in parallel with replication 
identifiers since one will overwrite the SLRU record for the other. 
Maybe it's enough if this is documented but I think it might be better 
if this patch removed that default committs nodeid infrastructure. It's 
just few lines of code which nobody should be using yet.


Thinking about this some more and rereading the code I see that you are 
setting TransactionTreeSetCommitTsData during xlog replay, but not 
during transaction commit, that does not seem correct as the local 
records will not have any nodeid/origin.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-22 Thread Peter Eisentraut
On 2/15/15 7:24 PM, Andres Freund wrote:
 On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
 Here's my next attept attempt at producing something we can agree
 upon.

 The major change that might achieve that is that I've now provided a
 separate method to store the origin_id of a node. I've made it
 conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
 paths. That new method uses Heikki's xlog rework to dynamically add the
 origin to the record if a origin is set up. That works surprisingly
 simply.

I'm trying to figure out what this feature is supposed to do, but the
patch contains no documentation or a commit message.  Where is one
supposed to start?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-22 Thread Andres Freund
On 2015-02-22 10:03:59 -0500, Peter Eisentraut wrote:
 On 2/15/15 7:24 PM, Andres Freund wrote:
  On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
  Here's my next attept attempt at producing something we can agree
  upon.
 
  The major change that might achieve that is that I've now provided a
  separate method to store the origin_id of a node. I've made it
  conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
  paths. That new method uses Heikki's xlog rework to dynamically add the
  origin to the record if a origin is set up. That works surprisingly
  simply.
 
 I'm trying to figure out what this feature is supposed to do, but the
 patch contains no documentation or a commit message.  Where is one
 supposed to start?

For a relatively short summary:
http://archives.postgresql.org/message-id/20150216002155.GI15326%40awork2.anarazel.de

For a longer one:
http://www.postgresql.org/message-id/20140923182422.ga15...@alap3.anarazel.de

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-22 Thread Andres Freund
On 2015-02-22 04:59:30 +0100, Petr Jelinek wrote:
 Now that the issue with padding seems to no longer exists since the patch
 works both with and without padding, I went through the code and here are
 some comments I have (in no particular order).
 
 In CheckPointReplicationIdentifier:
 + * FIXME: Add a CRC32 to the end.
 
 The function already does it (I guess you forgot to remove the comment).

Yep. I locally have a WIP version that's much cleaned up and doesn't
contain it anymore.

 Using max_replication_slots as limit for replication_identifier states does
 not really make sense to me as replication_identifiers track remote info
 while and slots are local and in case of master-slave replication you need
 replication identifiers but don't need slots.

On the other hand, it's quite cheap if unused. Not sure if several
variables are worth it.

 In bootstrap.c:
  #define MARKNOTNULL(att) \
  ((att)-attlen  0 || \
   (att)-atttypid == OIDVECTOROID || \
 - (att)-atttypid == INT2VECTOROID)
 + (att)-atttypid == INT2VECTOROID || \
 + strcmp(NameStr((att)-attname), riname) == 0 \
 +)
 
 Huh? Can this be solved in a nicer way?

Yes. I'd mentioned that this is just a temporary hack ;). I've since
pushed a more proper solution; BKI_FORCE_NOT_NULL can now be specified
on the column.

 Since we call XLogFlush with local_lsn as parameter, shouldn't we check that
 it's actually within valid range?
 Currently we'll get errors like this if set to invalid value:
 ERROR:  xlog flush request 123/123 is not satisfied --- flushed only to
 0/168FB18

I think we should rather remove local_lsn from all parameters that are
user callable, adding them there was a mistake. It's really only
relevant for the cases where it's called by commit.

 In AdvanceReplicationIndentifier:
 +/*
 + * XXX: should we restore into a hashtable and dump into shmem only 
 after
 + * recovery finished?
 + */
 
 Probably no given that the function is also callable via SQL interface.

Well, it's still a good idea regardless...

 As I wrote in another email, I would like to integrate the RepNodeId and
 CommitTSNodeId into single thing.

Will reply separately there.

 There are no docs for the new sql interfaces.

Yea. The whole thing needs docs.


Thanks,


Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-22 Thread Andres Freund
On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
 On 16/02/15 10:46, Andres Freund wrote:
 On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
 
 At a quick glance, this basic design seems workable. I would suggest
 expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
 small price to pay, to make it work more like everything else in the system.
 
 I don't know. Growing from 3 to 5 byte overhead per relevant record (or
 even 0 to 5 in case the padding is reused) is rather noticeable. If we
 later find it to be a limit (I seriously doubt that), we can still
 increase it in a major release without anybody really noticing.
 
 
 I agree that limiting the overhead is important.
 
 But I have related though about this - I wonder if it's worth to try to map
 this more directly to the CommitTsNodeId.

Maybe. I'd rather go the other way round though;
replication_identifier.c/h's stuff seems much more generic than
CommitTsNodeId.

 I mean it is currently using that
 for the actual storage, but I am thinking of having the Replication
 Identifiers be the public API and the CommitTsNodeId stuff be just hidden
 implementation detail. This thought is based on the fact that CommitTsNodeId
 provides somewhat meaningless number while Replication Identifiers give us
 nice name for that number. And if we do that then the type should perhaps be
 same for both?

I'm not sure. Given that this is included in a significant number of
recordsd I'd really rather not increase the overhead as described
above. Maybe we can just limit CommitTsNodeId to the same size for now?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-21 Thread Petr Jelinek
Now that the issue with padding seems to no longer exists since the 
patch works both with and without padding, I went through the code and 
here are some comments I have (in no particular order).


In CheckPointReplicationIdentifier:

+ * FIXME: Add a CRC32 to the end.


The function already does it (I guess you forgot to remove the comment).

Using max_replication_slots as limit for replication_identifier states 
does not really make sense to me as replication_identifiers track remote 
info while and slots are local and in case of master-slave replication 
you need replication identifiers but don't need slots.


In bootstrap.c:

 #define MARKNOTNULL(att) \
((att)-attlen  0 || \
 (att)-atttypid == OIDVECTOROID || \
-(att)-atttypid == INT2VECTOROID)
+(att)-atttypid == INT2VECTOROID || \
+strcmp(NameStr((att)-attname), riname) == 0 \
+   )


Huh? Can this be solved in a nicer way?

Since we call XLogFlush with local_lsn as parameter, shouldn't we check 
that it's actually within valid range?

Currently we'll get errors like this if set to invalid value:
ERROR:  xlog flush request 123/123 is not satisfied --- flushed only to 
0/168FB18


In AdvanceReplicationIndentifier:

+   /*
+* XXX: should we restore into a hashtable and dump into shmem only 
after
+* recovery finished?
+*/


Probably no given that the function is also callable via SQL interface.

As I wrote in another email, I would like to integrate the RepNodeId and 
CommitTSNodeId into single thing.


There are no docs for the new sql interfaces.

The replication_identifier.c might deserve some intro/notes text.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-18 Thread Petr Jelinek

On 16/02/15 10:46, Andres Freund wrote:

On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:


At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
small price to pay, to make it work more like everything else in the system.


I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.



I agree that limiting the overhead is important.

But I have related though about this - I wonder if it's worth to try to 
map this more directly to the CommitTsNodeId. I mean it is currently 
using that for the actual storage, but I am thinking of having the 
Replication Identifiers be the public API and the CommitTsNodeId stuff 
be just hidden implementation detail. This thought is based on the fact 
that CommitTsNodeId provides somewhat meaningless number while 
Replication Identifiers give us nice name for that number. And if we do 
that then the type should perhaps be same for both?


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-16 Thread Andres Freund
On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:
 On 02/16/2015 02:21 AM, Andres Freund wrote:
 Furthermore the fact that the origin of records is recorded allows to
 avoid decoding them in logical decoding. That has both efficiency
 advantages (we can do so before they are stored in memory/disk) and
 functionality advantages. Imagine using a logical replication solution
 to replicate inserts to a single table between two databases where
 inserts are allowed on both - unless you prevent the replicated inserts
 from being replicated again you obviously have a loop. This
 infrastructure lets you avoid that.
 
 That makes sense.
 
 How does this work if you have multiple replication systems in use in the
 same cluster? You might use Slony to replication one table to one system,
 and BDR to replication another table with another system. Or the same
 replication software, but different hosts.

It should just work. Replication identifiers are identified by a free
form text, each replication solution can add the
information/distinguising data they need in there.

Bdr uses something like
#define BDR_NODE_ID_FORMAT bdr_UINT64_FORMAT_%u_%u_%u_%s
with
remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
as parameters as a replication identifier name.

I've been wondering whether the bdr_ part in the above should be split
of into a separate field, similar to how the security label stuff does
it. But I don't think it'd really buy us much, especially as we did
not do that for logical slot names.

Each of the used replication solution would probably ask their output
plugin to only stream locally generated (i.e. origin_id =
InvalidRepNodeId) changes, and possibly from a defined list of other
known hosts in the cascading case.

Does that answer your question?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-16 Thread Heikki Linnakangas

On 02/16/2015 02:21 AM, Andres Freund wrote:

Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.


That makes sense.

How does this work if you have multiple replication systems in use in 
the same cluster? You might use Slony to replication one table to one 
system, and BDR to replication another table with another system. Or the 
same replication software, but different hosts.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-16 Thread Heikki Linnakangas

On 02/16/2015 11:18 AM, Andres Freund wrote:

On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:

How does this work if you have multiple replication systems in use in the
same cluster? You might use Slony to replication one table to one system,
and BDR to replication another table with another system. Or the same
replication software, but different hosts.


It should just work. Replication identifiers are identified by a free
form text, each replication solution can add the
information/distinguising data they need in there.

Bdr uses something like
#define BDR_NODE_ID_FORMAT bdr_UINT64_FORMAT_%u_%u_%u_%s
with
remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
as parameters as a replication identifier name.

I've been wondering whether the bdr_ part in the above should be split
of into a separate field, similar to how the security label stuff does
it. But I don't think it'd really buy us much, especially as we did
not do that for logical slot names.

Each of the used replication solution would probably ask their output
plugin to only stream locally generated (i.e. origin_id =
InvalidRepNodeId) changes, and possibly from a defined list of other
known hosts in the cascading case.

Does that answer your question?


Yes, thanks. Note to self and everyone else looking at this: It's 
important to keep in mind is that the replication IDs are completely 
internal to the local cluster. They are *not* sent over the wire.


That's not completely true if you also use physical replication, though. 
The replication IDs will be included in the WAL stream. Can you have 
logical decoding running in a hot standby server? And how does the 
progress report stuff that's checkpointed in pg_logical/checkpoints work 
in a hot standby? (Sorry if I'm not making sense, I haven't really 
thought hard about this myself)


At a quick glance, this basic design seems workable. I would suggest 
expanding the replication IDs to regular 4 byte oids. Two extra bytes is 
a small price to pay, to make it work more like everything else in the 
system.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-16 Thread Andres Freund
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
 Yes, thanks. Note to self and everyone else looking at this: It's important
 to keep in mind is that the replication IDs are completely internal to the
 local cluster. They are *not* sent over the wire.

Well, if you *want* to, you could send the external (free form text)
replication identifiers over the wire in the output plugin. There are
scenarios where that might make sense.

 That's not completely true if you also use physical replication, though. The
 replication IDs will be included in the WAL stream.

Right. But then a physical rep server isn't realy a different server.

 Can you have logical decoding running in a hot standby server?

Not at the moment, there's some minor stuff missing (following
timelines, enforcing tuple visibility on the primary).

 And how does the progress report stuff that's checkpointed in
 pg_logical/checkpoints work in a hot standby?  (Sorry if I'm not
 making sense, I haven't really thought hard about this myself)

It doesn't work that greatly atm, they'd need to be WAL logged for that
- which would not be hard. It'd be better to include the information in
the checkpoint, but that unfortunately doesn't really work, because we
store the checkpoint in the control file. And that has to be =
512 bytes.

What, I guess, we could do is log it in the checkpoint, after
determining the redo pointer, and store the LSN for it in the checkpoint
record proper. That'd mean we'd read one more record at startup, but
that isn't particularly bad.

 At a quick glance, this basic design seems workable. I would suggest
 expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
 small price to pay, to make it work more like everything else in the system.

I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Replication identifiers, take 4

2015-02-15 Thread Andres Freund
Hi,

Here's my next attept attempt at producing something we can agree
upon.

The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.

Other changes:

* Locking preventing several backends to replay changes at the same
  time. This is actually overly restrictive in some cases, but I think
  good enough for now.
* Logical decoding grew a filter_by_origin callback that allows to
  ignore changes that were replayed on a remote system. Such filters are
  executed before much is done with records, potentially saving a fair
  bit of costs.
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication
  identifier).

I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:

Think about a logical replication solution trying to replay changes. The
postmaster in which the data is replayed into crashes every now and
then. Replication identifiers allow you to do something like:

do_replication()
{
source = ConnectToSourceSystem('mysource');
target = ConnectToSourceSystem('target');

# mark we're replayin
target.exec($$SELECT 
pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);
# get how far we've replayed last time round
remote_lsn = target.exec($$SELECT remote_lsn FROM  
pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource');

# and now replay changes
copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', 
remote_lsn);

while (record = copystream.get_record())
{
   if (record.type = 'begin')
   {
  target.exec('BEGIN');
  # setup the position of this individual xact
  target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, 
$2);',
  record.origin_lsn, record.origin_commit_timestamp);
   }
   else if (record.type = 'change')
  target.exec(record.change_sql)
   else if (record.type = 'commit')
  target.exec('COMMIT');
}
}

A non pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes the
replication identifier logic will recover how far we replayed during
crash recovery. If the source system crashes/disconnects we'll have the
current value in memory. Note that this works perfectly well if the
target system (and obviously the source system, but that's obvious) use
synchronous_commit = off - we'll not miss any changes.

Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.

The SQL interface consists out of:
# manage existance of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pgreplication_identifier_progress;

# replicatation identifiers
SELECT * FROM pg_replication_identifier;


Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a
SQL reimplementation of replication identifiers and that has proven that
for busier workloads doing a table update to store the replication
progress indeed has a noticeable overhead. Especially if there's some
longer running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the
   padding is fine. Now that I have proven that it's pretty simple to
   store additional information the argument that it might be needed for
   something else doesn't really hold anymore. But I can live with the
   other solution as well - 3 bytes additional overhead ain't so bad.

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do
   we want to store the origin only on relevant records? That'd be
   XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE //
   XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something
   

Re: [HACKERS] Replication identifiers, take 4

2015-02-15 Thread Andres Freund
On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
 Here's my next attept attempt at producing something we can agree
 upon.
 
 The major change that might achieve that is that I've now provided a
 separate method to store the origin_id of a node. I've made it
 conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
 paths. That new method uses Heikki's xlog rework to dynamically add the
 origin to the record if a origin is set up. That works surprisingly
 simply.

 Other changes:
 
 * Locking preventing several backends to replay changes at the same
   time. This is actually overly restrictive in some cases, but I think
   good enough for now.
 * Logical decoding grew a filter_by_origin callback that allows to
   ignore changes that were replayed on a remote system. Such filters are
   executed before much is done with records, potentially saving a fair
   bit of costs.
 * Rebased. That took a bit due the xlog and other changes.
 * A couple more SQL interface functions (like dropping a replication
   identifier).

And here an actual patch.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 268d52cac6bf7fe1c019fd68248853c7c7ae18b1 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 16 Feb 2015 01:22:08 +0100
Subject: [PATCH] Introduce replication identifiers to keep track of
 replication progress: v0.6

---
 contrib/test_decoding/Makefile |3 +-
 contrib/test_decoding/expected/replident.out   |   84 ++
 contrib/test_decoding/sql/replident.sql|   40 +
 contrib/test_decoding/test_decoding.c  |   28 +
 src/backend/access/rmgrdesc/xactdesc.c |   17 +-
 src/backend/access/transam/xact.c  |   64 +-
 src/backend/access/transam/xlog.c  |   34 +-
 src/backend/access/transam/xloginsert.c|   22 +-
 src/backend/access/transam/xlogreader.c|   10 +
 src/backend/bootstrap/bootstrap.c  |5 +-
 src/backend/catalog/Makefile   |2 +-
 src/backend/catalog/catalog.c  |8 +-
 src/backend/catalog/system_views.sql   |7 +
 src/backend/replication/logical/Makefile   |3 +-
 src/backend/replication/logical/decode.c   |   63 +-
 src/backend/replication/logical/logical.c  |5 +
 src/backend/replication/logical/reorderbuffer.c|5 +-
 .../replication/logical/replication_identifier.c   | 1190 
 src/backend/storage/ipc/ipci.c |3 +
 src/backend/utils/cache/syscache.c |   23 +
 src/backend/utils/misc/guc.c   |1 +
 src/bin/initdb/initdb.c|1 +
 src/bin/pg_resetxlog/pg_resetxlog.c|3 +
 src/include/access/xact.h  |   10 +-
 src/include/access/xlog.h  |1 +
 src/include/access/xlogdefs.h  |6 +
 src/include/access/xlogreader.h|9 +
 src/include/access/xlogrecord.h|5 +-
 src/include/catalog/indexing.h |6 +
 src/include/catalog/pg_proc.h  |   28 +
 src/include/catalog/pg_replication_identifier.h|   75 ++
 src/include/pg_config_manual.h |6 +
 src/include/replication/output_plugin.h|8 +
 src/include/replication/reorderbuffer.h|8 +-
 src/include/replication/replication_identifier.h   |   58 +
 src/include/utils/syscache.h   |2 +
 src/test/regress/expected/rules.out|5 +
 src/test/regress/expected/sanity_check.out |1 +
 38 files changed, 1816 insertions(+), 33 deletions(-)
 create mode 100644 contrib/test_decoding/expected/replident.out
 create mode 100644 contrib/test_decoding/sql/replident.sql
 create mode 100644 src/backend/replication/logical/replication_identifier.c
 create mode 100644 src/include/catalog/pg_replication_identifier.h
 create mode 100644 src/include/replication/replication_identifier.h

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 438be44..f8334cc 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -37,7 +37,8 @@ submake-isolation:
 submake-test_decoding:
 	$(MAKE) -C $(top_builddir)/contrib/test_decoding
 
-REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared
+REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \
+	binary prepared replident
 
 regresscheck: all | submake-regress submake-test_decoding
 	$(MKDIR_P) regression_output
diff --git a/contrib/test_decoding/expected/replident.out b/contrib/test_decoding/expected/replident.out
new file mode 100644
index 000..1c508a5
--- /dev/null
+++